English Pages 448 Year 2009
QUANTITATIVE DATA ANALYSIS: Doing Social Research to Test Ideas
DONALD J. TREIMAN
Copyright © 2009 by John Wiley & Sons, Inc. All rights reserved.
Published by Jossey-Bass
San Francisco, CA 94103 www.josseybass.com
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, 201-748-6011, fax 201-748-6008, or online at www.wiley.com/go/permissions.
Readers should be aware that Internet Web sites offered as citations or sources for further information may have changed or disappeared between the time this was written and when it is read.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Jossey-Bass books and products are available through most bookstores. To contact Jossey-Bass directly call our Customer Care Department within the United States at (800) 956-7739, outside the United States at (317) 572-3986, or via fax at (317) 572-4002.
Jossey-Bass also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Library of Congress Cataloging-in-Publication Data
Treiman, Donald J.
Quantitative data analysis : doing social research to test ideas / Donald J. Treiman.
p. cm.
1. Social sciences--Research--Statistical methods. 2. Sociology--Research--Statistical methods. 3. Sociology--Statistical methods--Computer programs. 4. Social sciences--Statistical methods--Computer programs. 5. Stata. I. Title.
HA29.T675 2008
300.72--dc22
Printed in the United States of America
FIRST EDITION
PB Printing 10 9 8 7 6 5 4 3 2 1
20080131:v
CONTENTS
Tables, Figures, Exhibits, and Boxes
Preface
The Author
Introduction
1 CROSS-TABULATIONS
    What This Chapter Is About
    Introduction to the Book via a Concrete Example
    Cross-Tabulations
    What This Chapter Has Shown
2 MORE ON TABLES
    What This Chapter Is About
    The Logic of Elaboration
    Suppressor Variables
    Additive and Interaction Effects
    Direct Standardization
    A Final Note on Statistical Controls Versus Experiments
    What This Chapter Has Shown
3 STILL MORE ON TABLES
    What This Chapter Is About
    Reorganizing Tables to Extract New Information
    When to Percentage a Table "Backwards"
    Cross-Tabulations in Which the Dependent Variable Is Represented by a Mean
    Index of Dissimilarity
    Writing About Cross-Tabulations
    What This Chapter Has Shown
4 ON THE MANIPULATION OF DATA BY COMPUTER
    What This Chapter Is About
    Introduction
    How Data Files Are Organized
    Transforming Data
    What This Chapter Has Shown
    Appendix 4.A Doing Analysis Using Stata
        Tips on Doing Analysis Using Stata
        Some Particularly Useful Stata 10.0 Commands
5 INTRODUCTION TO CORRELATION AND REGRESSION (ORDINARY LEAST SQUARES)
    What This Chapter Is About
    Introduction
    Quantifying the Size of a Relationship: Regression Analysis
    Assessing the Strength of a Relationship: Correlation Analysis
    The Relationship Between Correlation and Regression Coefficients
    Factors Affecting the Size of Correlation (and Regression) Coefficients
    Correlation Ratios
    What This Chapter Has Shown
6 INTRODUCTION TO MULTIPLE CORRELATION AND REGRESSION (ORDINARY LEAST SQUARES)
    What This Chapter Is About
    Introduction
    A Worked Example: The Determinants of Literacy in China
    Dummy Variables
    A Strategy for Comparisons Across Groups
    A Bayesian Alternative for Comparing Models
    Independent Validation
    What This Chapter Has Shown
7 MULTIPLE REGRESSION TRICKS: TECHNIQUES FOR HANDLING SPECIAL ANALYTIC PROBLEMS
    What This Chapter Is About
    Nonlinear Transformations
    Testing the Equality of Coefficients
    Trend Analysis: Testing the Assumption of Linearity
    Linear Splines
    Expressing Coefficients as Deviations from the Grand Mean (Multiple Classification Analysis)
    Other Ways of Representing Dummy Variables
    Decomposing the Difference Between Two Means
    What This Chapter Has Shown
8 MULTIPLE IMPUTATION OF MISSING DATA
    What This Chapter Is About
    Introduction
    A Worked Example: The Effect of Cultural Capital on Educational Attainment in Russia
    What This Chapter Has Shown
9 SAMPLE DESIGN AND SURVEY ESTIMATION
    What This Chapter Is About
    Survey Samples
    Conclusion
    What This Chapter Has Shown
10 REGRESSION DIAGNOSTICS
    What This Chapter Is About
    Introduction
    A Worked Example: Societal Differences in Status Attainment
    Robust Regression
    Bootstrapping and Standard Errors
    What This Chapter Has Shown
11 SCALE CONSTRUCTION
    What This Chapter Is About
    Introduction
    Validity
    Reliability
    Scale Construction
    Errors-in-Variables Regression
    What This Chapter Has Shown
12 LOG-LINEAR ANALYSIS
    What This Chapter Is About
    Introduction
    Choosing a Preferred Model
    Parsimonious Models
    A Bibliographic Note
    What This Chapter Has Shown
    Appendix 12.A Derivation of the Effect Parameters
    Appendix 12.B Introduction to Maximum Likelihood Estimation
        Mean of a Normal Distribution
        Log-Linear Parameters
13 BINOMIAL LOGISTIC REGRESSION
    What This Chapter Is About
    Introduction
    Relation to Log-Linear Analysis
    A Worked Logistic Regression Example: Predicting Prevalence of Armed Threats
    A Second Worked Example: Schooling Progression Ratios in Japan
    A Third Worked Example (Discrete-Time Hazard-Rate Models): Age at First Marriage
    A Fourth Worked Example (Case-Control Models): Who Was Appointed to a Nomenklatura Position in Russia?
    What This Chapter Has Shown
    Appendix 13.A Some Algebra for Logs and Exponents
    Appendix 13.B Introduction to Probit Analysis
14 MULTINOMIAL AND ORDINAL LOGISTIC REGRESSION AND TOBIT REGRESSION
    What This Chapter Is About
    Multinomial Logit Analysis
    Ordinal Logistic Regression
    Tobit Regression (and Allied Procedures) for Censored Dependent Variables
    Other Models for the Analysis of Limited Dependent Variables
    What This Chapter Has Shown
15 IMPROVING CAUSAL INFERENCE: FIXED EFFECTS AND RANDOM EFFECTS MODELING
    What This Chapter Is About
    Introduction
    Fixed Effects Models for Continuous Variables
    Random Effects Models for Continuous Variables
    A Worked Example: The Determinants of Income in China
    Fixed Effects Models for Binary Outcomes
    A Bibliographic Note
    What This Chapter Has Shown
16 FINAL THOUGHTS AND FUTURE DIRECTIONS: RESEARCH DESIGN AND INTERPRETATION ISSUES
    What This Chapter Is About
    Research Design Issues
    The Importance of Probability Sampling
    A Final Note: Good Professional Practice
    What This Chapter Has Shown
Appendix A: Data Descriptions and Download Locations for the Data Used in This Book
Appendix B: Survey Estimation with the General Social Survey
References
Index
TABLES, FIGURES, EXHIBITS, AND BOXES
TABLES
1.1. Joint Frequency Distribution of Militancy by Religiosity Among Urban Negroes in the U.S., 1964.
1.2. Percent Militant by Religiosity Among Urban Negroes in the U.S., 1964.
1.3. Percentage Distribution of Religiosity by Educational Attainment, Urban Negroes in the U.S., 1964.
1.4. Percent Militant by Educational Attainment, Urban Negroes in the U.S., 1964.
1.5. Percent Militant by Religiosity and Educational Attainment, Urban Negroes in the U.S., 1964.
1.6. Percent Militant by Religiosity and Educational Attainment, Urban Negroes in the U.S., 1964 (Three-Dimensional Format).
2.1. Percentage Who Believe Legal Abortions Should Be Possible Under Specified Circumstances, by Religion and Education, U.S. 1965 (N = 1,368; Cell Frequencies in Parentheses).
2.2. Percentage Accepting Abortion by Religion and Education (Hypothetical Data).
2.3. Percent Militant by Religiosity, and Percent Militant by Religiosity Adjusting (Standardizing) for Religiosity Differences in Educational Attainment, Urban Negroes in the U.S., 1964 (N = 993).
2.4. Percentage Distribution of Beliefs Regarding the Scientific View of Evolution (U.S. Adults, 1993, 1994, and 2000).
2.5. Percentage Accepting the Scientific View of Evolution by Religious Denomination (N = 3,663).
2.6. Percentage Accepting the Scientific View of Evolution by Level of Education.
2.7. Percentage Accepting the Scientific View of Evolution by Age.
2.8. Percentage Distribution of Educational Attainment by Religion.
2.9. Percentage Distribution of Age by Religion.
2.10. Joint Probability Distribution of Education and Age.
2.11. Percentage Accepting the Scientific View of Evolution by Religion, Age, and Sex (Percentage Bases in Parentheses).
2.12. Observed Proportion Accepting the Scientific View of Evolution, and Proportion Standardized for Education and Age.
2.13. Percentage Distribution of Occupational Groups by Race, South African Males Age 20-69, Early 1990s (Percentages Shown Without Controls and Also Directly Standardized for Racial Differences in Educational Attainment; N = 4,004).
2.14. Mean Number of Chinese Characters Known (Out of 10), for Urban and Rural Residents Age 20-69, China 1996 (Means Shown Without Controls and Also Directly Standardized for Urban-Rural Differences in Distribution of Education; N = 6,081).
3.1. Frequency Distribution of Acceptance of Abortion by Religion and Education, U.S. Adults, 1965 (N = 1,368).
3.2. Social Origins of Nobel Prize Winners (1901-1972) and Other U.S. Elites (and, for Comparison, the Occupations of Employed Males 1900-1920).
3.3. Mean Annual Income in 1979 Among Those Working Full Time in 1980, by Education and Gender, U.S. Adults (Category Frequencies Shown in Parentheses).
3.4. Means and Standard Deviations of Income in 1979 by Education and Gender, U.S. Adults, 1980.
3.5. Median Annual Income in 1979 Among Those Working Full Time in 1980, by Education and Gender, U.S. Adults (Category Frequencies Shown in Parentheses).
3.6. Percentage Distribution Over Major Occupation Groups by Race and Sex, U.S. Labor Force, 1979 (N = 96,945).
5.1. Mean Number of Positive Responses to an Acceptance of Abortion Scale (Range: 0-7), by Religion, U.S. Adults, 2006.
6.1. Means, Standard Deviations, and Correlations Among Variables Affecting Knowledge of Chinese Characters, Employed Chinese Adults Age 20-69, 1996 (N = 4,802).
6.2. Determinants of the Number of Chinese Characters Correctly Identified on a Ten-Item Test, Employed Chinese Adults Age 20-69, 1996 (Standard Errors in Parentheses).
6.3. Coefficients of Models of Acceptance of Abortion, U.S. Adults, 1974 (Standard Errors Shown in Parentheses); N = 1,481.
6.4. Goodness-of-Fit Statistics for Alternative Models of the Relationship Among Religion, Education, and Acceptance of Abortion, U.S. Adults, 1973 (N = 1,499).
7.1. Demonstration That Inclusion of a Linear Term Does Not Affect Predicted Values.
7.2. Coefficients for a Linear Spline Model of Trends in Years of School Completed by Year of Birth, U.S. Adults Age 25 and Older, and Comparisons with Other Models (Pooled Data for 1972-2004, N = 39,324).
7.3. Goodness-of-Fit Statistics for Models of Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling, with Various Specifications of the Effect of the Cultural Revolution (Those Affected by the Cultural Revolution Are Defined as People Turning Age 11 During the Period 1966 through 1977), Chinese Adults Age 20 to 69 in 1996 (N = 6,086).
7.4. Coefficients for Models 4, 5, and 7 Predicting Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling (p Values in Parentheses).
7.5. Coefficients of Models of Tolerance of Atheists, U.S. Adults, 2000 to 2004 (N = 4,299).
7.6. Design Matrices for Alternative Ways of Coding Categorical Variables (See Text for Details).
7.7. Coefficients for a Model of the Determinants of Vocabulary Knowledge, U.S. Adults, 1994 (N = 1,757; R² = .2445; Wald Test That Categorical Variables All Equal Zero: F = 12.48; p ...).
Frequency Distribution of Occupation by Father's Occupation, Chinese Adults, 1996.
Interaction Parameters for the Saturated Model Applied to Table 12.9.
Goodness-of-Fit Statistics for Alternative Models of Intergenerational Occupational Mobility in China (Six-by-Six Table).
Frequency Distribution of Educational Attainment by Size of Place of Residence at Age Fourteen, Chinese Adults Not Enrolled in School, 1996.
13.1. Percentage Ever Threatened by a Gun, by Selected Variables, U.S. Adults, 1973 to 1994 (N = 19,260).
13.2. Goodness-of-Fit Statistics for Various Models Predicting the Prevalence of Armed Threat to U.S. Adults, 1973 to 1994.
13.3. Effect Parameters for Models 2 and 4 of Table 13.2.
13.4. Goodness-of-Fit Statistics for Various Models of the Process of Educational Transition in Japan (Preferred Model Shown in Boldface).
13.5. Effect Parameters for Model 3 of Table 13.4.
13.6. Odds Ratios for a Model Predicting the Likelihood of Marriage from Age at Risk, Sex, Race, and Mother's Education, with Interactions Between Age at Risk and the Other Variables.
13.7. Coefficients for a Model of Determinants of Nomenklatura Membership, Russia, 1988.
Effect Parameters for a Probit Analysis of Gun Threat (Corresponding to Models 2 and 4 of Table 13.3).
14.1. Effect Parameters for a Model of the Determinants of English and Russian Language Competence in the Czech Republic, 1993 (N = 3,945). (Standard Errors in Parentheses; p Values in Italic.)
14.2. Effect Parameters for an Ordered Logit Model of Political Party Identification, U.S. Adults, 1998 (N = 2,443).
14.3. Predicted Probability Distributions of Party Identification for Black and Non-Black Males Living in Large Central Cities of Non-Southern SMSAs and Earning $40,000 to $50,000 per Year.
14.4. Effect Parameters for a Generalized Ordered Logit Model of Political Party Identification, U.S. Adults, 1998.
14.5. Effect Parameters for an Ordinary Least-Squares Regression Model of Political Party Identification, U.S. Adults, 1998.
14.6. Codes for Frequency of Sex in the Past Year, U.S. Adults, 2000.
14.7. Alternative Estimates of a Model of Frequency of Sex, U.S. Adults, 2000 (N = 2,258). (Standard Errors in Parentheses; All Coefficients Are Significant at .001 or Beyond.)
15.1. Socioeconomic Characteristics of Chinese Adults by Size of Place of Residence, 1996.
15.2. Comparison of OLS and FE Estimates for a Model of the Determinants of Family Income, Chinese RMB, 1996 (N = 5,342).
15.3. Comparison of OLS and FE Estimates for a Model of the Effect of Migration and Remittances on South African Black Children's School Enrollment, 2002 to 2003. (N(FE) = 2,408 Children; N(full RE) = 12,043 Children.)
FIGURES
2.1. The Observed Association Between X and Y Is Entirely Spurious and Goes to Zero When Z Is Controlled.
2.2. The Observed Association Between X and Y Is Partly Spurious: the Effect of X on Y Is Reduced When Z Is Controlled (Z Affects X and Both Z and X Affect Y).
2.3. The Observed Association Between X and Y Is Entirely Explained by the Intervening Variable Z and Goes to Zero When Z Is Controlled.
2.4. The Observed Association Between X and Y Is Partly Explained by the Intervening Variable Z: the Effect of X on Y Is Reduced When Z Is Controlled (X Affects Z, and Both X and Z Affect Y).
2.5. Both X and Z Affect Y, but There Is No Assumption Regarding the Causal Ordering of X and Z.
2.6. The Size of the Zero-Order Association Between X and Y (and Between Z and Y) Is Suppressed When the Effects of X on Z and Y Have Opposite Sign, and the Effects of X and Z on Y Have Opposite Sign.
4.1. An IBM Punch Card.
5.1. Scatter Plot of Years of Schooling by Father's Years of Schooling (Hypothetical Data, N = 10).
5.2. Least-Squares Regression Line of the Relation Between Years of Schooling and Father's Years of Schooling.
5.3. Least-Squares Regression Line of the Relation Between Years of Schooling and Father's Years of Schooling, Showing How the "Error of Prediction" or "Residual" Is Defined.
5.4. Least-Squares Regression Lines for Three Configurations of Data: (a) Perfect Independence, (b) Perfect Correlation, and (c) Perfect Curvilinear Correlation (a Parabola Symmetrical to the X-Axis).
5.5. The Effect of a Single Deviant Case (High Leverage Point).
5.6. Truncating Distributions Reduces Correlations.
5.7. The Effect of Aggregation on Correlations.
6.1. Three-Dimensional Representation of the Relationship Between Number of Siblings, Father's Years of Schooling, and Respondent's Years of Schooling (Hypothetical Data; N = 10).
6.2. Expected Number of Chinese Characters Identified (Out of Ten) by Years of Schooling and Gender, Urban Origin Chinese Adults Age 20 to 69 in 1996 with Nonmanual Occupations and with Years of Father's Schooling and Level of Cultural Capital Set at Their Means (N = 4,802). (Note: the female line does not extend beyond 16 because there are no females in the sample with postgraduate education.)
6.3. Acceptance of Abortion by Education and Religious Denomination, U.S. Adults, 1974 (N = 1,481).
7.1. The Relationship Between 2003 Income and Age, U.S. Adults Age Twenty to Sixty-Four in 2004 (N = 1,573).
7.2. Expected ln(Income) by Years of School Completed, U.S. Males and Females, 2004, with Hours Worked per Week Fixed at the Mean for Both Sexes Combined (42.7; N = 1,459).
7.3. Expected Income by Years of School Completed, U.S. Males and Females, 2004, with Hours Worked per Week Fixed at the Mean for Both Sexes Combined (42.7).
7.4. Trend in Attitudes Regarding Gender Equality, U.S. Adults Surveyed in 19?? Through 1998 (Linear Trend and Annual Means; N = 21,464).
7.5. Years of School Completed by Year of Birth, U.S. Adults (Pooled Samples from the 1972 Through 2004 GSS; N = 39,324; Scatter Plot Shown for 5 Percent Sample).
7.6. Mean Years of Schooling by Year of Birth, U.S. Adults (Same Data as for Figure 7.5).
7.7. Three-Year Moving Average of Years of Schooling by Year of Birth, U.S. Adults (Same Data as for Figure 7.5).
7.8. Trend in Years of School Completed by Year of Birth, U.S. Adults (Same Data as for Figure 7.5). Predicted Values from a Linear Spline with a Knot at 1947.
7.9. Graphs of Three Models of the Effect of the Cultural Revolution on Vocabulary Knowledge, Holding Constant Education (at Twelve Years), Chinese Adults, 1996 (N = 6,086).
7.10. Figure 7.9 Rescaled to Show the Entire Range of the Y-Axis.
10.1. Four Scatter Plots with Identical Lines.
10.2. Scatter Plot of the Relationship Between X and Y and Also the Regression Line from a Model That Incorrectly Assumes a Linear Relationship Between X and Y (Hypothetical Data).
10.3. Years of School Completed by Number of Siblings, U.S. Adults, 1994 (N = 2,992).
10.4. Years of School Completed by Number of Siblings, U.S. Adults, 1994.
10.5. A Plot of Leverage Versus Squared Normalized Residuals for Equation 7 in Treiman and Yip (1989).
10.6. A Plot of Leverage Versus Studentized Residuals for Treiman and Yip's Equation 7, with Circles Proportional to the Size of Cook's D.
10.7. Added-Variable Plots for Treiman and Yip's Equation 7.
10.8. Residual-Versus-Fitted Plot for Treiman and Yip's Equation 7.
10.9. Augmented Component-Plus-Residual Plots for Treiman and Yip's Equation 7.
10.10. Objective Functions for Three M Estimators: (a) OLS Objective Function, (b) Huber Objective Function, and (c) Bi-Square Objective Function.
10.11. Sampling Distributions of Bootstrapped Coefficients (2,000 Repetitions) for the Expanded Model, Estimated by Robust Regression on Seventeen Countries.
11.1. Loadings of the Seven Abortion-Acceptance Items on the First Two Factors, Unrotated and Rotated 30 Degrees Counterclockwise.
13.1. Expected Probability of Marrying for the First Time by Age at Risk, U.S. Adults, 1994 (N = 1,556).
13.2. Expected Probability of Marrying for the First Time by Age at Risk (Range: Fifteen to Thirty-Six), Discrete-Time Model, U.S. Adults, 1994.
13.3. Expected Probability of Marrying for the First Time by Age at Risk (Range: Fifteen to Thirty-Six), Polynomial Model, U.S. Adults, 1994.
13.4. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Non-Black U.S. Adults, 1994.
13.5. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Black U.S. Adults, 1994.
13.B.1. Probabilities Associated with Values of Probit and Logit Coefficients.
14.1. Three Estimates of the Expected Frequency of Sex per Year, U.S. Married Women, 2000 (N = 552).
14.2. Expected Frequency of Sex per Year by Gender and Marital Status, U.S. Adults, 2000 (N = 2,258).
16.1. 1980 Male Disability by Quarter of Birth (Prevented from Work by a Physical Disability).
16.2. Blau and Duncan's Basic Model of the Process of Stratification.
EXHIBITS
4.1. Illustration of How Data Files Are Organized.
4.2. A Codebook Corresponding to Exhibit 4.1.
BOXES
Open-Ended Questions
Samuel A. Stouffer
Technical Points on Table 1.1
Technical Points on Table 1.2
Technical Points on Table 1.3
Technical Points on Table 1.4
Technical Points on Table 1.5
Technical Points on Table 1.6
Paul Lazarsfeld
Hans Zeisel
Stata do Files and log Files
Direct Standardization in Earlier Survey Research
The Weakness of Matching and a Useful Fix
Technical Points on Table 3.3
Substantive Points on Table 3.3
A Historical Note on Social Science Computer Packages
Herman Hollerith
The Way Things Were
Treating Missing Values as If They Were Not
People Generally Like to Respond to (Well-Designed and Well-Administered) Surveys
Why Use the "Least Squares" Criterion to Determine the Best-Fitting Line?
Karl Pearson
A Useful Computational Formula for r
A "Real Data" Example of the Effect of Truncating the Distribution
A Useful Computational Formula for η²
Multicollinearity
Reminder Regarding the Variance of Dichotomous Variables
A Formula for Computing R² from Correlations
Adjusted R²
Always Present Descriptive Statistics
Technical Point on Table 6.2
Why You Should Include the Entire Sample in Your Analysis
Getting p-Values via Stata
Using Stata to Compare the Goodness-of-Fit of Regression Models
R. A. (Ronald Aylmer) Fisher
How to Test the Significance of the Difference Between Two Coefficients
Alternative Ways to Estimate BIC
Why the Relationship Between Income and Age Is Curvilinear
A Trick to Reduce Collinearity
In Some Years of the GSS, Only a Subset of Respondents Was Asked Certain Questions
An Alternative Specification of Spline Functions
Why Black versus Non-Black Is Better Than White versus Non-White for Social Analysis in the United States
A Comment on Credit in Science
Why Pairwise Deletion Should Be Avoided
Technical Details on the Variables
Telephone Surveys
Mail Surveys
Web Surveys
Philip M. Hauser
A Superior Sampling Procedure
Sources of Nonresponse
Leslie Kish
How the Chinese Stratified Sample Used in the Design Experiments Was Constructed
Weighting Data in Stata
Limitations of the Stata 10.0 Survey Estimation Procedure
An Alternative to Survey Estimation
How to Downweight Sample Size in Stata
Ways to Assess Reliability
Why the SAT and GRE Tests Include Several Hundred Items
Transforming Variables so That "High" Has a Consistent Meaning
Constructing Scales from Incomplete Information
In Log-Linear Analysis "Interaction" Simply Means "Association"
L² Defined
Other Software for Estimating Log-Linear Models
Maximum Likelihood Estimation
Probit Analysis
Technical Point on Table 13.1
Limitations of Wald Tests
Smoothing Distributions
Estimating Generalized Ordered Logit Models with Stata
James Tobin
Panel Surveys in the Public Domain
Otis Dudley Duncan
Sewall Wright
Ask a Foreigner to Do It
George Peter Murdock
In the United States, Publicly Funded Studies Must Be Made Available to the Research Community
An "Available from Author" Archive
PREFACE
This is a book about how to conduct theoretically informed quantitative social research; that is, social research to test ideas. It derives from a course for graduate students in sociology and other social sciences and social-science-based professional schools (public health, education, social welfare, urban planning, and so on) that I have been teaching at UCLA for some thirty years. The course has evolved as quantitative methods in the social sciences have advanced; early versions of the course were based on the first half of this book (through Chapter Seven), with additional materials added over the years. Interestingly, I have been able to retain the same format (a twenty-week course with one three-hour lecture per week and a weekly exercise, culminating in a term paper written during the last four weeks of the course) from the outset, which is, I suppose, a tribute to the increasing level of preparation and quantitative competence of graduate students in the social sciences. The book owes much to lively class discussions over the years, of subtle and complex methodological points.
By the end of the book, you should know how to make substantive sense of a body of quantitative data. That is, you should be well prepared to produce publishable papers in your field, as well as first-rate dissertation chapters. Of course, there is always more to learn. In the final chapter (Chapter Sixteen), I discuss advanced topics that go beyond what can be covered in a first course in data analysis.
The focus is on the analysis of data from representative samples of well-defined populations, although some exceptions are considered. The populations can consist of almost anything (people, formal organizations, societies, occupations, pottery shards, or whatever); the analytic issues are essentially the same. Data collection procedures are mentioned only in passing. There simply is not enough space in an already lengthy book to do justice to both data analysis and data collection. Thus, you will need to look elsewhere for systematic instruction on data collection procedures. A strong case can be made that you should do this after rather than before a course on data analysis because the main problem in designing a data collection effort is deciding what to collect, which means you first need to know how you will conduct your analysis. An alternative method of learning about the practical details of data collection is to become an apprentice (unpaid, if necessary) to someone who is about to conduct a survey and insist that you get to participate in it step by step, even when your presence is a nuisance.
This book covers a variety of techniques, including tabular analysis, log-linear models for tabular data, regression analysis in its various forms, regression diagnostics and robust regression, ways to cope with missing data, logistic regression, factor-based and other techniques for scale construction, and fixed- and random-effects models as a way to make causal inferences. But this is not a statistics book; the emphasis is on using these procedures to draw substantive conclusions about how the social world works. Accordingly, the book is designed for a course to be taken after a first-year graduate statistics course in the social sciences. Although there are many equations in the book, this is because it is necessary to understand how statistical procedures work to use them intelligently. Because the emphasis is on applications, there are many worked examples, often adapted from my own research. In addition to data from sample surveys I have conducted, I also rely heavily on the General Social Survey, an omnibus survey designed for use by the research community and also for teaching. Appendix A describes the main data sets used for the substantive examples and provides information on how to obtain them; they are all available without cost.
The only prerequisites for successful use of this book are a prior graduate-level social science statistics course, a willingness to think carefully and work hard, and the ability to do high school algebra, either remembered or relearned. With only a handful of exceptions (references at one or two points to calculus and to matrix algebra), no mathematics beyond high school algebra is used. If your high school algebra is rusty, you can find good reviews in Helen Walker, Mathematics Essential for Elementary Statistics, and W. L. Bashaw, Mathematics for Statistics. These books have been around forever. Although more recent equivalents probably exist, school algebra has not changed, so it hardly matters. Copies of these books are readily available at amazon.com, and probably many other places as well.
The statistical software package used in this book is Stata (release 10). Downloadable command files (do files in Stata's terminology), files of results (log files), and ancillary computer files used in the computations are available at www.josseybass.com/go/quantitativedataanalysis. Often the details underlying particular computations are only found in the downloadable do and log files, so be sure to download and study them carefully. These files will be updated as new releases of Stata become available.
I use Stata in my teaching and in this book because it has very rapidly become the statistical package of choice in leading sociology and economics departments. This is not accidental. Stata is a fast and efficient package that includes most of the statistical procedures of interest to social scientists, and new commands are being added at a rapid pace. Although many statistical packages are available, the three leading contenders currently are Stata, SPSS, and SAS. As software, Stata is clearly superior to SPSS: it is faster, more accurate, and includes a wider range of applications. SAS, although very powerful, is not nearly as intuitive as Stata and is more difficult to learn (and to teach). Nonetheless, this book can be readily used in conjunction with either SPSS or SAS, simply by translating the syntax of the Stata do files. (I have done something like this, exploiting Allison's excellent, but SAS-based, exposition of fixed- and random-effects models [Allison 2005] by writing the corresponding Stata code.)
FOR INSTRUCTORS
Some notes on how I have used these materials in teaching may be helpful to you as you design your own course.
As noted previously, the course on which this book is based runs for two quarters (twenty weeks). I have offered one three-hour lecture per week and have assigned an exercise every week. When I first taught the course, I read these exercises myself, but as enrollments have increased, I have enjoyed the services of a T.A. (chosen from among those who had done well in the course in previous years), who assists students with the mysteries of computing and statistics and also reads and comments on the exercises. In recent years, I have offered seventeen lectures and have assigned exercises for all but the last, with the final month of the course devoted to producing two drafts of a term paper. I read the first drafts and write comments, in an attempt to emulate a journal submission process. Thus, in my course, everyone gets a "revise and resubmit" response. I encourage students to develop their term papers in the course of doing the exercises and to complete their drafts in the two weeks after the last exercise is due.
The initial exercises are designed to lead students in a guided way through the mechanics of analysis, and some of the later exercises do this as well. But the exercises increasingly take a free form: "carry out an analysis like that presented in the book." Illustrative answers are provided for those exercises that involve definitive answers, that is, those similar to statistics problem sets.
The course syllabus, weekly exercises, and illustrative answers to those exercises for which I have written illustrative answers are available for downloading from www.josseybass.com/go/quantitativedataanalysis.
ACKNOWLEDGMENTS
As noted earlier, this book has been developed in interaction with many cohorts of graduate students at UCLA who have wrestled with each of the chapters included here and have revealed troubles in the exposition, sometimes by way of explicit comments and sometimes via displays of confusion. The book would not exist without them, as I never imagined myself writing a textbook, and so I owe them great thanks. One in particular, Pamela Stoddard, literally caused the book to be published in its current form by mentioning in the course of a chance airplane conversation with Andrew Pasternack, a Jossey-Bass acquisitions editor, that her professor was thinking of publishing the chapters he used as a course text. Andy contacted me, and the rest is history.
The course on which this book is based first came into being through collaboration with my colleague Jonathan Kelley, when he was a visiting professor at UCLA in the [...]. The first exercise is borrowed from him, and the general thrust of the course, especially the first half, owes much to him.
My colleague, Bill Mason, recently retired from the UCLA Sociology and Statistics departments, has been my statistical guru for many years. Often I have turned to him for insights into difficult statistical issues. And much that I have learned about topics that were not part of the curriculum when I was a graduate student has been from sitting in on advanced statistics courses offered by Bill. Another colleague, Rob Mare, has been helpful in much the same way. My new colleague, Jennie Brand, who took over my quantitative data analysis course in the fall of 2008, has read the entire manuscript and has offered helpful suggestions. Finally, the book has benefited greatly from very careful readings by a group of about 100 Chinese students, to whom I gave a special version of the course in an intensive summer session at Beijing University in July 2008. They caught
many errors that had gone unnoticed and raised often subtle points that resulted in the reworking of selected portions of the text.
My understanding of research design and statistical issues, especially concerning causality and threats to causal inference, has benefited greatly from the weekly seminar of the California Center for Population Research, which brings together sociologists, economists, and other social scientists to listen to, and comment on, presentations of work in progress, mainly by visitors from other campuses. The lively and wide-ranging discussion has been something of a floating tutorial, a realization of what I have imagined academic life could and should be like.
Finally, my wife, Judith Herschman, has displayed endless patience, only occasionally asking, "When are you going to finally publish your methods book?"
THE AUTHOR
Donald J. Treiman is distinguished professor of sociology at the University of California at Los Angeles (UCLA) and was until recently director of UCLA's California Center for Population Research. He has a BA from Reed College (1962) and an MA and PhD from the University of Chicago (1967). As a graduate student at Chicago, he spent most of his time at the National Opinion Research Center (NORC), where he gained valuable training and experience in survey research. He then taught at the University of Wisconsin, where he decided that he really was a social demographer at heart, and made the Center for Demography and Ecology his intellectual home. From Wisconsin, he moved to Columbia University and then, in 1975, to UCLA, where he has been ever since, albeit with temporary sojourns elsewhere, as staff director of a study committee at the National Academy of Sciences/National Research Council (1978–1981) and fellowship years at the U.S. Bureau of the Census (1987–1988), the Center for Advanced Study in the Behavioral Sciences (1992–1993), and the Netherlands Institute for Advanced Study in the Humanities and Social Sciences (1996–1997).
Professor Treiman started his career as a student of social stratification and status attainment, particularly from a cross-national perspective, and this has remained a continuing interest. He and his Dutch colleague, Harry Ganzeboom, have been engaged in a long-term cross-national project to analyze variations in the status attainment process in nations throughout the world over the course of the twentieth century. To date, they have compiled an archive of more than 300 sample surveys from more than 50 nations, ranging through the last half of the century. In addition to his comparative project, Professor Treiman has conducted large-scale national probability sample surveys in South Africa (1991–1994), Eastern Europe (1993–1994), and China (1996), all concerned with various aspects of social inequality.
His current research has moved in a more demographic direction. He has a national probability sample survey currently in progress in China, which focuses on the determinants, dynamics, and consequences of internal migration.
INTRODUCTION
It is not uncommon for statistics courses taken by graduate students in the social sciences to be treated essentially as mathematics courses, with substantial emphasis on derivations and proofs. Even when empirical examples are used, which they frequently are because [...]

[...] 1947 and = 0 otherwise. Then for those born in 1947 or earlier,

$$\hat{E} = a + b_1(X) + b_2(0) = a + b_1(X)$$

while for those born later than 1947

$$\hat{E} = a + b_1(X) + b_2(X - 47)$$

Thus, for those born in 1948, the expected level of education is given by $(a + 48b_1) + b_2$; for those born in 1949 it is $(a + 49b_1) + 2b_2$; and so on. From this, it is evident that $b_2$ gives the deviation of the slope from the slope for the previous line segment. For useful discussions of these methods, see Smith (1979) and Gould (1993).
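The one-knot coding implied by the two equations above can be sketched in a few lines of Python (the book itself works in Stata; Python is used here only as an illustration). The function names are hypothetical, the intercept is a made-up value, and the two slopes are the segment slopes quoted for this model in the discussion of the results, so this is a sketch under those assumptions rather than the book's own code.

```python
def spline_terms(birth_year, knot=47):
    """Return (X, Z) for the one-knot coding implied by the equations above.

    birth_year is the last two digits of the year of birth. Z equals
    birth_year - knot for people born after the knot and 0 otherwise.
    """
    x = birth_year
    z = max(birth_year - knot, 0)
    return x, z


def expected_education(birth_year, a, b1, b2, knot=47):
    """Expected years of schooling: a + b1*X + b2*Z."""
    x, z = spline_terms(birth_year, knot)
    return a + b1 * x + b2 * z


# Illustrative values only. The intercept is hypothetical; b1 is the slope
# up to the knot, and b1 + b2 is the slope after the knot.
a = 8.0
b1 = 0.086
b2 = 0.0092 - 0.086

slope_before = expected_education(40, a, b1, b2) - expected_education(39, a, b1, b2)
slope_after = expected_education(60, a, b1, b2) - expected_education(59, a, b1, b2)
print(round(slope_before, 4), round(slope_after, 4))
```

Under this coding the second coefficient is the change in slope at the knot, so the slope of the later segment is recovered as `b1 + b2`.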
Multiple Regression Tricks: Techniques for Handling Special Analytic Problems
Table 7.2. Coefficients for a Linear Spline Model of Trends in Years of School Completed by Year of Birth, U.S. Adults Age 25 and Older, and Comparisons with Other Models (Pooled Data [...]; N = 39,324).
s,e. 5bpe '. :: ,:..'i'.: 5.ope(bjrthyearsI94j1979)
,i.:: .0092
.o024
r*""u1,rr,1.,:, Model Comparisons
2) Lineartrendmodel
.1167
(3) :. i5
I ) vs.(2)
5 31
.0121
545.2
1;39321 .OOO0
clearly inferior by the BIC criterion, and occurs simply as a consequence of the large sample size. Thus, I accept the linear spline model as the preferred model.
The coefficients for the line segments indicate that for people born in 1947 or earlier, there is an expected increase of .086 years of schooling for each successive birth cohort. Thus, people born twelve years apart would be expected to differ on average by about a year of schooling. However, for people born in 1947 or later, there is almost no trend in educational attainment; the coefficient .0092 implies that it would take about a century for average schooling to increase by one year. This is a somewhat surprising result, especially because there have been substantial [...] among disadvantaged minorities, that is, [...], and also among women. However, as Mare (1995, [...]) notes, educationally disadvantaged proportions of the population have grown over time relative to the White majority. Disaggregation of the trend would be worthwhile but cannot be pursued here; it would make an interesting paper. The graph implied by the coefficients for the linear spline model is shown in Figure 7.8, together with a 2 percent random sample of observations from each cohort (reduced from 5 percent to 2 percent to make it easier to see the shape of the spline). In this figure the jitter feature in Stata is used to make it clear where in the graph there is the greatest density of points.
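The interpretation of the two segment slopes is simple arithmetic, and a short Python sketch makes it explicit. The function name is hypothetical; the two slopes are the ones quoted in the paragraph above.

```python
def years_per_schooling_year(slope):
    """Number of successive birth cohorts needed for expected schooling to rise by one year."""
    return 1.0 / slope


early = years_per_schooling_year(0.086)   # cohorts born in 1947 or earlier
late = years_per_schooling_year(0.0092)   # cohorts born in 1947 or later
print(round(early, 1), round(late, 1))    # prints 11.6 108.7
```

The first figure corresponds to the statement that people born about twelve years apart differ by roughly a year of schooling, and the second to the statement that a one-year gain would take about a century.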
Figure 7.8. Trend in Years of School Completed by Year of Birth, U.S. Adults (Same Data as for Figure 7.5; Scatter Plot Shown for 2 Percent Sample). Predicted Values from a Linear Spline with a Knot at 1947. Horizontal axis: year of birth, 1900 to 1980.
A Second Worked Example, with a Discontinuity: Quality of Education in China Before, During, and After the Cultural Revolution
The typical use of spline functions is to estimate equations such as the one just discussed, in which all points are connected but the slope changes at specified points ("knots"). However, there are occasions in which we may want to posit discontinuous functions. The Chinese Cultural Revolution is such a case. It can be argued that the disruption of social order at the beginning of the Cultural Revolution in 1966 was so massive that it is inappropriate to assume any continuity in trends. Deng and Treiman (1997) make just such an argument with respect to trends in educational reproduction. They argue that there was then a gradual "return to normalcy" so that changes resulting from the end of the Cultural Revolution in 1977 were not nearly as sharp and were appropriately represented by a knot in a spline function rather than a break in the trend line.
Here we consider another consequence of the Cultural Revolution, the quality of education received (the example is adapted from Treiman [2007a]). Although primary schools remained open throughout the Cultural Revolution, higher level schools were shut down for varying periods: most secondary schools were closed for two years, from 1966 to 1968, and most universities and other tertiary-level institutions were closed for six years, from 1966 to 1972. Moreover, it was widely reported that even when the
schools were open, little conventional instruction was offered; rather, school hours were taken up with political meetings and political indoctrination. Rigorous academic instruction was not fully reinstituted until 1977, after the death of Mao. Under the circumstances, we might well suspect that, quite apart from deficits in the amount of schooling acquired by those who were unfortunate enough to be of school age during the Cultural Revolution period, those cohorts also experienced deficits in the quality of schooling compared to those who obtained an equal amount of schooling before or after the Cultural Revolution.
To test this hypothesis, we can exploit the ten-item character recognition test administered to a national sample of Chinese adults that was also analyzed in Chapter Six (see Table 6.2). As before, I take the number of characters correctly identified as a measure of literacy and hypothesize that, net of years of school completed, people who turned age eleven during the Cultural Revolution would be able to recognize fewer characters than people who turned eleven before or after the Cultural Revolution period. Moreover, following Deng and Treiman (1997), I posit a discontinuity in the scores at the beginning but not at the end of the period. To do this, I estimate an equation of the form:
$$\hat{C} = a + b_1(B_1) + b_2(B_2) + b_3(B_3) + c_1(D_1) + d_1(S) \qquad (7.38)$$

where $\hat{C}$ is the expected number of characters recognized; $S$ = years of schooling; $B_1$ = year of birth (last two digits) if born prior to or in 1955 and = 55 if born subsequent to 1955; $B_2$ = 0 if born prior to 1956, = year of birth − 55 if born between 1956 and 1967, inclusive, and = 67 − 55 if born subsequent to 1967; $B_3$ = 0 if born prior to or in 1967 and = year of birth − 67 for those born after 1967; and $D_1$ = 0 for those born prior to or in 1955 and = 1 for those born after 1955. Note that the difference between this and Equation 7.35 is that I include a dummy variable to distinguish those born after 1955 from those born earlier; this is what permits the line segments to be discontinuous at 1955. If I were to have posited a discontinuity at 1967 as well, the equation would be the mathematical equivalent to estimating three separate equations, for the period before, during, and after the Cultural Revolution, in each case predicting the number of characters recognized from years of schooling and year of birth. The advantage of equations such as Equation 7.38 is that they permit the specification of alternative models within a coherent framework and by so doing permit us to select between models.
Estimating this equation yields the results shown for Model 4 in Tables 7.3 and 7.4. As in the previous example, I contrast my theory-driven specification with other possibilities: that there is a simple linear trend in the data; that there are year-by-year variations; that there are knots at both the beginning and the end of the Cultural Revolution, but no discontinuities; that there are discontinuities at both the beginning and the end of the Cultural Revolution; and, for the three spline functions, that there is a curvilinear relationship between year and knowledge of characters during the Cultural Revolution period.
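The birth-year terms defined for Equation 7.38 can be sketched as a small Python function (the book's own code is in Stata; this is only an illustration, and the function name is hypothetical). It returns the three spline terms and the discontinuity dummy for a given birth year, coded as the last two digits of the year.

```python
def cultural_revolution_terms(birth_year):
    """Return (B1, B2, B3, D1) for a two-digit birth year, as defined for Equation 7.38."""
    b1 = min(birth_year, 55)                    # year of birth up to 1955, then held at 55
    b2 = min(max(birth_year - 55, 0), 67 - 55)  # 0 before 1956, rises to 12 by 1967, then held
    b3 = max(birth_year - 67, 0)                # 0 through 1967, then year of birth - 67
    d1 = 1 if birth_year > 55 else 0            # permits the break in level at 1955
    return b1, b2, b3, d1


for year in (50, 55, 56, 60, 67, 70):
    print(year, cultural_revolution_terms(year))
```

Because the three spline terms sum to the birth year itself, the segments join at 1967 (a knot), while the dummy `d1` is what allows a jump at 1955.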
Table 7.3. Goodness-of-Fit Statistics for Models of Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling, with Various Specifications of the Effect of the Cultural Revolution (Those Affected by the Cultural Revolution Are Defined as People Turning Age 11 During the Period 1966 through 1977), Chinese Adults Age 20 to 69 in 1996 (N = 6,086).
':=i o: schocl;:l: .665 .616
i 956'196: .g
i:i6725.9
6723.9
.612
 6722.1
.611
:: 

6724.1
. 2A 71.72
6717.4
1116.33 :.
..1,i1,/:
'::(
€ar 1r€tc 'f5
. ':::
4.26  42.4
:a Ba . a .=' .
30.04 ::
54.43
.003
s1.11
1.8
.00'l
. 6.86
6.5
.000
Table 7.4. Coefficients for Models 4, 5, and 7 Predicting Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling (p Values in Parentheses).
:: 's of schooling
 i 955 or earlier(age11 1965or earler)
:
:
19561967(age11 1966*1977)
: r   1968or l a te r(a g e1 1 1 9 7 8o r l a te r)
:  i: q inu: t ya t ' 9 5 5
Model4
Model 5
.443 (.000)
.443 (.000)
A44 (.000)
0.001 \.721)
0.001 (.134\
0.001 (.749)
0.043 (0.000).
0.032 (0.000)
0.041 (.000)
0.016
0.557 ( 000)
*0.508. (.000)
. o.o4l (0.18s) 0.028 (.012) 0.349. / nnl\
o.241 (.010)
, : : r nt inuit ya t 1 9 6 7
0.0066 (.00e)
:,llineartend'195ffi7
= : (rootmeansquareerror)
Model 7
0.770
0.770
0.771
0.571
0.672
o.672
1.29
1.29
1.29
Comparison of the BICs suggests that three models (my hypothesized model, a model that in addition to a discontinuity at the beginning of the Cultural Revolution allows the trend during the Cultural Revolution period to be curvilinear, and a model positing discontinuities at both the beginning and the end of the Cultural Revolution) are about equally likely given the data, albeit with weak evidence favoring the single-knot model. Note that all three are strongly to be preferred over all other models.
Again, BIC and classical inference yield contradictory results because the two alternative models fit significantly better (at the 0.01 level) than does the originally hypothesized model. Here I am in a bit of a quandary as to which model to prefer. I have already stated a basis for positing a single discontinuity, plus a knot at the end of the Cultural Revolution. However, another analyst might favor a two-discontinuity model, on the ground that the curricular reform in 1977 that restored the primacy of academic subjects was radical enough to posit a discontinuity at the end as well as at the beginning of the Cultural Revolution. A third analyst might argue that a linear specification of trends, especially in times of great social disruption, is too restrictive and that it makes more sense to posit a curvilinear effect of time during the Cultural Revolution period. In Treiman (2007a), I presented the model positing a discontinuity at 1955, a knot at 1967, and a curve between 1955 and 1967 (see Figure 7.4 in that paper). However, the truth is that there is no clear basis for preferring any one of the three, except for the evidence provided by BIC, which suggests that the originally hypothesized model is slightly more likely than the others given the data. Again, my suggestion is, go with theory. If you have a theoretical basis for one specification over the others, that is the one to feature; but, at the same time, you must be honest about the fact that alternative specifications fit nearly equally well. In fact, the optimal approach is to present all three models and invite the reader to choose among them. A warning: if you do this, you probably will have to fight with journal editors, who are always trying to get authors to reduce the length of papers, and perhaps with reviewers, who sometimes seem to want definitive conclusions even when the evidence is ambiguous.
The estimated coefficients for all three models are shown in Table 7.4. In all three models each additional year of schooling results in nearly half a point improvement in the number of characters identified. However, the coefficients associated with trends over time are relatively difficult to interpret. Again, this is an instance in which graphing the relationship helps. Figure 7.9 shows, for each of the three preferred models, the predicted number of characters recognized for people with twelve years of schooling, that is, who have completed high school. Although the three graphs appear to be quite different, they all show a decline of about half a point in the number of characters identified for those who were age eleven during the early years of the Cultural Revolution period, relative to those with the same level of schooling who turned eleven before and after the Cultural Revolution. Thus, despite the difficulty in choosing among alternative specifications, together they strongly suggest that the quality of education declined during the Cultural Revolution. People who acquired their middle school (junior high school) education during the Cultural Revolution, in effect, lost a year of schooling; that is, they displayed knowledge of vocabulary equivalent to those with one year less schooling who were educated before and after the Cultural Revolution.
Still, we should be cautious in our interpretation of Figure 7.9, where the Cultural Revolution effect appears to be quite large because of the way the data are graphed (with the y-axis ranging from 5.3 to 6.7 characters recognized). Indeed, Figure 7.10, in which the y-axis ranges from 0 to 10, suggests a rather different story: a very modest decline in the number of characters recognized. It is quite reasonable to report figures such as Figure 7.9 to make the differences among the models [...]
does not permit more than two-way products, I take advantage of a user-written command, -desmat- (Hendrickx 1999, 2000, 2001a, 2001b), to specify the required variables. (See the downloadable files "ch12_1.do" and "ch12_1.log" for details.) Also, because -glm- does not provide all the coefficients shown in Table 12.3 and produces an incorrect estimate of BIC (given the way I have specified the problem, -glm- counts as cases the number of cells in the table rather than the number of people in the sample), I have [...]
Log-Linear Analysis
[...] (for "goodness of fit") and a terse version, [...] the downloadable file for this chapter.
[...] other Stata estimation command [...] because it can handle many kinds of linear models, you must specify which model [...]. [...] Poisson distribution [...] the dependent variable. Specifying [...] a log-linear model.
[...] the -glm- command generates [...] shown in the first line of Table 12.6. I then repeat the process for a model that [...] to be related but not to interact [...]. [...] coefficients shown on the bottom line of Table 12.6 (Model 8). Clearly, this model fits the data well by all criteria, indeed so well as to suggest that a simpler model [...] the remaining coefficients in the table. Inspecting these statistics, we see that none of these models fits the data adequately. [...] I settle on [ARS][AC][RC][SC] as my preferred model. Evidently, age, region,
Table 12.6. Goodness-of-Fit Statistics for Log-Linear Models of the [...] "A Communist Should Be Allowed to Speak in Your Community," Age, Region, and Education, U.S. Adults, 1977.
"a"f,r, Lz
d.f.
BIC
L'ILtr
A
N1 r:;q r t
15_1 :]:;S]IA C]
o/t
1
.69
10.7
a_2
B :_:,sllRcl c::lll(at
87.75
6
.000
44.0
.44
; \)lrALt'KLt
84.j2
5
.000
48.2
.42
a : ; s l l A cl[5C]
48.69
5
.000
12.2
44.74
5
.oo0
4.2
.22
1.7
re.:lslJAcl[RC][sC] 2.92
4
.571
_26.3
01
1.5
r .:xsllRc][sc]
7.8
Table 12.7. Expected Percentage (from Model 8) Agreeing That "A Communist Should Be Allowed to Speak in Your Community" by Education, Age, and Region, U.S. Adults, 1977.
Age
Region
No College
College
::;;; : :::ri:I;::;: t.::.:..
::t*t*i (253)
:::
(182) 567 (46',)
47.8 (411)
74.3
(13s)
Note: Cell frequencies are shown in parentheses.
and education all affect attitudes regarding communist speakers. To see what these effects are, I percentage the table of frequencies predicted by this model. (To see how to do this, consult the downloadable files "ch12_1.do" and "ch12_1.log.") These percentages are shown in Table 12.7. The table clearly shows that, controlling for each of the other factors, those who are better educated, younger, and non-Southern are more likely to support the right of a communist to give a speech. In each comparison, the percentage differences always go in the same direction and are quite substantial.
The attitudes reported here are from thirty years ago, during the height of the Cold War. It would be of interest to determine whether the same pattern holds today. To do this within a log-linear framework, you would need to construct a second data set, based on recent data (for example, the 2006 GSS), to append the second data set to the first, to add an additional variable (T, for "time"), and then to assess whether it is necessary to posit an effect of time (or of interactions between time and any of the two-variable associations) to adequately represent the data. That is, you would estimate [ARS][AC][RC][SC], [ARS][AC][RC][SC][T], and [ARS][ACT][RCT][SCT], and perhaps some intermediate models, and compare their goodness of fit. If none of the more elaborate models produced a better fit than [ARS][AC][RC][SC], which is just Model 8 replicated for the pooled data, you would conclude that attitudes regarding the rights of communists have not changed between 1977 and 2006. If [ARS][AC][RC][SC][T] emerged as the preferred model, you would conclude that there had been an across-the-board change (presumably an increase) in support for the civil liberties of communists. If [ARS][ACT][RCT][SCT] emerged as the preferred model, you would conclude that the structure of the relationships between age, region, and education, respectively, and support for the civil rights of communists changed between 1977 and 2006. [...]
Log-Linear Analysis with Polytomous Variables
[...] race, and membership are dichotomous variables, but [...], which requires that we create two dummy variables: s2 (= 1 for high school graduates, = 0 otherwise) and s3 (= 1 for those with some [...], = 0 otherwise), with those lacking high school graduation as the omitted category.
Suppose we are interested in a model in which race, education, and membership [...]. [...] this model with the -glm- command:

glm count r s2 s3 m rs2 rs3 rm s2m s3m rs2m rs3m v vr vs2 vs3 vm, family(poisson)

where each of the compound variables is a product term, for example, rs2 = r*s2 [...] (see the downloadable files [...]).

Log-Linear Analysis with Individual-Level Data
[...] (Downloadable file "ch12_1.do" shows [...].)

PARSIMONIOUS MODELS
So far we have dealt with models of global association, or absence of global association [...]. [...] to test hypotheses regarding the [...] tables can be described by relatively simple models that generate the observed
Table 12.8. Frequency Distribution of Voting by Race, Education, and Voluntary Association Membership.
Didn't Vote
Race
n2 1. t
White
6
12
18 6C
Oneor more
24
lC
Source: Adapted from Knoke and Burke (1980, Table 3).
pattern of frequencies in the table. The development of such models to describe patterns of intergenerational occupational mobility has been a lively enterprise over the past thirty years or so, but the formal models developed in this context have applications far beyond the study of social mobility (for example, Radelet and Pierce 1985; Schwartz and Mare 2005; Roberts and Chick 2007; Domanski 2008). Still, it is convenient to illustrate these models in the context of mobility analysis. (See downloadable files "ch12_2.do" and "ch12_2.log" for details on the Stata procedures used to estimate the models in the remainder of the chapter.)
It is helpful to begin by deriving a general expression for log odds ratios. Recall Equation 12.4, which gives the natural log of expected frequencies for a two-variable
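table as a function of a set of $\mu$ parameters. From Equation 12.4 we can write an expression for the log odds ratio of the expected frequencies for cells formed from any pair of rows ($i$ and $i'$) and columns ($j$ and $j'$) in a two-variable table:

$$\log\frac{F_{ij}F_{i'j'}}{F_{ij'}F_{i'j}} = \log F_{ij} + \log F_{i'j'} - \log F_{ij'} - \log F_{i'j}$$

$$= (\mu + \mu_i^{R} + \mu_j^{C} + \mu_{ij}^{RC}) + (\mu + \mu_{i'}^{R} + \mu_{j'}^{C} + \mu_{i'j'}^{RC}) - (\mu + \mu_i^{R} + \mu_{j'}^{C} + \mu_{ij'}^{RC}) - (\mu + \mu_{i'}^{R} + \mu_j^{C} + \mu_{i'j}^{RC})$$

$$= \mu_{ij}^{RC} + \mu_{i'j'}^{RC} - \mu_{ij'}^{RC} - \mu_{i'j}^{RC} \qquad (12.9)$$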
If dummy variable coding is used, as in Stata's -glm- command, and $i'$ and $j'$ are the reference categories, the right side of Equation 12.9 simplifies to $\mu_{ij}^{RC}$, which makes clear that the interaction parameters represent the log odds ratios for each cell relative to the omitted categories (ordinarily the first row and first column).
(Note that to uniquely identify the coefficients, it is necessary to impose constraints. Two different constraints, or "normalizations," are typically used. One is effect coding [used in Equation 12.6 and Appendix 12.A], which expresses coefficients as deviations from the grand total by requiring that the log-form coefficients for each variable sum to zero. The other constraint, dummy variable coding, codes one category [in Stata, the first category] of each variable as zero.)
In the fully saturated model there is a unique coefficient for each cell of the table except, with dummy variable coding, the cells in the first row and first column. This model can be represented by the following design matrix (for a seven-by-seven table):
1   1   1   1   1   1   1
1   2   3   4   5   6   7
1   8   9  10  11  12  13
1  14  15  16  17  18  19
1  20  21  22  23  24  25
1  26  27  28  29  30  31
1  32  33  34  35  36  37     = full_dm
Note that a design matrix is simply a variable, with one value per cell, that imposes equality constraints on some subset of cells: all cells with the same value are constrained to have equal coefficients. This design matrix specifies that all the coefficients for the first row and first column are equal; in fact, they are (implicitly) zero by virtue of the dummy-variable coding. None of the remaining coefficients is constrained to be equal. This model uses all the available information, and the observed counts in each table are fit exactly.
Note that in Stata's -glm- command, the specification

xi: glm count i.X i.Y i.full_dm, family(poisson)
ffi lf ,]i,;f;.';::T.T:rrrT"ibutionoroccuparionbvFatherr Respondentt
Prof. cadre cler. iJes
in 1996 Ser.
117
Man.
810
Agric.
2,765
produces results identical to the usual way of specifying the saturated model:

xi: glm count i.X*i.Y, family(poisson)

That is, -glm- creates a design matrix like that of "full_dm" when the interaction is specified.
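To make concrete the idea that a design matrix is just one code per cell, here is a minimal Python sketch (not the book's Stata code; the function name is hypothetical) that builds the "full_dm" pattern for a square table. Cells in the first row and first column share the code 1, and every other cell gets its own code.

```python
def full_design_matrix(n):
    """Build an n-by-n design matrix for the saturated model under dummy coding.

    Cells in the first row and first column all receive code 1 (the reference
    cells); each remaining cell receives a unique code, numbered row by row.
    """
    dm = [[1] * n for _ in range(n)]
    code = 2
    for i in range(1, n):
        for j in range(1, n):
            dm[i][j] = code
            code += 1
    return dm


full_dm = full_design_matrix(7)
for row in full_dm:
    print(" ".join(f"{value:2d}" for value in row))

# Number of distinct interaction parameters (codes other than the reference code 1).
n_parameters = len({value for row in full_dm for value in row} - {1})
print(n_parameters)
```

For a seven-by-seven table this gives 36 distinct codes beyond the reference code, one for each cell outside the first row and first column.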
,,iil:,T::: yifi:r:il::;i{_j_", ffi a; :.{,i# fi :T}j,T,,: fi *o "?, ^ oo.','lnt"o',"" ; il:::"'ffi':J oio.no;^ ;"i*t:'ff o,. il::::i"t.," women ^i,,.
*.** women roincrease ro I havepooledmen increase ,f,. rhe ."ln","""#, *,0r. L?l juffi;":,fii:"jllf,iJffi:::,..::.# ,TLifll. ,reparatety. s,.iar
two.wayrabre:tharrr,"*"."_*uuini""ffi ll,: ;'i;3#:r::,il1#:',ily?":."j:::::l':: orathrceway ,ab,e i* tt"!:ibrTiiy f::i:;';?f :i,THtr ii!:X":i;yffi j i.'io8i,i '*"iJ:,ffi r,i'^." ro,esr "1,::::1r:1 i",r, the
nrsr condi*".,.il;, ff 1  ,#;.1ff;,ji;liir,l;a,. ""il1,J:fi ttri:L'Ttli:l il;;ffi ,fr ner o:h
ilflruffiff
,''r.,",,"iJ,.l and women {rhr ?olr,"y,oun.,"" G
nand R=;;;"oJ;l'":#ffi f :,nee .il"l.:H:nm: ;# l':"i?:.rufi ffi,rffi ,:i,ri;'f lffi*,l,,TtH1ffi
.#.iiff +*:f 11X11*"
LogLinearAnalysis28'l x.raly marginally significant.Given the relatively large sizeofthe sample,I am inclined to ttus on the BIC ratherthan thepvalue andconcludethat the first conditionis satisfled. To test the secondcondition, I contrasta model (call this Model B) thar omits the interEion betweensexandfather'soccupationthat is, [SR][FR]against Model A. The subIllrive argumentfor this is that in China, where almost all women are in the labor force, u€ shouldexpectno differencein the distribution of father's occupationfor employedmen ml rvomen.To contrastthe two models,I take the differencein the 17 and the differencein fu degreesof freedom to get the pvalue for the improvementof goodnessof fit resulting frm the addition of [SF] and also get the difference in B1C values.Although the fit of lf,rlel A is significantlybetterby classicalstandards (p : .019IL  L: = 67.18 52.03 =15 l5;dl, dJn:42  36 6l).ModelB is morelikelygiventhedaratBle BICA = 185.9  [250.6] : 35.3). Again,I am inclinedro put moreweighton theB1Cdiffoence and concludethat the secondcondition is satisfied.Thus I am willing to pool men nl $ omenfor the subsequentanalysis,which effectively doublesthe samplesize. Table 12.10showsthe coefficientsfor the saturatedmodel (see"ch10_2.do,'tosee h* thesecoefficientswere computedusing Stata).As we haveseen,thesecoefficients re not readily interpreted directly. However, in the present caseit may be of interest to .=nnast particular cells in the table. For example,we might ask about the relative chances r de child of an agricultural worker becoming an agricultural worker insteadof a mannal rr orker comparedto the conesponding odds for the child of a manual worker From Ewation 12.9it is evidentthat the log oddsratio can be computedas
$$\log\theta = \mu_{77}^{RC} + \mu_{66}^{RC} - \mu_{76}^{RC} - \mu_{67}^{RC} = 2.756 + 1.567 - 1.088 - .80 = 2.434 \qquad (12.10)$$

which implies that the relative odds are 11.40 (= $e^{2.434}$); that is, the children of agricultural workers are more than eleven times more likely to become agricultural workers themselves, rather than becoming manual workers, than are the children of manual workers. Similarly, the odds that the child of a professional will become a professional instead of becoming a cadre, compared to the corresponding odds for a child of a cadre, are

$$\log\theta = \mu_{11}^{RC} + \mu_{22}^{RC} - \mu_{12}^{RC} - \mu_{21}^{RC} = 0 + .627 - 0 - 0 = .627 \qquad (12.11)$$

which implies that the relative odds are 1.87 (= $e^{.627}$). Clearly, in China (as elsewhere) the "inheritance" of farm occupations relative to inflow from the children of manual workers is much stronger than the inheritance of professional occupations relative to inflow from the children of cadres.
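Two pieces of arithmetic in this passage can be checked with a short Python sketch: the p-value for the contrast between Model B and Model A, and the relative odds implied by the interaction parameters. The function names are hypothetical, and the chi-square tail probability uses a closed form that holds only for an even number of degrees of freedom (which is the case here, with 6). All numbers are those quoted in the text.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    total = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * total


def relative_odds(mu_ij, mu_i2j2, mu_ij2, mu_i2j):
    """Odds ratio implied by Equation 12.9: exp(mu_ij + mu_i'j' - mu_ij' - mu_i'j)."""
    return math.exp(mu_ij + mu_i2j2 - mu_ij2 - mu_i2j)


# Model B versus Model A: difference in L-squared and in degrees of freedom.
l2_difference = 67.18 - 52.03
df_difference = 42 - 36
p_value = chi2_upper_tail_even_df(l2_difference, df_difference)

# Equations 12.10 and 12.11.
farm_vs_manual = relative_odds(2.756, 1.567, 1.088, 0.80)
professional_vs_cadre = relative_odds(0.0, 0.627, 0.0, 0.0)

print(round(p_value, 3), round(farm_vs_manual, 1), round(professional_vs_cadre, 2))
# prints 0.019 11.4 1.87
```

The printed values agree with the p = .019 and the relative odds of roughly 11.4 and 1.87 reported above.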
Topological, or Levels, Models
[Table 12.10. Interaction Parameters for the Saturated Model Applied to Table 12.9. Rows: father's occupation when R age 14; columns: respondent's occupation in 1996 (Prof., Cadre, Cler., Sales, Ser., Man., Agric.).]

Having shown how to interpret the interaction coefficients, I next address whether the table can be simplified. In particular, given the lack of differentiation between sales and service workers in the Chinese economy, I suspect that these two categories might reasonably be collapsed. [...] involving [...] the following design matrix:

1  1  1  1  1  1  1
1  2  3  4  4  5  6
1  7  8  9  9 10 11
1 12 13 14 14 15 16
1 12 13 14 14 15 16
1 17 18 19 19 20 21
1 22 23 24 24 25 26
= ss_dm
[...] 16.06, with 11 d.f. because only twenty-five of the [...] as a seven-by-seven table [...] subsequent analysis.

[...] you should keep in mind this [...] you are trying to decide [...] categories of a table. [...] cells of a table as having [...] identical [...]

Quasi-Independence Models

[...] if people are able to free themselves from the social [...] (on the hypothesis [...])
[...] diagonal cells of the table but otherwise forces all interaction parameters to be identical:

2 1 1 1 1 1
1 3 1 1 1 1
1 1 4 1 1 1
1 1 1 5 1 1
1 1 1 1 6 1
1 1 1 1 1 7
= diag_dm
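A design matrix of this kind is easy to generate for a table of any size. The sketch below is an illustration in Python rather than the Stata code used for the book; the function name is hypothetical. It gives every off-diagonal cell the reference level 1 and each diagonal cell its own level, then counts the interaction parameters this implies.

```python
def diagonal_design_matrix(n_categories):
    """Level 1 for every off-diagonal cell; a distinct level (2, 3, ...) per diagonal cell."""
    return [
        [i + 2 if i == j else 1 for j in range(n_categories)]
        for i in range(n_categories)
    ]


diag_dm = diagonal_design_matrix(6)

# Each level other than the reference level 1 costs one parameter.
n_levels = len({value for row in diag_dm for value in row})
interaction_params = n_levels - 1

# The independence model for a 6 x 6 table has (6 - 1) * (6 - 1) = 25 d.f.
df_quasi_independence = (6 - 1) * (6 - 1) - interaction_params

for row in diag_dm:
    print(" ".join(str(value) for value in row))
print("d.f. =", df_quasi_independence)
```

The count of 19 degrees of freedom is the standard result for quasi-independence in a six-by-six table: 25 for independence less one parameter per diagonal cell.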
As we can see from the first row of the second panel of Table 12.11, this model is a huge improvement over the independence model, which is the baseline model in Table 12.11. Although it does not fit by classical standards, it is more likely than the saturated model and misclassifies about 2 percent of the cases. Still, other models might fit even better.
[Table 12.11. Goodness-of-Fit Statistics for Alternative Models of Intergenerational Occupational Mobility in China (Six-by-Six Table). Columns: $L^2$, d.f., p, BIC, $\Delta$. Models shown, without and with the diagonal cells fitted exactly: quasi-independence; quasi-symmetry; crossings; uniform association; linear-by-linear, ISEI; linear-by-linear, urban hukou; linear-by-linear, ISEI + urban hukou; row-and-column effects I; row-and-column effects II (RC).]
Quasi-Symmetry Models
An important issue in social mobility research is whether, net of any shift in the marginals, the relative odds of upward and downward mobility between corresponding categories are symmetrical. The following design matrix specifies this model for the six-by-six table:

2  1  1  1  1  1
1  3  8  9 10 11
1  8  4 12 13 14
1  9 12  5 15 16
1 10 13 15  6 17
1 11 14 16 17  7
= qs_dm
As we see in Table 12.11, this model fits slightly better than the quasi-independence model by the likelihood ratio standard but not nearly so well by the BIC standard.
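Crossings Models

Suppose we were to take the occupational categories in our six-by-six table as representing social classes, with boundaries that constitute barriers to mobility. Suppose further that, in an analogy to movement across physical space, it is necessary to "cross" each barrier between adjacent classes to achieve mobility between nonadjacent classes. We can represent this model (following Powers and Xie 2000, 117) as

$$F_{ij} = \tau\,\tau_i^R\,\tau_j^C\,\nu_{ij} \tag{12.12}$$

where

$$\nu_{ij} = \begin{cases} \prod_{u=j}^{i-1} \nu_u & \text{for } i > j \\ \prod_{u=i}^{j-1} \nu_u & \text{for } i < j \\ \delta_i & \text{for } i = j \end{cases}$$

This specification implies the following interaction parameters for the cells of the six-by-six table (with the diagonal cells fitted exactly):

$$\begin{pmatrix}
\delta_1 & \nu_1 & \nu_1\nu_2 & \nu_1\nu_2\nu_3 & \nu_1\nu_2\nu_3\nu_4 & \nu_1\nu_2\nu_3\nu_4\nu_5 \\
\nu_1 & \delta_2 & \nu_2 & \nu_2\nu_3 & \nu_2\nu_3\nu_4 & \nu_2\nu_3\nu_4\nu_5 \\
\nu_1\nu_2 & \nu_2 & \delta_3 & \nu_3 & \nu_3\nu_4 & \nu_3\nu_4\nu_5 \\
\nu_1\nu_2\nu_3 & \nu_2\nu_3 & \nu_3 & \delta_4 & \nu_4 & \nu_4\nu_5 \\
\nu_1\nu_2\nu_3\nu_4 & \nu_2\nu_3\nu_4 & \nu_3\nu_4 & \nu_4 & \delta_5 & \nu_5 \\
\nu_1\nu_2\nu_3\nu_4\nu_5 & \nu_2\nu_3\nu_4\nu_5 & \nu_3\nu_4\nu_5 & \nu_4\nu_5 & \nu_5 & \delta_6
\end{pmatrix}$$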
These parameters can be estimated by summing six design matrices, one for each crossing parameter plus one for the diagonal design matrix (diag_dm), and taking antilogs. The corresponding model that does not fit the diagonal exactly is estimated in the same way except that the diagonal design matrix is omitted. Here are the five design matrices for the crossings parameters:

cr1_dm    cr2_dm    cr3_dm    cr4_dm    cr5_dm
011111    001111    000111    000011    000001
100000    001111    000111    000011    000001
100000    110000    000111    000011    000001
100000    110000    111000    000011    000001
100000    110000    111000    111100    000001
100000    110000    111000    111100    111110
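The pattern in these five matrices follows one rule: a cell is marked 1 for a given crossing when the move from the row category to the column category passes the barrier between categories k and k + 1. A minimal Python sketch of that rule is below; the function name is hypothetical and the code is an illustration, not the Stata code used for the book.

```python
def crossing_design_matrix(barrier, n_categories=6):
    """0/1 matrix marking cells whose origin-destination move crosses `barrier`.

    Barrier k lies between categories k and k + 1 (categories numbered from 1).
    """
    return [
        [1 if min(i, j) <= barrier < max(i, j) else 0
         for j in range(1, n_categories + 1)]
        for i in range(1, n_categories + 1)
    ]


# One design matrix per barrier in the six-by-six table.
cr_dm = {k: crossing_design_matrix(k) for k in range(1, 6)}

# Summing the five matrices gives the number of barriers crossed in each cell.
barriers_crossed = [
    [sum(cr_dm[k][i][j] for k in cr_dm) for j in range(6)]
    for i in range(6)
]

for k, matrix in cr_dm.items():
    print(f"cr{k}_dm")
    for row in matrix:
        print("".join(str(value) for value in row))
```

Because the crossings are additive on the log scale, the cell-by-cell sum equals the absolute difference between the row and column category numbers, so a move from category 1 to category 6 crosses all five barriers.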
As we see in Table 12.11, the crossings model fits better than any of the other models we have reviewed so far. Interestingly, fitting the diagonal cells exactly degrades the fit slightly by the BIC standard, presumably because [...] the diagonal cells is captured well by the crossings parameters, and the additional degrees of freedom used by fitting the diagonal exactly are penalized by BIC. The crossings parameters for the simpler crossings model are

Crossing     1      2      3      4      5
Parameter  0.138  0.002  0.203  0.228  1.033

Clearly, by far the most difficult transition (crossing) is between farm and nonfarm occupations (specifically, manual occupations); this is true everywhere, and China is no exception. Interestingly, the least difficult transition is between cadre and clerical occupations. Again, this is no particular surprise, because in China no sharp distinction is made between clerical and administrative tasks and the brightest and best of the clerical staff are often tapped to become cadres. The known intragenerational mobility pattern may well carry over to intergenerational mobility, with clerical positions seen as reasonable starting points for the children of administrative cadres and cadre positions as attainable upward mobility goals for the children of clerical workers. Finally, this result could be due to the fact that the analysis here combines males and females. It could well be that the daughters of cadres disproportionately tend to become clerical workers.

Uniform Association Models

When the categories of a table are ordered, it is possible to estimate more parsimonious models than are available for nominal categories. The simplest such model assumes that the difference between each pair of adjacent categories is equal, so that the scale for each variable can be represented by consecutive integers. The model is

$$\log F_{ij} = \mu + \mu_i^R + \mu_j^C + \beta ij \tag{12.13}$$
where the strength of the association between the row level and the column level is captured by $\beta$. From this it follows that the log odds ratio between row categories $i$ and $i'$ and column categories $j$ and $j'$ is just

$$\log\theta = \beta(i - i')(j - j') \tag{12.14}$$
Table 12.11 shows goodness-of-fit statistics for the uniform association model with and without the main diagonal fitted exactly. As you can see, when the diagonal cells are not fit exactly, the uniform association model fits very badly. The reason for this is simple: people disproportionately tend to remain in the same occupational category as their fathers. This tendency is captured by fitting [...] diagonal cells are estimated exactly. When the diagonal cells are estimated exactly, the uniform association model fits reasonably well. It yields $\beta = .046$. From Equation 12.14 we can see that this implies, for example, that the odds that the child of a professional will become a professional rather than a farmer are more than three times the corresponding odds for the child of a farmer: $.046(1 - 6)(1 - 6) = 1.150$; $e^{1.150} = 3.158$. These are rather low odds, which is consistent with the general sense that intergenerational mobility in China is easier than in most other nations ([...] 2007, for a [...] argument).

Linear-by-Linear Association Models

Now suppose we have more information than simply a rank order of categories, for example, socioeconomic status scores. We can then estimate a linear-by-linear association model, where the scale scores are substituted for the category indexes. That is, instead of Equation 12.13, we have

$$\log F_{ij} = \mu + \mu_i^R + \mu_j^C + \beta x_i y_j \tag{12.15}$$

with the log odds ratio given by

$$\log\theta = \beta(x_i - x_{i'})(y_j - y_{j'}) \tag{12.16}$$
Estimating this model for the Chinese data, with occupation categories scored by their mean occupational status (ISEI; see Ganzeboom, De Graaf, and Treiman 1992), we achieve a model that fits marginally better than the uniform association model, by the BIC criterion. For this model, $\beta = .000483$. Thus for the same categories as in the uniform association example, we have $.000483(16.2 - 63.7)(16.2 - 63.7) = 1.090$; $e^{1.090} = 2.974$. We are hereby led to a qualitatively similar conclusion: the odds that the child of a professional will become a professional rather than a farmer are about three times as large as the corresponding odds for the child of a farmer.

Note that it is possible to include more than one scaling of the categories of a table to represent different concepts. Table 12.11 shows goodness-of-fit statistics for two additional linear-by-linear models, one of which scales occupations by the proportion of incumbents who have permanent urban registration (urban hukou) and the other of which uses both the ISEI and urban registration measures. As it happens, neither fits as well as the ISEI and uniform association models. However, if we wished to assess the log odds ratio using, say, the model that includes both measures, we would simply apply Equation 12.16 to both variables and compute the sum. (For a well-known application of this kind of model, see Hout 1984.)
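Equations 12.14 and 12.16 have the same form, so one small function covers both worked examples: with integer scores it gives the uniform association result, and with ISEI scores the linear-by-linear result. The sketch below is illustrative; the function name is an assumption, and the coefficients and scores are the ones quoted in the text.

```python
import math


def log_odds_ratio(beta, row_a, row_b, col_a, col_b):
    """Log odds ratio beta * (x_i - x_i') * (y_j - y_j') for scored categories."""
    return beta * (row_a - row_b) * (col_a - col_b)


# Uniform association: categories scored 1 (professional) through 6 (farmer).
uniform_log = log_odds_ratio(0.046, 1, 6, 1, 6)
uniform_odds = math.exp(uniform_log)

# Linear-by-linear association: the two mean ISEI scores quoted in the text.
isei_log = log_odds_ratio(0.000483, 16.2, 63.7, 16.2, 63.7)
isei_odds = math.exp(isei_log)

print(f"uniform association: log odds ratio {uniform_log:.3f}, odds ratio {uniform_odds:.3f}")
print(f"linear-by-linear:    log odds ratio {isei_log:.3f}, odds ratio {isei_odds:.3f}")
```

For a model with two scalings, the same function would be applied once per scaling and the two log odds ratios summed before exponentiating.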
Row Effects (and Column Effects) Models

Sometimes we are confident that one variable can be scored with an integer scale (that is, that the difference between each pair of adjacent categories is the same) but we are uncertain about how to order the other variable. In such cases we can estimate the unknown scores. In this model the expected frequencies are given by

$$\log F_{ij} = \mu + \mu_i^R + \mu_j^C + j\phi_i \tag{12.17}$$

where the $j$ index the categories of one variable and the $\phi_i$ are the estimated scale scores for the other variable. The log odds ratio is given by

$$\log\theta = (\phi_i - \phi_{i'})(j - j') \tag{12.18}$$
As an example of a situation in which these conditions might hold, consider the relationship between size of place of origin and educational attainment, for the 1996 Chinese survey we have been using. Table 12.12 shows the bivariate frequency distribution for adults not currently attending school. In constructing this table, I have collapsed education so that the categories represent approximate three-year intervals in median schooling. The size-of-place categories are from the official administrative hierarchy of China, which strongly affects the flow of resources to places. Thus, in addition to the general advantage of urban residence for educational attainment (greater exposure to the written word and such), we would expect educational attainment to be greater for places higher in the administrative hierarchy because such places are the beneficiaries of more resources from the central government.

The row-effects model fits well ($BIC = -135$, $\Delta = 2.96$) although not by classical inference (p < .000). But contrary to my expectation, the estimated scores
[Table 12.12. Frequency Distribution of Educational Attainment by Size of Place of Residence at Age Fourteen, Chinese Adults Not Enrolled [...]]

[...] to make too much of this because the confidence intervals overlap (the 95 percent confidence interval is 0.71 to 1.01 for county-level cities and 0.63 to 0.84 for prefecture-level cities).
Column-effects models are formally identical to row-effects models, but with the role of rows and columns reversed. A column-effects model of the relationship between size of place at age fourteen and educational attainment does not fit as well as the corresponding row-effects model ($BIC = -108$, $\Delta = 2.98$, and p < .000), which suggests that the assumption of equal scale differences between adjacent size-of-place categories is probably incorrect. This is hardly surprising given the deviation from equal differences in the estimated coefficients for size-of-place categories in the row-effects model and, especially, the nonmonotonicity of the scores relative to my a priori ordering.

Row-and-Column-Effects Model I  Another analytic possibility is to treat both the row- and column-effects scores as unknown quantities to be estimated. However, in this case it is important to have the correct ordering of both the row and column categories because the results are not invariant under different orderings. For the Chinese example we have been exploring (the relationship between the size of the place of origin and educational attainment) this creates a bit of a dilemma. Is it better to reorder the size-of-place categories according to the scale scores derived from the row-effects model or to retain the a priori ordering derived from the Chinese administrative hierarchy? One possibility [...]

[...] 1999, 78); and as noted, BIC is not available. The optimal solution may be to treat clustered samples in a multilevel context, estimating either fixed- or random-effects models (Mason 2001), which can be done in Stata using the xt or gee command; these procedures go beyond what can be covered in this book, but see Chapter Sixteen for a brief introduction to multilevel analysis. Although even now much that is published, even in leading journals, simply ignores complex sample designs and treats data as if they were generated by random sampling procedures, this is generally inappropriate and can lead to incorrect inferences. For the present, I suggest for logistic regression in its various forms that when you have data that are weighted or clustered you carry out your estimation using Stata's survey estimation commands and rely on adjusted Wald tests for model selection. Be cautious, however, in your interpretation and explore alternative specifications. Only where you have a true, unweighted, random sample should you use the logistic command and likelihood ratio test (lrtest). Further, whenever possible, eschew weighting in favor of including the variables used to create the weights in the model.
[...] techniques for making the general shape of a distribution clear by removing "noisy" deviations from the underlying trend that result from sampling error or idiosyncratic factors. Perhaps the simplest smoother is a moving average. A moving average is the average value of several consecutive data points. Consider the worked example in this section. A three-year moving average of the expected probability of marriage at each age would be constructed by first taking the average of the expected probabilities for ages fifteen, sixteen, and seventeen; then the average of the expected probabilities for ages sixteen, seventeen, and eighteen; and so on. At the time the age-at-first-marriage example was created, the Stata subcommand ma ("moving average") was available within the egen command. However, this subcommand is no longer documented in Stata 10 (although it still works), and has been replaced by smooth, which generates medians of the included points rather than means. Another smoother available in Stata is lowess.
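The construction just described takes only a few lines in any language. The following Python sketch is an illustration of the general technique, not the Stata code in the book's do-file; the function name and the probabilities are hypothetical.

```python
def moving_average(values, window=3):
    """Means of each run of `window` consecutive values, with no padding at the ends."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and the number of values")
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


# Hypothetical expected probabilities of first marriage at ages 15 through 20.
ages = [15, 16, 17, 18, 19, 20]
expected_probability = [0.01, 0.02, 0.06, 0.05, 0.09, 0.08]

smoothed = moving_average(expected_probability, window=3)

# Each smoothed value is centered on the middle age of its three-year window.
for center_age, value in zip(ages[1:-1], smoothed):
    print(center_age, round(value, 3))
```

A series of n points yields n - 2 three-year averages, so the first and last ages have no smoothed value unless the ends are handled separately.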
[...] than Blacks (precisely, 0.591 = 0.190*3.108). Among 30-year-old never-married people, the odds of marrying in that year among those whose mothers are college graduates are only 10 percent higher than the odds for those of the same race and sex whose mothers are high school graduates (precisely, 1.094 = (0.918*1.114)^4).

Despite the usefulness of Table 13.6 for making specific contrasts, the overall pattern implied by the coefficients is difficult to discern. Again, graphs help. Figures 13.4 and 13.5 show three-year moving averages of the expected probability of first marriage by age at risk, separately for Blacks and non-Blacks. In each graph, separate lines are shown for males and females whose mothers had twelve and sixteen years of schooling (as a convenient way of visually representing the effect of mother's education). Moving averages are shown because there is a great deal of "float and bounce" for individual years, which is evident from inspection of the coefficients in Table 13.6. (See the downloadable file "ch13_2.do" for details on how the moving averages were constructed.)

Inspecting Figures 13.4 and 13.5, we see that marriage rates for Blacks differ substantially from those for non-Blacks, with Blacks much less likely than non-Blacks to marry at all. Moreover, non-Black females (especially those whose mothers have only a high school education) marry at disproportionately high rates at ages nineteen through twenty-five; non-Black males marry a bit later and with less concentration in a short period. Black marriage rates, by contrast, are spread out over a much longer period, but with an upsurge in marriage rates for males in their thirties, especially those whose mothers are high school educated. For both Blacks and non-Blacks, males tend to marry later than females, with male rates exceeding those of females beginning around age thirty. Finally, among all race-by-sex groups, those whose mothers are high school graduates are more likely to marry than are those whose mothers are college graduates.

[Figure 13.4. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Non-Black U.S. Adults, 1994. Horizontal axis: age at risk. Lines: Females (12), Males (12), Females (16), Males (16).]

[Figure 13.5. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Black U.S. Adults, 1994. Horizontal axis: age at risk. Lines: Females (12), Males (12), Females (16), Males (16).]

If I were preparing these results for publication, I would present only a subset of the rather large set of tables and graphs we have just marched through. The intent here, of course, is to provide alternatives for you to consider when presenting your own analyses. [...] of the application of discrete-time hazard-rate models include Astone and others (2000), Dawson (2000), Lewis and Oppenheimer (2000), and Sweeney (2002).
FOURTH WORKED EXAMPLE (CASE-CONTROL MODELS): WHO WAS APPOINTED TO A NOMENKLATURA POSITION IN RUSSIA?

When a dependent variable is a rare event, it is inefficient to draw a representative sample of the population at risk for the event, because the sample size would have to be extremely large to obtain enough "positive" cases to analyze. This is a frequent occurrence in epidemiological research, where the events of interest are diseases, but it also occurs in the social sciences. For example, if we are interested in studying what determines who gets elected to Congress, we could hardly do this by drawing a representative sample of the population and looking for the congressmen in it. We have similar problems in studying crime victimization, homosexuality, and various other relatively uncommon phenomena. One solution to this problem is to sample on the dependent variable (that is, to draw a sample of congressmen, criminals, or homosexuals), collect information on that sample, collect corresponding information on a representative sample of the population that has not experienced the rare event (becoming congressmen, criminals, or homosexuals), combine the two samples, and model the odds of experiencing the rare event. This is known as case-control sampling in the epidemiological literature (for an excellent review of the statistical procedures involved, see Breslow [1996]).

Case-control sampling exploits the fact that odds ratios are invariant under shifts in the distribution of the data. This extremely important feature of odds ratios makes it possible to combine samples with very different distributions on the independent and dependent variable in order to model rare events. This capability is not possible with OLS regression because OLS coefficients are affected by the distributions of the variables in the model.

To see how case-control procedures work in practice, let us consider what factors affect the odds of becoming a member of the Russian political elite at the end of the Communist era. From Social Stratification in Eastern Europe after 1989 (Treiman and Szelényi 1993), we have two representative samples from Russia: a probability sample of the general population (N = 5,002) and a random sample of persons who were in nomenklatura positions as of January 1988 (N = 850). (See Appendix A for a description of the data and information on how to obtain them.) Nomenklatura positions were those that required the approval of the Central Committee of the Communist Party. They ranged from very high government officials (for example, members of the Politburo) down to heads of sensitive organizations, for example, rectors of universities, editors in chief of newspapers, and heads of large industrial enterprises.

The general population sample departs in two ways from compliance with the assumptions underlying case-control sampling, but neither deviation is important from a practical standpoint. First, it is a probability sample of the 1993 population rather than the 1988 population. However, the sampling frame is based on the 1989 census, and the sample therefore probably represents the 1988 population nearly as well as it does [...]
Before turning to interpretation of the results, we should note the one difference between case-control analysis and ordinary binomial logistic regression: in case-control analysis the intercept is not meaningful. This should be obvious from the fact that the intercept in logistic regression indicates the proportion of the sample that is "positive" with respect to the dependent variable. However, in case-control designs this proportion is fixed by the sample design, and thus the coefficient adds no information.

Inspecting the coefficients in Table 13.7, we see very large effects and few surprises. Each year of schooling increases the odds of becoming a member of the nomenklatura by more than 70 percent. Thus, all else equal, university graduates (who typically have 15 years of schooling in Russia) are more than 15 times as likely as high school graduates (with 10 years of schooling) to be appointed to nomenklatura positions (precisely, $15.32 = 1.726^{(15-10)}$). The effect of gender is astronomical: males are more than 17 times as likely as females to be appointed to nomenklatura posts. The effect of age is also extremely strong: all else equal, the odds of being appointed to a nomenklatura position increase about 14 percent per year. Thus, for example, a 50-year-old is more than 7 times as likely to secure a nomenklatura position as is a 35-year-old (precisely, $7.23 = 1.141^{(50-35)}$). Perhaps more interesting, the effect of social origins, even among those equally well educated, is far from trivial. Coming from a family in which one's father was a member of the Communist Party improves one's chances of a nomenklatura appointment by about half, all else equal. Also, each year of father's schooling increases the odds of nomenklatura appointment by about 11 percent (this in the worker's paradise!), so that the offspring of the university-educated intelligentsia (15 years of school) are about three times as likely as the offspring of those with only a primary education to secure nomenklatura appointments, irrespective of their own educational achievement (precisely, $2.94 = 1.114^{(15-5)}$). Alone among the variables we have considered, father's occupational status has no impact on the odds of appointment to a nomenklatura post.
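Each of the "precisely" figures in this passage comes from raising a per-unit odds multiplier to the power of the difference in units. A minimal Python check is sketched below; the helper name is hypothetical, the multipliers are those quoted in the text, and the five years of primary schooling is inferred from the reported result of 2.94 rather than stated directly.

```python
def odds_multiplier(per_unit_multiplier, units):
    """Multiplier on the odds implied by a change of `units` in a predictor."""
    return per_unit_multiplier ** units


# Own schooling: university (15 years) versus high school (10 years).
schooling = odds_multiplier(1.726, 15 - 10)

# Age: a 50-year-old versus a 35-year-old.
age = odds_multiplier(1.141, 50 - 35)

# Father's schooling: 15 years versus an assumed 5 years of primary schooling.
father_schooling = odds_multiplier(1.114, 15 - 5)

print(f"schooling: {schooling:.2f}")
print(f"age: {age:.2f}")
print(f"father's schooling: {father_schooling:.2f}")
```

The multipliers compound because the model is linear in the log odds: a difference of k units adds k times the coefficient to the log odds, which is the per-unit multiplier raised to the power k.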
WHAT THIS CHAPTER HAS SHOWN

In this chapter we have seen how to estimate and interpret binary logistic regression models, which are widely used to model dichotomous outcomes such as whether people vote, are employed, or are members of a particular organization. We have seen that although the estimation procedures are quite different, the interpretation of the coefficients of such models is similar to that of OLS regression, except that the coefficients represent net effects of each independent variable on the log odds of an outcome.

Because log odds are not intuitive quantities, we have considered two nonlinear transformations to more readily interpretable coefficients (odds and expected probabilities) and have also seen how to graph net relationships, a form of regression standardization for logistic regression. Finally, we considered three extensions of the basic logistic regression model: education progression ratios, discrete-time hazard-rate models, and case-control models. A notable feature of logistic regression models is that they are invariant with respect to the distributions of variables in the sample, which is what makes case-control procedures legitimate in the logistic regression context but not in the OLS regression context.
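The invariance that justifies case-control sampling can be seen with a two-by-two table. The sketch below uses entirely hypothetical counts: it keeps every "positive" case but only one in a hundred of the others, and shows that the odds ratio is unchanged even though the proportion of positive cases changes greatly.

```python
from fractions import Fraction


def odds_ratio(exposed_cases, unexposed_cases, exposed_noncases, unexposed_noncases):
    """Odds ratio for a 2 x 2 table of exposure by case status."""
    return Fraction(exposed_cases * unexposed_noncases,
                    unexposed_cases * exposed_noncases)


def share_of_cases(exposed_cases, unexposed_cases, exposed_noncases, unexposed_noncases):
    """Proportion of all observations that are cases."""
    cases = exposed_cases + unexposed_cases
    return Fraction(cases, cases + exposed_noncases + unexposed_noncases)


# Hypothetical population: 100 cases among 100,100 people.
population = (80, 20, 30000, 70000)

# Case-control sample: all cases, but only 1 in 100 of the non-cases.
sample = (80, 20, 300, 700)

print("odds ratio:", odds_ratio(*population), "versus", odds_ratio(*sample))
print("share of cases:", float(share_of_cases(*population)),
      "versus", float(share_of_cases(*sample)))
```

Because the sampling fraction for non-cases multiplies both cells in that column, it cancels out of the odds ratio; it does not cancel out of the share of cases, which is why the intercept carries no information in a case-control design.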
APPENDIX 13.A SOME ALGEBRA FOR LOGS AND EXPONENTS

For those who have forgotten their school algebra, here are some useful [...] involving natural logarithms and antilogs (exponents):

$$e^{\ln(X)} = X$$
$$\ln(X \cdot Y) = \ln(X) + \ln(Y)$$
$$\ln(X / Y) = \ln(X) - \ln(Y)$$
$$X \cdot Y = e^{\ln(X)} e^{\ln(Y)} = e^{(\ln(X) + \ln(Y))}$$

[...]

[Table. Effect Parameters for a Model of the Determinants of English and Russian Language Competence in the Czech Republic, 1993 (N = 3,945). (Standard Errors in Parentheses; p-values in Italic.) Legible entries, three outcome columns in the order printed:]

Variable                                  (1)               (2)               (3)
Coefficients (b)
  Professional in 1988                    0.9943 (.1548)    1.124 (.2990)     1.3856 (.3577)
                                          .000              .000              .000
  [...]                                   5.5378 (.3021)    8.1541 (.5036)    10.1965 (.5866)
                                          .000              .000              .000
Odds multipliers
  Years of school completed               1.248             1.363             1.508
  Communist Party member?                 1.353             0.408             1.050
  Other manager in 1988                   2.702             2.38              [...]
[...] about a third higher than the odds that they spoke neither language, whereas the odds that Communist Party members spoke English but not Russian is only about 40 percent as great as the odds that they spoke neither language. Thus the odds that Communist Party members spoke Russian but not English are more than three times as great as the odds that they spoke English but not Russian (because 1.354/0.408 = 3.316). The same is true of service as a government or Communist Party official. Here, as expected, officials were nearly five times as likely to speak Russian only (in contrast to speaking neither Russian nor English) than were those who were neither managers nor professionals or technicians (recall that the reference category is all other occupations). The odds that government officials spoke English or both Russian and English are effectively zero, which they should be because not one of the sixteen officials in the sample spoke English. Finally, we see that being a professional or technician in 1988 roughly triples the odds of speaking Russian only or English only, and quadruples the odds of speaking both English and Russian, relative to speaking neither English nor Russian. By contrast, being a manager in 1988 nearly triples the odds of speaking Russian only, relative to speaking neither. But the effects of being a manager on the odds of speaking English or of speaking both English and Russian were both somewhat smaller than the effect of being a manager on the odds of speaking Russian. Also, the coefficients are only marginally significant, at about the 0.1 level.

Although for this example I settled on a single model in advance, model selection for multinomial logit models is carried out in exactly the same way as for binomial logit models: by taking the ratio of the difference in $L^2$s (Model $\chi^2$) to the difference in the degrees of freedom for any two models, to determine whether one model fits the data significantly better than the other model (but recall that this procedure is not appropriate when robust estimation is used, that is, when the data are weighted or clustered; rather, a Wald test should be used to compare models).
Independence of Irrelevant Alternatives

In the multinomial logit model, the relative odds of being in two categories are assumed to be independent of the other alternatives included in the model. This follows from Equation 14.1, from which we can derive the difference in log odds for two categories, a and c, as

[...] (14.2)
Note that only the two categories being compared enter the equation. If, however, the relative odds do depend on what the alternatives are, the model produces misleading estimates. To see this clearly, consider McFadden's (1974) well-known example of transportation choice. Suppose people can travel to work by bus or by car and that half choose to go by car and half by bus. Now suppose a competing bus company establishes buses with the same routes and schedule, so we no longer have, say, only blue buses but also red buses. Presumably, the half that traveled by car would continue to do so, but the half that traveled by bus would divide equally between the red and blue buses, taking whichever bus showed up first at the bus stop. Thus the odds ratio for car versus blue-bus ridership would change from 1:1 to 2:1, violating the assumption of the model.

Now consider another example. Suppose there are two restaurants in a neighborhood, a Mexican and an Italian restaurant, and that the Mexican restaurant gets 60 percent of the total business. Then a new Chinese restaurant opens in the neighborhood and draws off 20 percent of the business of the Mexican restaurant and 20 percent of the business of the Italian restaurant. The Mexican restaurant's share of the total is now 48 percent, and the Italian restaurant's share of the total is 32 percent. Here the independence-of-irrelevant-alternatives (IIA) assumption holds because 60/40 = 48/32 = 3/2.

Because the multinomial model is misleading when the IIA assumption is violated, McFadden suggests that multinomial (and conditional) logistic regression models should be estimated only when the outcome categories "can plausibly be assumed to be distinct and weighed independently in the eyes of each decision maker" (1974, 113).

A formal test of the IIA property is available, implemented in Stata 10.0 as suest ("seemingly unrelated estimation," a generalization of an earlier command, hausman). The suest test compares models that do and do not include presumably irrelevant outcomes. If the resulting parameters for the restricted and unrestricted models are similar, the additional outcomes can be assumed to be irrelevant. Applying these ideas to our current example, we might ask whether the odds that people speak English are affected by including "Russian" as an alternative in the model. In this case the test strongly suggests that the IIA condition is not satisfied. Thus we might consider estimating a sequential logit model in which we successively consider two questions: whether a respondent speaks either Russian or English versus speaking neither language, and for each of the two subsets of respondents (those speaking Russian and those speaking English) whether they speak the other language as well.

For further discussion of the IIA assumption and its consequences, see McFadden (1974), Hausman and McFadden (1984), Hoffman and Duncan (1988), Zhang and Hoffman (1993), Long (1997, 182-184), Powers and Xie (2000, 215-247), Long and Freese (2006), and the suest and hausman entries in StataCorp (2007). Additional examples of the application of multinomial logit models include [...] and Shields (1991), Haynes and Jacobs (1994), Tomaskovic-Devey and Skaggs (1999), and Breen and Jonsson (2000).
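The two examples can be reduced to a comparison of odds before and after a new alternative appears. The Python sketch below uses the shares given in the text; the function and variable names are illustrative only.

```python
from fractions import Fraction


def odds(shares, first, second):
    """Odds of choosing `first` rather than `second`, given market shares."""
    return Fraction(shares[first]) / Fraction(shares[second])


# Restaurants: the newcomer draws 20 percent from each incumbent.
restaurants_before = {"Mexican": Fraction(60), "Italian": Fraction(40)}
restaurants_after = {name: share * Fraction(4, 5)
                     for name, share in restaurants_before.items()}
restaurants_after["Chinese"] = 100 - sum(restaurants_after.values())

# Buses: the red bus draws only from blue-bus riders, not from drivers.
transport_before = {"car": Fraction(50), "blue bus": Fraction(50)}
transport_after = {"car": Fraction(50), "blue bus": Fraction(25), "red bus": Fraction(25)}

print("Mexican vs Italian:", odds(restaurants_before, "Mexican", "Italian"),
      "->", odds(restaurants_after, "Mexican", "Italian"))
print("car vs blue bus:", odds(transport_before, "car", "blue bus"),
      "->", odds(transport_after, "car", "blue bus"))
```

In the restaurant case the odds stay at 3/2, so IIA holds; in the bus case they move from 1 to 2, so IIA fails, because the new alternative is a close substitute for only one of the existing ones.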
ORDINAL LOGISTIC REGRESSION

Often in the social sciences we have ordinal dependent variables, where the response categories can be ordered on some dimension but where the distance between categories is unknown. Most attitude variables are of this sort. For example, if people are asked to say how happy they are, and the response categories include "very happy," "pretty happy," and "not too happy," there is no ambiguity in assuming that those who say they are "pretty happy" are less happy than those who say they are "very happy" and are more happy than those who say they are "not too happy." However, there is no basis for assuming that the distance between "not too happy" and "pretty happy" is the same as the distance between "pretty happy" and "very happy." Many other attitude scales have similar properties. In such cases we could predict the scale score using ordinary least squares regression. However, to do so would be tantamount to assuming that the distance between response categories is uniform. (For a fuller discussion of this and other points, see Winship and Mare [1984].)

An alternative is to estimate an ordinal logit equation, which makes use of the ordinal property of the response categories on the dependent variable but makes no assumptions at all about the relative distances between categories. The basic assumption of the ordinal logit model is that there is an unobserved continuous dependent variable, $Y^*$, which is a linear function of a set of independent variables:

$$Y^* = a + \sum_j b_j X_j + \varepsilon$$

However, what is observed is a set of ordered categories, $Y = 1 \ldots K$, such that

$Y = 1$ if $\kappa_0 \le Y^* < \kappa_1$; $2$ if $\kappa_1$ [...]