Quantitative Data Analysis: Doing Social Research to Test Ideas (Research Methods for the Social Sciences) [1 ed.] 0470380039, 9780470380031


English Pages 448 Year 2009


QUANTITATIVE DATA ANALYSIS
Doing Social Research to Test Ideas

Copyright by John Wiley & Sons, Inc. All rights reserved.

Published by Jossey-Bass
San Francisco, CA 94103, www.josseybass.com

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, 201-748-6011, fax 201-748-6008, or online at www.wiley.com/go/permissions.

Readers should be aware that Internet Web sites offered as citations or sources for further information may have changed or disappeared between the time this was written and when it is read.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Jossey-Bass books and products are available through most bookstores. To contact Jossey-Bass directly, call our Customer Care Department within the United States at (800) 956-7739, outside the United States at (317) 572-3986, or via fax at (317) 572-4002.

Jossey-Bass also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Library of Congress Cataloging-in-Publication Data


Treiman, Donald J.
Quantitative data analysis : doing social research to test ideas / Donald J. Treiman.
p. cm.
1. Social sciences--Research--Statistical methods. 2. Sociology--Research--Statistical methods. 3. Sociology--Statistical methods--Computer programs. 4. Social sciences--Statistical methods--Computer programs. 5. Stata. I. Title.
HA29.T675 2008
300.72--dc22

Printed in the United States of America
FIRST EDITION
PB Printing 10 9 8 7 6 5 4 3 2 1


CONTENTS

Tables, Figures, Exhibits, and Boxes




The Author


Introduction

CROSS-TABULATIONS
What This Chapter Is About
Introduction to the Book via a Concrete Example
Cross-Tabulations
What This Chapter Has Shown

MORE ON TABLES
What This Chapter Is About
The Logic of Elaboration
Suppressor Variables
Additive and Interaction Effects
Direct Standardization
A Final Note on Statistical Controls Versus Experiments
What This Chapter Has Shown

STILL MORE ON TABLES
What This Chapter Is About
Reorganizing Tables to Extract New Information
When to Percentage a Table "Backwards"
Cross-Tabulations in Which the Dependent Variable Is Represented by a Mean
Index of Dissimilarity
Writing About Cross-Tabulations
What This Chapter Has Shown

What This Chapter Is About
How Data Files Are Organized
Transforming Data
What This Chapter Has Shown
Appendix 4.A Doing Analysis Using Stata
Tips on Doing Analysis Using Stata
Some Particularly Useful Stata 10.0 Commands

INTRODUCTION TO CORRELATION AND REGRESSION (ORDINARY LEAST SQUARES)
What This Chapter Is About
Introduction
Quantifying the Size of a Relationship: Regression Analysis
Assessing the Strength of a Relationship: Correlation Analysis
The Relationship Between Correlation and Regression Coefficients
Factors Affecting the Size of Correlation (and Regression) Coefficients
Correlation Ratios
What This Chapter Has Shown

Introduction
A Worked Example: The Determinants of Literacy in China
Dummy Variables
A Strategy for Comparisons Across Groups
A Bayesian Alternative for Comparing Models
Independent Validation
What This Chapter Has Shown

Testing the Equality of Coefficients
Trend Analysis: Testing the Assumption of Linearity
Linear Splines
Expressing Coefficients as Deviations from the Grand Mean (Multiple Classification Analysis)
Other Ways of Representing Dummy Variables
Decomposing the Difference Between Two Means
What This Chapter Has Shown

MULTIPLE IMPUTATION OF MISSING DATA
What This Chapter Is About
Introduction
A Worked Example: The Effect of Cultural Capital on Educational Attainment in Russia
What This Chapter Has Shown

SAMPLE DESIGN AND SURVEY ESTIMATION
What This Chapter Is About
Survey Samples
Conclusion
What This Chapter Has Shown

REGRESSION DIAGNOSTICS
What This Chapter Is About
Introduction
A Worked Example: Societal Differences in Status Attainment
Robust Regression
Bootstrapping and Standard Errors
What This Chapter Has Shown

SCALE CONSTRUCTION
What This Chapter Is About
Introduction
Validity
Reliability
Scale Construction
Errors-in-Variables Regression
What This Chapter Has Shown


LOG-LINEAR ANALYSIS
What This Chapter Is About
Introduction
Choosing a Preferred Model
Parsimonious Models
A Bibliographic Note
What This Chapter Has Shown
Appendix 12.A Derivation of the Effect Parameters
Appendix 12.B Introduction to Maximum Likelihood Estimation
Mean of a Normal Distribution
Log-Linear Parameters

BINOMIAL LOGISTIC REGRESSION
What This Chapter Is About
Introduction
Relation to Log-Linear Analysis
A Worked Logistic Regression Example: Predicting Prevalence of Armed Threats
A Second Worked Example: Schooling Progression Ratios in Japan
A Third Worked Example (Discrete-Time Hazard-Rate Models): Age at First Marriage
A Fourth Worked Example (Case-Control Models): Who Was Appointed to a Nomenklatura Position in Russia?
What This Chapter Has Shown
Appendix 13.A Some Algebra for Logs and Exponents
Appendix 13.B Introduction to Probit Analysis

Ordinal Logistic Regression
Tobit Regression (and Allied Procedures) for Censored Dependent Variables
Other Models for the Analysis of Limited Dependent Variables
What This Chapter Has Shown

IMPROVING CAUSAL INFERENCE: FIXED EFFECTS AND RANDOM EFFECTS MODELING
What This Chapter Is About
Introduction
Fixed Effects Models for Continuous Variables
Random Effects Models for Continuous Variables
A Worked Example: The Determinants of Income in China
Fixed Effects Models for Binary Outcomes
A Bibliographic Note
What This Chapter Has Shown

RESEARCH DESIGN AND INTERPRETATION ISSUES
What This Chapter Is About
Research Design Issues
The Importance of Probability Sampling
A Final Note: Good Professional Practice
What This Chapter Has Shown

Appendix A: Data Descriptions and Download Locations for the Data Used in This Book
Appendix B: Survey Estimation with the General Social Survey






TABLES

1.1. Joint Frequency Distribution of Militancy by Religiosity Among Urban Negroes in the U.S., 1964.
1.2. Percent Militant by Religiosity Among Urban Negroes in the U.S., 1964.
1.3. Percentage Distribution of Religiosity by Educational Attainment, Urban Negroes in the U.S., 1964.
1.4. Percent Militant by Educational Attainment, Urban Negroes in the U.S., 1964.
1.5. Percent Militant by Religiosity and Educational Attainment, Urban Negroes in the U.S., 1964.
1.6. Percent Militant by Religiosity and Educational Attainment, Urban Negroes in the U.S., 1964 (Three-Dimensional Format).
2.1. Percentage Who Believe Legal Abortions Should Be Possible Under Specified Circumstances, by Religion and Education, U.S. 1965 (N = 1,368; Cell Frequencies in Parentheses).
2.2. Percentage Accepting Abortion by Religion and Education (Hypothetical Data).
2.3. Percent Militant by Religiosity, and Percent Militant by Religiosity Adjusting (Standardizing) for Religiosity Differences in Educational Attainment, Urban Negroes in the U.S., 1964 (N = 993).
2.4. Percentage Distribution of Beliefs Regarding the Scientific View of Evolution (U.S. Adults, 1993, 1994, and 2000).
2.5. Percentage Accepting the Scientific View of Evolution by Religious Denomination (N = 3,663).
2.6. Percentage Accepting the Scientific View of Evolution by Level of Education.
2.7. Percentage Accepting the Scientific View of Evolution by Age.
2.8. Percentage Distribution of Educational Attainment by Religion.
2.9. Percentage Distribution of Age by Religion.
2.10. Joint Probability Distribution of Education and Age.


2.11. Percentage Accepting the Scientific View of Evolution by Religion, Age, and Sex (Percentage Bases in Parentheses).
2.12. Observed Proportion Accepting the Scientific View of Evolution, and Proportion Standardized for Education and Age.
2.13. Percentage Distribution of Occupational Groups by Race, South African Males Age 20-69, Early 1990s (Percentages Shown Without Controls and Also Directly Standardized for Racial Differences in Educational Attainment; N = 4,004).
2.14. Mean Number of Chinese Characters Known (Out of 10), for Urban and Rural Residents Age 20-69, China 1996 (Means Shown Without Controls and Also Directly Standardized for Urban-Rural Differences in Distribution of Education; N = 6,081).
3.1. Frequency Distribution of Acceptance of Abortion by Religion and Education, U.S. Adults, 1965 (N = 1,368).
3.2. Social Origins of Nobel Prize Winners (1901-1972) and Other U.S. Elites (and, for Comparison, the Occupations of Employed Males 1900-1920).
3.3. Mean Annual Income in 1979 Among Those Working Full Time in 1980, by Education and Gender, U.S. Adults (Category Frequencies Shown in Parentheses).
3.4. Means and Standard Deviations of Income in 1979 by Education and Gender, U.S. Adults, 1980.
3.5. Median Annual Income in 1979 Among Those Working Full Time in 1980, by Education and Gender, U.S. Adults (Category Frequencies Shown in Parentheses).
3.6. Percentage Distribution Over Major Occupation Groups by Race and Sex, U.S. Labor Force, 1979 (N = 96,945).
3.7. Mean Number of Positive Responses to an Acceptance of Abortion Scale (Range: 0-7), by Religion, U.S. Adults, 2006.
6.1. Means, Standard Deviations, and Correlations Among Variables Affecting Knowledge of Chinese Characters, Employed Chinese Adults Age 20-69, 1996 (N = 4,802).
6.2. Determinants of the Number of Chinese Characters Correctly Identified on a Ten-Item Test, Employed Chinese Adults Age 20-69, 1996 (Standard Errors in Parentheses).
6.3. Coefficients of Models of Acceptance of Abortion, U.S. Adults, 1974 (Standard Errors Shown in Parentheses); N = 1,481.
6.4. Goodness-of-Fit Statistics for Alternative Models of the Relationship Among Religion, Education, and Acceptance of Abortion, U.S. Adults, 1973 (N = 1,499).
7.1. Demonstration That Inclusion of a Linear Term Does Not Affect Predicted Values.

7.2. Coefficients for a Linear Spline Model of Trends in Years of School Completed by Year of Birth, U.S. Adults Age 25 and Older, and Comparisons with Other Models (Pooled Data for 1972-2004, N = 39,324).
7.3. Goodness-of-Fit Statistics for Models of Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling, with Various Specifications of the Effect of the Cultural Revolution (Those Affected by the Cultural Revolution Are Defined as People Turning Age 11 During the Period 1966 through 1977), Chinese Adults Age 20 to 69 in 1996 (N = 6,086).
7.4. Coefficients for Models 4, 5, and 7 Predicting Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling (p Values in Parentheses).
7.5. Coefficients of Models of Tolerance of Atheists, U.S. Adults, [...] to 2004 (N = 4,299).
7.6. Design Matrices for Alternative Ways of Coding Categorical Variables (See Text for Details).
7.7. Coefficients for a Model of the Determinants of Vocabulary Knowledge, U.S. Adults, 1994 (N = 1,757; R² = .2445; Wald Test That Categorical Variables All Equal Zero: F = 12.48; p [...]
[...]iation Membership.
Frequency Distribution of Occupation by Father's Occupation, Chinese Adults, 1996.
Interaction Parameters for the Saturated Model Applied to Table 12.9.
Goodness-of-Fit Statistics for Alternative Models of Intergenerational Occupational Mobility in China (Six-by-Six Table).

Frequency Distribution of Educational Attainment by Size of Place of Residence at Age Fourteen, Chinese Adults Not Enrolled in School, 1996.
13.1. Percentage Ever Threatened by a Gun, by Selected Variables, U.S. Adults, 1973 to 1994 (N = 19,260).
13.2. Goodness-of-Fit Statistics for Various Models Predicting the Prevalence of Armed Threat to U.S. Adults, 1973 to 1994.
13.3. Effect Parameters for Models 2 and 4 of Table 13.2.
13.4. Goodness-of-Fit Statistics for Various Models of the Process of Educational Transition in Japan (Preferred Model Shown in Boldface).
13.5. Effect Parameters for Model 3 of Table 13.4.
13.6. Odds Ratios for a Model Predicting the Likelihood of Marriage from Age at Risk, Sex, Race, and Mother's Education, with Interactions Between Age at Risk and the Other Variables.
13.7. Coefficients for a Model of Determinants of Nomenklatura Membership, Russia, 1988.
Effect Parameters for a Probit Analysis of Gun Threat (Corresponding to Models 2 and 4 of Table 13.3).
14.1. Effect Parameters for a Model of the Determinants of English and Russian Language Competence in the Czech Republic, 1993 (N = 3,945). (Standard Errors in Parentheses; p Values in Italic.)
14.2. Effect Parameters for an Ordered Logit Model of Political Party Identification, U.S. Adults, 1998 (N = 2,443).
14.3. Predicted Probability Distributions of Party Identification for Black and Non-Black Males Living in Large Central Cities of Non-Southern SMSAs and Earning $40,000 to $50,000 per Year.
14.4. Effect Parameters for a Generalized Ordered Logit Model of Political Party Identification, U.S. Adults, 1998.
14.5. Effect Parameters for an Ordinary Least-Squares Regression Model of Political Party Identification, U.S. Adults, 1998.
14.6. Codes for Frequency of Sex in the Past Year, U.S. Adults, 2000.
14.7. Alternative Estimates of a Model of Frequency of Sex, U.S. Adults, 2000 (N = 2,258). (Standard Errors in Parentheses; All Coefficients Are Significant at .001 or Beyond.)
15.1. Socioeconomic Characteristics of Chinese Adults by Size of Place of Residence, 1996.
15.2. Comparison of OLS and FE Estimates for a Model of the Determinants of Family Income, Chinese RMB, 1996 (N = 5,342).
15.3. Comparison of OLS and FE Estimates for a Model of the Effect of Migration and Remittances on South African Black Children's School Enrollment, 2002 to 2003. (N(FE) = 2,408 Children; N(full RE) = 12,043 Children.)


FIGURES

2.1. The Observed Association Between X and Y Is Entirely Spurious and Goes to Zero When Z Is Controlled.
2.2. The Observed Association Between X and Y Is Partly Spurious: the Effect of X on Y Is Reduced When Z Is Controlled (Z Affects X and Both Z and X Affect Y).
2.3. The Observed Association Between X and Y Is Entirely Explained by the Intervening Variable Z and Goes to Zero When Z Is Controlled.
2.4. The Observed Association Between X and Y Is Partly Explained by the Intervening Variable Z: the Effect of X on Y Is Reduced When Z Is Controlled (X Affects Z, and Both X and Z Affect Y).
2.5. Both X and Z Affect Y, but There Is No Assumption Regarding the Causal Ordering of X and Z.
2.6. The Size of the Zero-Order Association Between X and Y (and Between Z and Y) Is Suppressed When the Effects of X on Z and Y Have Opposite Sign, and the Effects of X and Z on Y Have Opposite Sign.
4.1. An IBM Punch Card.
5.1. Scatter Plot of Years of Schooling by Father's Years of Schooling (Hypothetical Data; N = 10).
5.2. Least-Squares Regression Line of the Relation Between Years of Schooling and Father's Years of Schooling.

5.3. Least-Squares Regression Line of the Relation Between Years of Schooling and Father's Years of Schooling, Showing How the "Error of Prediction" or "Residual" Is Defined.
5.4. Least-Squares Regression Lines for Three Configurations of Data: (a) [...] Independence, (b) Perfect Correlation, and (c) Perfect Curvilinear Correlation (a Parabola Symmetrical to the X-Axis).
5.5. The Effect of a Single Deviant Case (High Leverage Point).
5.6. Truncating Distributions Reduces Correlations.
5.7. The Effect of Aggregation on Correlations.
6.1. Three-Dimensional Representation of the Relationship Between Number of Siblings, Father's Years of Schooling, and Respondent's Years of Schooling (Hypothetical Data; N = 10).
6.2. Expected Number of Chinese Characters Identified (Out of Ten) by Years of Schooling and Gender, Urban Origin Chinese Adults Age 20 to 69 in 1996 with Nonmanual Occupations and with Years of Father's Schooling and Level of Cultural Capital Set at Their Means (N = 4,802). (Note: the female line does not extend beyond 16 because there are no females in the sample with post-graduate education.)
6.3. Acceptance of Abortion by Education and Religious Denomination, U.S. Adults, 1974 (N = 1,481).
7.1. The Relationship Between 2003 Income and Age, U.S. Adults Age Twenty to Sixty-Four in 2004 (N = 1,573).
7.2. Expected ln(Income) by Years of School Completed, U.S. Males and Females, 2004, with Hours Worked per Week Fixed at the Mean for Both Sexes Combined (42.7; N = 1,459).
7.3. Expected Income by Years of School Completed, U.S. Males and Females, 2004, with Hours Worked per Week Fixed at the Mean for Both Sexes Combined (42.7).
7.4. Trend in Attitudes Regarding Gender Equality, U.S. Adults Surveyed in 1974 Through 1998 (Linear Trend and Annual Means; N = 21,464).
7.5. Years of School Completed by Year of Birth, U.S. Adults (Pooled Samples from the 1972 Through 2004 GSS; N = 39,324; Scatter Plot Shown for 5 Percent Sample).
7.6. Mean Years of Schooling by Year of Birth, U.S. Adults (Same Data as for Figure 7.5).
7.7. Three-Year Moving Average of Years of Schooling by Year of Birth, U.S. Adults (Same Data as for Figure 7.5).
7.8. Trend in Years of School Completed by Year of Birth, U.S. Adults (Same Data as for Figure 7.5). Predicted Values from a Linear Spline with a Knot at 1947.


7.9. Graphs of Three Models of the Effect of the Cultural Revolution on Vocabulary Knowledge, Holding Constant Education (at Twelve Years), Chinese Adults, 1996 (N = 6,086).
7.10. Figure 7.9 Rescaled to Show the Entire Range of the Y-Axis.
10.1. Four Scatter Plots with Identical Lines.
10.2. Scatter Plot of the Relationship Between X and Y and Also the Regression Line from a Model That Incorrectly Assumes a Linear Relationship Between X and Y (Hypothetical Data).
10.3. Years of School Completed by Number of Siblings, U.S. Adults, 1994 (N = 2,992).
10.4. Years of School Completed by Number of Siblings, U.S. Adults, 1994.
10.5. A Plot of Leverage Versus Squared Normalized Residuals for Equation 7 in Treiman and Yip (1989).
10.6. A Plot of Leverage Versus Studentized Residuals for Treiman and Yip's Equation 7, with Circles Proportional to the Size of Cook's D.
10.7. Added-Variable Plots for Treiman and Yip's Equation 7.
10.8. Residual-Versus-Fitted Plot for Treiman and Yip's Equation 7.
10.9. Augmented Component-Plus-Residual Plots for Treiman and Yip's Equation 7.
10.10. Objective Functions for Three M Estimators: (a) OLS Objective Function, (b) Huber Objective Function, and (c) Bi-Square Objective Function.
10.11. Sampling Distributions of Bootstrapped Coefficients (2,000 Repetitions) for the Expanded Model, Estimated by Robust Regression on Seventeen Countries.
11.1. Loadings of the Seven Abortion-Acceptance Items on the First Two Factors, Unrotated and Rotated 30 Degrees Counterclockwise.
13.1. Expected Probability of Marrying for the First Time by Age at Risk, U.S. Adults, 1994 (N = 1,556).
13.2. Expected Probability of Marrying for the First Time by Age at Risk (Range: Fifteen to Thirty-Six), Discrete-Time Model, U.S. Adults, 1994.
13.3. Expected Probability of Marrying for the First Time by Age at Risk (Range: Fifteen to Thirty-Six), Polynomial Model, U.S. Adults, 1994.
13.4. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Non-Black U.S. Adults, 1994.
13.5. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Black U.S. Adults, 1994.
13.B.1. Probabilities Associated with Values of Probit and Logit Coefficients.
14.1. Three Estimates of the Expected Frequency of Sex per Year, U.S. Married Women, 2000 (N = 552).
14.2. Expected Frequency of Sex per Year by Gender and Marital Status, U.S. Adults, 2000 (N = 2,258).
16.1. 1980 Male Disability by Quarter of Birth (Prevented from Work by a Physical Disability).
16.2. Blau and Duncan's Basic Model of the Process of Stratification.

EXHIBITS

4.1. Illustration of How Data Files Are Organized.
4.2. A Codebook Corresponding to Exhibit 4.1.


BOXES

Open-Ended Questions
Samuel A. Stouffer
Technical Points on Table 1.1
Technical Points on Table 1.2
Technical Points on Table 1.3
Technical Points on Table 1.4
Technical Points on Table 1.5
Technical Points on Table 1.6
Paul Lazarsfeld
Hans Zeisel
Stata -do- Files and -log- Files
Direct Standardization in Earlier Survey Research
The Weakness of Matching and a Useful Fix
Technical Points on Table 3.3
Substantive Points on Table 3.3
A Historical Note on Social Science Computer Packages
Herman Hollerith
The Way Things Were
Treating Missing Values as If They Were Not
People Generally Like to Respond to (Well-Designed and Well-Administered) Surveys
Why Use the "Least Squares" Criterion to Determine the Best-Fitting Line?
Karl Pearson
A Useful Computational Formula for r
A "Real Data" Example of the Effect of Truncating the Distribution
A Useful Computational Formula for η²
Multicollinearity
Reminder Regarding the Variance of Dichotomous Variables
A Formula for Computing R² from Correlations
Adjusted R²
Always Present Descriptive Statistics
Technical Point on Table 6.2
Why You Should Include the Entire Sample in Your Analysis
Getting p-values via Stata
Using Stata to Compare the Goodness-of-Fit of Regression Models
R. A. (Ronald Aylmer) Fisher
How to Test the Significance of the Difference Between Two Coefficients
Alternative Ways to Estimate BIC
Why the Relationship Between Income and Age Is Curvilinear
A Trick to Reduce Collinearity
In Some Years of the GSS, Only a Subset of Respondents Was Asked Certain Questions
An Alternative Specification of Spline Functions
Why Black versus Non-black Is Better Than White versus Non-white for Social Analysis in the United States
A Comment on Credit in Science
Why Pairwise Deletion Should Be Avoided
Technical Details on the Variables
Telephone Surveys
Mail Surveys
Web Surveys
Philip M. Hauser
A Superior Sampling Procedure
Sources of Nonresponse
Leslie Kish
How the Chinese Stratified Sample Used in the Design Experiments Was Constructed
Weighting Data in Stata
Limitations of the Stata 10.0 Survey Estimation Procedure
An Alternative to Survey Estimation
How to Downweight Sample Size in Stata
Ways to Assess Reliability
Why the SAT and GRE Tests Include Several Hundred Items
Transforming Variables so That "High" Has a Consistent Meaning
Constructing Scales from Incomplete Information
In Log-Linear Analysis "Interaction" Simply Means "Association"
L² Defined
Other Software for Estimating Log-Linear Models
Maximum Likelihood Estimation
Probit Analysis
Technical Point on Table 13.1
Limitations of Wald Tests
Smoothing Distributions
Estimating Generalized Ordered Logit Models with Stata
James Tobin
Panel Surveys in the Public Domain
Otis Dudley Duncan
Sewell Wright
Ask a Foreigner to Do It
George Peter Murdock
In the United States, Publicly Funded Studies Must Be Made Available to the Research Community
An "Available from Author" Archive

PREFACE

This is a book about how to conduct theoretically informed quantitative social research; that is, social research to test ideas. It derives from a course for graduate students in sociology and other social sciences and social science-based professional schools (public health, education, social welfare, urban planning, and so on) that I have been teaching at UCLA for some thirty years. The course has evolved as quantitative methods in the social sciences have advanced; early versions of the course were based on the first half of this book (through Chapter Seven), with additional materials added over the years. Interestingly, I have been able to retain the same format (a twenty-week course with one three-hour lecture per week and a weekly exercise, culminating in a term paper written during the last four weeks of the course) from the outset, which is, I suppose, a tribute to the increasing level of preparation and quantitative competence of graduate students in the social sciences. The book owes much to lively class discussions over the years, of often subtle and complex methodological points.

By the end of the book, you should know how to make substantive sense of a body of quantitative data. That is, you should be well prepared to produce publishable papers in your field, as well as first-rate dissertation chapters. Of course, there is always more to learn. In the final chapter (Chapter Sixteen), I discuss advanced topics that go beyond what can be covered in a first course in data analysis.

The focus is on the analysis of data from representative samples of well-defined populations, although some exceptions are considered. The populations can consist of almost anything (people, formal organizations, societies, occupations, pottery shards, or whatever); the analytic issues are essentially the same. Data collection procedures are mentioned only in passing. There simply is not enough space in an already lengthy book to do justice to both data analysis and data collection. Thus, you will need to look elsewhere for systematic instruction on data-collection procedures. A strong case can be made that you should do this after rather than before a course on data analysis because the main problem in designing a data collection effort is deciding what to collect, which means you first need to know how you will conduct your analysis. An alternative method of learning about the practical details of data collection is to become an apprentice (unpaid, if necessary) to someone who is about to conduct a survey and insist that you get to participate in it step-by-step even when your presence is a nuisance.

This book covers a variety of techniques, including tabular analysis, log-linear models for tabular data, regression analysis in its various forms, regression diagnostics and robust regression, ways to cope with missing data, logistic regression, factor-based and other techniques for scale construction, and fixed- and random-effects models as a way to make causal inferences. But this is not a statistics book; the emphasis is on using these procedures to draw substantive conclusions about how the social world works. Accordingly, the book is designed for a course to be taken after a first-year graduate statistics course in the social sciences. Although there are many equations in the book, this is because it is necessary to understand how statistical procedures work to use them intelligently. Because the emphasis is on applications, there are many worked examples, often adapted from my own research. In addition to data from sample surveys I have conducted, I also rely heavily on the General Social Survey, an omnibus survey designed for use by the research community and also for teaching. Appendix A describes the main data sets used for the substantive examples and provides information on how to obtain them; they are all available without cost.

The only prerequisites for successful use of this book are a prior graduate-level social science statistics course, a willingness to think carefully and work hard, and the ability to do high school algebra, either remembered or relearned. With only a handful of exceptions (references at one or two points to calculus and to matrix algebra), no mathematics beyond high school algebra is used. If your high school algebra is rusty, you can find good reviews in Helen Walker, Mathematics Essential for Elementary Statistics, and W. L. Bashaw, Mathematics for Statistics. These books have been around forever. Although more recent equivalents probably exist, school algebra has not changed, so it hardly matters. Copies of these books are readily available at amazon.com, and probably many other places as well.

The statistical software package used in this book is Stata (release 10). Downloadable command files (-do- files in Stata's terminology), files of results (-log- files), and ancillary computer files used in the computations are available at www.josseybass.com/go/quantitativedataanalysis. Often the details underlying particular computations are only found in the downloadable -do- and -log- files, so be sure to download and study them carefully. These files will be updated as new releases of Stata become available.

I use Stata in my teaching and in this book because it has very rapidly become the statistical package of choice in leading sociology and economics departments. This is not accidental. Stata is a fast and efficient package that includes most of the statistical procedures of interest to social scientists, and new commands are being added at a rapid pace. Although many statistical packages are available, the three leading contenders currently are Stata, SPSS, and SAS. As software, Stata is clearly superior to SPSS: it is faster, more accurate, and includes a wider range of applications. SAS, although very powerful, is not nearly as intuitive as Stata and is more difficult to learn (and to teach). Nonetheless, this book can be readily used in conjunction with either SPSS or SAS, simply by translating the syntax of the Stata -do- files. (I have done something like this, exploiting Allison's excellent, but SAS-based, exposition of fixed- and random-effects models [Allison 2005] by writing the corresponding Stata code.)

FOR INSTRUCTORS

Some notes on how I have used these materials in teaching may be helpful to you as you design your own course.

As noted previously, the course on which this book is based runs for two quarters (twenty weeks). I have offered one three-hour lecture per week and have assigned an exercise every week. When I first taught the course, I read these exercises myself, but as enrollments have increased, I have enjoyed the services of a T.A. (chosen from among students who had done well in the course in previous years), who assists students with the [...] of computing and statistics and also reads and comments on the exercises. In recent years, I have offered seventeen lectures and have assigned exercises for all but the [...], with the final month of the course devoted to producing two drafts of a term paper. I read the first drafts and write comments, in an attempt to emulate the journal submission process. Thus, in my course, everyone gets a "revise and resubmit" response. I encourage students to develop their term papers in the course of doing the exercises and to complete their drafts in the two weeks after the last exercise is due.

The initial exercises are designed to lead students in a guided way through the mechanics of analysis, and some of the later exercises do this as well. But the exercises increasingly take a free form: "carry out an analysis like that presented in the book." Illustrative answers are provided for those exercises that involve definitive answers, that is, that are similar to statistics problem sets.

The course syllabus, weekly exercises, and illustrative answers to those exercises for which I have written illustrative answers are available for downloading from www.josseybass.com/go/quantitativedataanalysis

ACKNOWLEDGMENTS

As noted earlier, this book has been developed in interaction with many cohorts of graduate students at UCLA who have wrestled with each of the chapters included here and have revealed troubles in the exposition, sometimes by way of explicit comments and sometimes via displays of confusion. The book would not exist without them, as I never imagined myself writing a textbook, and so I owe them great thanks. One in particular, Pamela Stoddard, literally caused the book to be published in its current form by mentioning in the course of a chance airplane conversation with Andrew Pasternack, a Jossey-Bass acquisitions editor, that her professor was thinking of publishing the chapters he used as a course text. Andy contacted me, and the rest is history.

The course on which this book is based first came into being through collaboration with my colleague Jonathan Kelley, when he was a visiting professor at UCLA in the [...]. The first exercise is borrowed from him, and the general thrust of the course, especially the first half, owes much to him.

My colleague, Bill Mason, recently retired from the UCLA Sociology and Statistics departments, has been my statistical guru for many years. Often I have turned to him for insights into difficult statistical issues. And much that I have learned about topics that were not part of the curriculum when I was a graduate student has been from sitting in on [...] statistics courses offered by Bill. Another colleague, Rob Mare, has been helpful in much the same way. My new colleague, Jennie Brand, who took over my quantitative data analysis course in the fall of 2008, has read the entire manuscript and has offered [...] helpful suggestions. Finally, the book has benefited greatly from very careful readings by a group of about 100 Chinese students, to whom I gave a special version of the course in an intensive summer session at Beijing University in July 2008. They caught many errors that had gone unnoticed and raised often subtle points that resulted in the reworking of selected portions of the text.

My understanding of research design and statistical issues, especially concerning causality and threats to causal inference, has benefited greatly from the weekly seminar of the California Center for Population Research, which brings together sociologists, economists, and other social scientists to listen to, and comment on, presentations of work in progress, mainly by visitors from other campuses. The lively and wide-ranging discussion has been something of a floating tutorial, a realization of what I have imagined academic life could and should be like.

Finally, my wife, Judith Herschman, has displayed endless patience, only occasionally asking, "When are you going to finally publish your methods book?"

THE AUTHOR

Donald J. Treiman is distinguished professor of sociology at the University of California at Los Angeles (UCLA) and was until recently director of UCLA's California Center for Population Research. He has a BA from Reed College (1962) and an MA and PhD from the University of Chicago (1967). As a graduate student at Chicago, he spent most of his time at the National Opinion Research Center (NORC), where he gained valuable training and experience in survey research. He then taught at the University of Wisconsin, where he decided that he really was a social demographer at heart, and made the Center for Demography and Ecology his intellectual home. From Wisconsin, he moved to Columbia University and then, in 1975, to UCLA, where he has been ever since, albeit with extended sojourns elsewhere, as staff director of a study committee at the National Academy of Sciences/National Research Council (1978-1981) and fellowship years at the U.S. Bureau of the Census (1987-1988), the Center for Advanced Study in the Behavioral Sciences (1992-1993), and the Netherlands Institute for Advanced Study in the Humanities and Social Sciences (1996-1997).

Professor Treiman started his career as a student of social stratification and status attainment, particularly from a cross-national perspective, and this has remained a continuing interest. He and his Dutch colleague, Harry Ganzeboom, have been engaged in a long-term cross-national project to analyze variations in the status attainment process across nations throughout the world over the course of the twentieth century. To date, they have compiled an archive of more than 300 sample surveys from more than 50 nations, ranging through the last half of the century. In addition to his comparative project, Professor Treiman has conducted large-scale national probability sample surveys in South Africa (1991-1994), Eastern Europe (1993-1994), and China (1996), all concerned with various aspects of social inequality.

His current research has moved in a more demographic direction. He has a national probability sample survey currently in progress in China, which focuses on the determinants, dynamics, and consequences of internal migration.

INTRODUCTION

It is not uncommon for statistics courses taken by graduate students in the social sciences to be treated essentially as mathematics courses, with substantial emphasis on derivations and proofs. Even when empirical examples are used, which they frequently are because [...]

[...] 1947 and = 0 otherwise. Then for those born in 1947 or earlier,

$$\hat{E} = a + b_1(X) + b_2(0) = a + b_1(X)$$

while for those born later than 1947

$$\hat{E} = a + b_1(X) + b_2(X - 47)$$

Thus, for those born in 1948, the expected level of education is given by $(a + 48b_1) + b_2$; for those born in 1949 it is $(a + 49b_1) + 2b_2$; and so on. From this, it is evident that $b_2$ gives the deviation of the slope from the slope for the previous line segment. For useful discussions of these methods, see Smith (1979) and Gould (1993).
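The marginal coding just described can be sketched in a few lines of Python (not the Stata used in the book). This is a minimal illustration under stated assumptions: the function names and the intercept are hypothetical, and only the two slopes (.086 up to the knot, .0092 after it) are taken from the discussion of the spline results in the text.

```python
def spline_terms(birth_year, knot=47):
    """Return (X, X2) for a two-segment linear spline with marginal coding.

    X is the two-digit birth year. X2 is 0 up to and including the knot and
    (X - knot) afterward, so its coefficient is the CHANGE in slope.
    """
    x2 = birth_year - knot if birth_year > knot else 0
    return birth_year, x2


def expected_schooling(birth_year, a, b1, b2, knot=47):
    """Expected years of schooling: a + b1*X + b2*X2."""
    x, x2 = spline_terms(birth_year, knot)
    return a + b1 * x + b2 * x2


# Illustrative values: the two slopes are those reported in the text;
# the intercept A is a made-up placeholder, not an estimate from the book.
A = 8.0
B1 = 0.086                 # slope for those born in 1947 or earlier
B2 = 0.0092 - 0.086        # deviation of the later slope from the earlier one

# Cohort-to-cohort gain after the knot equals b1 + b2.
gain_after_knot = (expected_schooling(49, A, B1, B2)
                   - expected_schooling(48, A, B1, B2))
```

With this coding the slope after the knot is recovered as `B1 + B2`; a parameterization that estimates each segment's slope directly would report the second slope itself rather than its deviation.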

Multiple Regression Tricks: Techniques for Handling Special Analytic Problems


TABLE 7.2. Coefficients for a Linear Spline Model of Trends in Years of School Completed by Year of Birth, U.S. Adults Age 25 and Older, and Comparisons with Other Models (Pooled Data for 1972-2004).

[Table body not recoverable from the source. Surviving entries: slope for birth years 1947-1979 = .0092; the model comparisons include a linear trend model (2) and a contrast of models (1) vs. (2).]

[...] clearly inferior by the BIC criterion, and occurs simply as a consequence of the large sample size. Thus, I accept the linear spline model as the preferred model.

The coefficients for the line segments indicate that for people born in 1947 or earlier, there is an expected increase of .086 years of schooling for each successive birth cohort. Thus, people born twelve years apart would be expected to differ on average by about a year of schooling. However, for people born in 1947 or later, there is virtually no trend in educational attainment; the coefficient .0092 implies that it would take about a century for average schooling to increase by one year. This is a somewhat surprising result, especially because there have been su[...] disadvantaged minorities, that is, [...], and also among women. However, as Mare (1995) notes, educationally disadvantaged proportions of the population have grown over time relative to the White majority. Disaggregation of the trend would be worthwhile but cannot be pursued here; it would make an interesting paper.

The graph implied by the coefficients for the linear spline model is shown in Figure 7.8, together with a 2 percent random sample of observations from each cohort (reduced from 5 percent to 2 percent to make it easier to see the shape of the spline). In this figure the -jitter- feature in Stata is used to make it clear where in the graph there is the greatest density of points.
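The interpretation of the two slopes rests on simple arithmetic: the number of birth cohorts needed for a one-year gain in schooling is the reciprocal of the slope. A short Python sketch (the helper name is hypothetical; the slopes are the ones quoted in the text):

```python
def cohorts_per_extra_year(slope):
    """Number of successive birth cohorts needed for expected schooling
    to rise by one year, given a per-cohort slope."""
    if slope <= 0:
        raise ValueError("slope must be positive for schooling to increase")
    return 1.0 / slope


early = cohorts_per_extra_year(0.086)    # cohorts born 1947 or earlier
late = cohorts_per_extra_year(0.0092)    # cohorts born after 1947
```

The first value is roughly twelve years and the second roughly a century, which is the contrast the paragraph above draws.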


FIGURE 7.8. Trend in Years of School Completed by Year of Birth, U.S. Adults (Same Data as for Figure 7.5; Scatter Plot Shown for 2 Percent Sample). Predicted Values from a Linear Spline with a Knot at 1947. [Horizontal axis: year of birth.]

A Second Worked Example, with a Discontinuity: Quality of Education in China Before, During, and After the Cultural Revolution

The typical use of spline functions is to estimate equations such as the one just discussed in which all points are connected but the slope changes at specified points ("knots"). However, there are occasions in which we may want to posit discontinuous functions. The Chinese Cultural Revolution is such a case. It can be argued that the disruption of social order at the beginning of the Cultural Revolution in 1966 was so massive that it is inappropriate to assume any continuity in trends. Deng and Treiman (1997) make just such an argument with respect to trends in educational reproduction. They argue that there was then a gradual "return to normalcy" so that changes resulting from the end of the Cultural Revolution in 1977 were not nearly as sharp and were appropriately represented by a knot in a spline function rather than a break in the trend line.

Here we consider another consequence of the Cultural Revolution, the quality of education received (the example is adapted from Treiman [2007a]). Although primary schools remained open throughout the Cultural Revolution, higher level schools were shut down for varying periods: most secondary schools were closed for two years, from 1966 to 1968, and most universities and other tertiary level institutions were closed for six years, from 1966 to 1972. Moreover, it was widely reported that even when the schools were open, little conventional instruction was offered: rather, school hours were taken up with political meetings and political indoctrination. Rigorous academic instruction was not fully reinstituted until 1977, after the death of Mao. Under the circumstances, we might well suspect that, quite apart from deficits in the amount of schooling acquired by those who were unfortunate enough to be of school age during the Cultural Revolution period, those cohorts also experienced deficits in the quality of schooling compared to those who obtained an equal amount of schooling before or after the Cultural Revolution.

To test this hypothesis, we can exploit the ten-item character recognition test administered to a national sample of Chinese adults that was also analyzed in Chapter Six (see Table 6.2). As before, I take the number of characters correctly identified as a measure of literacy and hypothesize that, net of years of school completed, people who turned age eleven during the Cultural Revolution would be able to recognize fewer characters than people who turned eleven before or after the Cultural Revolution period. Moreover, following Deng and Treiman (1997), I posit a discontinuity in the scores at the beginning but not at the end of the period. To do this, I estimate an equation of the form:

$$\hat{C} = a + b_1(B_1) + b_2(B_2) + b_3(B_3) + c_1(D_1) + d(E)$$

where $\hat{C}$ is the expected number of characters recognized and $E$ is years of schooling; $B_1$ = year of birth (last two digits) if born prior to or in 1955 and = 55 if born subsequent to 1955; $B_2$ = 0 if born prior to 1956, = year of birth − 55 if born between 1956 and 1967, inclusive, and = 67 − 55 if born subsequent to 1967; $B_3$ = 0 if born prior to or in 1967 and = year of birth − 67 for those born after 1967; and $D_1$ = 0 for those born prior to or in 1955 and = 1 for those born after 1955. Note that the difference between this and Equation 7.35 is that I include a dummy variable to distinguish those born after 1955 from those born earlier; this is what permits the line segments to be discontinuous at 1955. If I were to have posited a discontinuity at 1967 as well, the equation would be the mathematical equivalent to estimating three separate equations, for the period before, during, and after the Cultural Revolution, in each case predicting the number of characters recognized from years of schooling and year of birth. The advantage of equations such as Equation 7.38 is that they permit the specification of alternative models within a coherent framework and by so doing permit us to select between models.

Estimating this equation yields the results shown for Model 4 in Tables 7.3 and 7.4. As in the previous example, I contrast my theory-driven specification with other possibilities: that there is a simple linear trend in the data; that there are year-by-year variations; that there are knots at both the beginning and the end of the Cultural Revolution, but no discontinuities; that there are discontinuities at both the beginning and the end of the Cultural Revolution; and, for the three spline functions, that there is a curvilinear relationship between year and knowledge of characters during the Cultural Revolution period.
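The construction of the spline and discontinuity variables defined above can be sketched in Python. This is only an illustration of the variable definitions given in the text; the function name is hypothetical, and the book's own computations are done in Stata.

```python
def cultural_revolution_terms(birth_year):
    """Return (B1, B2, B3, D1) for a two-digit birth year.

    B1: year of birth, capped at 55 (segment for those born 1955 or earlier)
    B2: years past 1955, capped at 67 - 55 (segment for 1956-1967)
    B3: years past 1967 (segment for those born after 1967)
    D1: 1 if born after 1955, else 0 (permits a break in the line at 1955)
    """
    b1 = min(birth_year, 55)
    b2 = min(max(birth_year - 55, 0), 67 - 55)
    b3 = max(birth_year - 67, 0)
    d1 = 1 if birth_year > 55 else 0
    return b1, b2, b3, d1


# One row per illustrative cohort, keyed by two-digit birth year.
examples = {year: cultural_revolution_terms(year) for year in (50, 55, 56, 67, 70)}
```

Because the three segment variables always sum to the birth year, dropping the dummy variable returns the model to a continuous spline with knots at 1955 and 1967.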


TABLE 7.3. Goodness-of-Fit Statistics for Models of Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling, with Various Specifications of the Effect of the Cultural Revolution (Those Affected by the Cultural Revolution Are Defined as People Turning Age 11 During the Period 1966 through 1977), Chinese Adults Age 20 to 69 in 1996 (N = 6,086).

[Table body not recoverable from the source.]

TABLE 7.4. Coefficients for Models 4, 5, and 7 Predicting Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling (p-Values in Parentheses).

Row labels: years of schooling; 1955 or earlier (age 11 in 1965 or earlier); 1956-1967 (age 11 in 1966-1977); 1968 or later (age 11 in 1978 or later); discontinuity at 1955; discontinuity at 1967; s.e.e. (root mean square error).

Years of schooling: .443 (.000); .443 (.000); .444 (.000)
1955 or earlier: 0.001 (.721); 0.001 (.134); 0.001 (.749)

[The placement of the remaining entries cannot be recovered from the source: 0.043 (.000), 0.032 (.000), 0.041 (.000), -0.557 (.000), -0.508 (.000), -0.041 (.185), 0.028 (.012), -0.349, 0.241 (.010), 0.0066 (.009).]

A comparison of the BICs suggests that three models (my hypothesized model; a model that, in addition to a discontinuity at the beginning of the Cultural Revolution, allows the trend during the Cultural Revolution period to be curvilinear; and a model positing discontinuities at both the beginning and the end of the Cultural Revolution) are about equally likely given the data, albeit with weak evidence favoring the single-knot model, and that all three are strongly to be preferred over all other models.

Again, BIC and classical inference yield contradictory results because the two alternative models fit significantly better (at the 0.01 level) than does the originally hypothesized model. Here I am in a bit of a quandary as to which model to prefer. I have already stated a basis for positing a single discontinuity, plus a knot at the end of the Cultural Revolution. However, another analyst might favor a two-discontinuity model, on the ground that the curricular reform in 1977 that restored the primacy of academic subjects was radical enough to posit a discontinuity at the end as well as at the beginning of the Cultural Revolution. A third analyst might argue that a linear specification of trends, especially in times of great social disruption, is too restrictive and that it makes more sense to posit a curvilinear effect of time during the Cultural Revolution period. In Treiman (2007a), I presented the model positing a discontinuity at 1955, a knot at 1967, and a curve between 1955 and 1967; see Figure 7.4 in that paper. However, the truth is that there is no clear basis for preferring any one of the three, except for the evidence provided by BIC, which suggests that the originally hypothesized model is slightly more likely than the others given the data. Again, my suggestion is, go with theory. If you have a theoretical basis for one specification over the others, that is the one to feature; but, at the same time, you must be honest about the fact that alternative specifications fit nearly equally well. In fact, the optimal approach is to present all three models and invite the reader to choose among them. A warning: if you do this, you probably will have to fight with journal editors, who are always trying to get authors to reduce the length of papers, and perhaps with reviewers, who sometimes seem to want definitive conclusions even when the evidence is ambiguous.

The estimated coefficients for all three models are shown in Table 7.4. In all three models each additional year of schooling results in nearly half a point improvement in the number of characters identified. However, the coefficients associated with trends over time are relatively difficult to interpret. Again, this is an instance in which graphing the relationship helps. Figure 7.9 shows, for each of the three preferred models, the predicted number of characters recognized for people with twelve years of schooling, that is, who have completed high school. Although the three graphs appear to be quite different, they all show a decline of about half a point in the number of characters identified for those who were age eleven during the early years of the Cultural Revolution period, relative to those with the same level of schooling who turned eleven before and after the Cultural Revolution. Thus, despite the difficulty in choosing among alternative specifications, together they strongly suggest that the quality of education declined during the Cultural Revolution. People who acquired their middle school (junior high school) education during the Cultural Revolution, in effect, lost a year of schooling; that is, they displayed knowledge of vocabulary equivalent to those with one year less schooling who were educated before and after the Cultural Revolution.

Still, we should be cautious in our interpretation of Figure 7.9, where the Cultural Revolution effect appears to be quite large because of the way the data are graphed (with the y-axis ranging from 5.3 to 6.7 characters recognized). Indeed, Figure 7.10, in which the y-axis ranges from 0 to 10, suggests a rather different story: a very modest decline in the number of characters recognized. It is quite reasonable to report figures such as Figure 7.9 to make the differences among the models [...]
Log-Linear Analysis

[...] does not permit more than two-way products, I take advantage of a user-written Stata command, -desmat- (Hendrickx 1999, 2000, 2001a, 2001b), to specify the required variables. (See the downloadable files "ch12_1.do" and "ch12_1.log" for details.) Also, because -glm- does not provide all the coefficients shown in Table 12.3 and produces an incorrect estimate of BIC (given the way I have specified the problem, -glm- counts as cases the number of cells in the table rather than the number of people in the sample), I ha[...] (for "goodness of fit") and a terse version [...] the downloadable file for this chapter.

[...] estimation command [...] because it can handle many kinds of linear models [...] must specify which model [...] Poisson distribution [...] the dependent variable. Specifying [...] a log-linear model.

[...] the -glm- command generates [...] shown in the first line of Table 12.6. I then repeat the process for a model that [...] to be related but not to interact [...] coefficients shown on the bottom line of Table 12.6 (Model 8). Clearly, this [...] the data well by all criteria, [...] so well as to suggest that a simpler model [...] the remaining coefficients in the table. Inspecting these statistics, we see that none of these models fits the data adequately. [...] I settle on [ARS][AC][RC][SC] as my preferred model. Evidently, age, region,

TABLE 12.6. Goodness-of-Fit Statistics for Log-Linear Models of the [...] Communist Should [...] Speak in [...] Community, Age, Region, and Education [...].

[Table body not recoverable from the source. Models that can still be read in the listing: [ARS][AC]; [ARS][RC]; [ARS][AC][RC]; [ARS][AC][SC]; [ARS][AC][RC][SC] (with a surviving entry of 2.92); [ARS][RC][SC].]

TABLE 12.7. Expected Percentage (from Model 8) Agreeing That "A Communist Should Be Allowed to Speak in Your Community" by Education, Age, and Region, U.S. Adults, 1977.

[Table body not recoverable from the source.]

Note: Cell frequencies are shown in parentheses.

and education all affect attitudes regarding communist speakers. To see what these effects are, I percentage the table of frequencies predicted by this model. (To see how to do this, consult the downloadable files "ch12_1.do" and "ch12_1.log.") These percentages are shown in Table 12.7. The table clearly shows that, controlling for each of the other factors, those who are better educated, younger, and non-Southern are more likely to support the right of a communist to give a speech. In each comparison, the percentage differences always go in the same direction and are quite substantial.

The attitudes reported here are from thirty years ago, during the height of the Cold War. It would be of interest to determine whether the same pattern holds today. To do this within a log-linear framework you would need to construct a second data set, based on recent data (for example, the 2006 GSS), to append the second data set to the first, to add an additional variable (T for "time"), and then to assess whether it is necessary to posit an effect of time (or of interactions between time and any of the two-variable associations) to adequately represent the data. That is, you would estimate [ARS][AC][RC][SC], [ARS][AC][RC][SC][T], and [ARS][ACT][RCT][SCT], and perhaps some intermediate models, and compare their goodness of fit. If none of the more elaborate models produced a better fit than [ARS][AC][RC][SC], which is just Model 8 replicated for the pooled data, you would conclude that attitudes regarding the rights of communists have not changed between 1977 and 2006. If [ARS][AC][RC][SC][T] emerged as the preferred model, you would conclude that there had been an across-the-board change (presumably an increase) in support for the civil liberties of communists. If [ARS][ACT][RCT][SCT] emerged as the preferred model, you would conclude that the structure of the relationships between age, region, and education, respectively, and support for the civil rights of communists changed between 1977 and 2006. [...]

Doing Log-Linear Analysis with Polytomous Variables

[...] race, and membership are dichotomous variables, but education is a three-category variable. This requires that we create two dummy variables: s2 (= 1 for high school graduates and = 0 otherwise) and s3 (= 1 for those with some college and = 0 otherwise), with those lacking high school graduation as the omitted category. Suppose we are interested in [...] a model in which race, education, and [...]. [...] this model with the -glm- command:

    glm count r s2 s3 m rs2 rs3 rm s2m s3m rs2m rs3m v vr vs2 vs3 vm, family(poisson)

where each of the compound variables is a product term, for example, rs2 = r*s2 (see the downloadable file "ch12_1.do" [...]).

[...]

Doing Log-Linear Analysis with Individual-Level Data

[...] (Downloadable file "ch12_1.do" shows [...].)

PARSIMONIOUS MODELS

So far we have dealt with models of global association, or absence of global association [...]. [...] hypotheses [...] like to test regarding the [...] tables can be described by relatively simple models that generate the observed


TABLE 12.8. Frequency Distribution of Voting by Race, Education, and Voluntary Association Membership.

[Table body not recoverable from the source.]

Source: Adapted from Knoke and Burke (1980, Table 3).

pattern of frequencies in the table. The development of such models to describe patterns of intergenerational occupational mobility has been a lively enterprise over the past thirty years or so, but the formal models developed in this context have applications far beyond the study of social mobility (for example, Radelet and Pierce 1985; Schwartz and Mare 2005; Roberts and Chick 2007; Domanski 2008). Still, it is convenient to illustrate these models in the context of mobility analysis. (See downloadable files "ch12_2.do" and "ch12_2.log" for details on the Stata procedures used to estimate the models in the remainder of the chapter.)

It is helpful to begin by deriving a general expression for log odds ratios. Recall Equation 12.4, which gives the natural log of expected frequencies for a two-variable table as a function of a set of $\mu$ parameters. From Equation 12.4 we can write an expression for the log odds ratio of the expected frequencies for cells formed from any pair of rows ($i$ and $i'$) and columns ($j$ and $j'$) in a two-variable table:

$$
\begin{aligned}
\log\theta &= \log\frac{F_{ij}/F_{ij'}}{F_{i'j}/F_{i'j'}} = \log F_{ij} - \log F_{ij'} - \log F_{i'j} + \log F_{i'j'} \\
&= (\mu + \mu_i^{R} + \mu_j^{C} + \mu_{ij}^{RC}) - (\mu + \mu_i^{R} + \mu_{j'}^{C} + \mu_{ij'}^{RC}) - (\mu + \mu_{i'}^{R} + \mu_j^{C} + \mu_{i'j}^{RC}) + (\mu + \mu_{i'}^{R} + \mu_{j'}^{C} + \mu_{i'j'}^{RC}) \\
&= \mu_{ij}^{RC} - \mu_{ij'}^{RC} - \mu_{i'j}^{RC} + \mu_{i'j'}^{RC}
\end{aligned}
\qquad (12.9)
$$

When dummy-variable coding is used, as in Stata's -glm- command, and $i'$ and $j'$ are the reference categories, the right side of Equation 12.9 simplifies to $\mu_{ij}^{RC}$, which makes clear that the interaction parameters represent the log odds ratios for each cell relative to the omitted categories (ordinarily the first row and first column).

Note that to uniquely identify the coefficients, it is necessary to impose constraints. Two different constraints, or "normalizations," are typically used. One is effect coding (used in Equation 12.6 and Appendix 12.A), which expresses coefficients as deviations from the grand total by requiring that the log-form coefficients for each variable sum to zero. The other constraint, dummy-variable coding, codes one category (in Stata, the first category) of each variable as zero.

In the fully saturated model there is a unique coefficient for each cell of the table except, with dummy-variable coding, the cells in the first row and first column. This model can be represented by the following design matrix (for a seven-by-seven table):

     1  1  1  1  1  1  1
     1  2  3  4  5  6  7
     1  8  9 10 11 12 13
     1 14 15 16 17 18 19
     1 20 21 22 23 24 25
     1 26 27 28 29 30 31
     1 32 33 34 35 36 37     = full_dm

Note that a design matrix is simply a variable, with one value per cell, that imposes equality constraints on some subset of cells: all cells with the same value are constrained to have equal coefficients. This design matrix specifies that all the coefficients for the first row and first column are equal; in fact, they are (implicitly) zero by virtue of the dummy-variable coding. None of the remaining coefficients is constrained to be equal. This model uses all the available information, and the observed counts in each table are fit exactly.

TABLE 12.9. Frequency Distribution of Occupation by Father's Occupation [...]. [Columns: respondent's occupation in 1996 (Prof., Cadre, Cler., Sales, Ser., ...). Table body not recoverable from the source.]

Note that in Stata's -glm- command, the specification

    xi: glm count i.X i.Y i.full_dm, family(poisson)

produces results identical to the usual way of specifying the saturated model:

    xi: glm count i.X*i.Y, family(poisson)

That is, -glm- creates a design matrix like that of "full_dm" when the interaction is specified.
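The saturated-model design matrix shown above follows a simple rule, which can be sketched in Python. The function name is hypothetical, and this only generates the matrix of constraint labels; it does not fit the model.

```python
def saturated_design_matrix(n):
    """Return an n-by-n design matrix for the saturated model under
    dummy-variable coding: the first row and first column share the
    value 1, and every other cell gets its own value (2, 3, ...)."""
    matrix = []
    next_value = 2
    for i in range(n):
        row = []
        for j in range(n):
            if i == 0 or j == 0:
                row.append(1)          # reference cells: one shared label
            else:
                row.append(next_value)  # each interior cell is unique
                next_value += 1
        matrix.append(row)
    return matrix


full_dm = saturated_design_matrix(7)
distinct_labels = {value for row in full_dm for value in row}
```

For a seven-by-seven table this yields 37 distinct labels, one for the reference cells plus 36 freely estimated interaction coefficients, which matches the (7 − 1) × (7 − 1) interaction terms of the usual saturated specification.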

[...] I have pooled men and women to increase the [...]. [...] to test the first condition [...] is only marginally significant. Given the relatively large size of the sample, I am inclined to focus on the BIC rather than the p-value and conclude that the first condition is satisfied.

To test the second condition, I contrast a model (call this Model B) that omits the interaction between sex and father's occupation (that is, [SR][FR]) against Model A. The substantive argument for this is that in China, where almost all women are in the labor force, we should expect no difference in the distribution of father's occupation for employed men and women. To contrast the two models, I take the difference in the $L^2$ and the difference in the degrees of freedom to get the p-value for the improvement of goodness of fit resulting from the addition of [SF] and also get the difference in BIC values. Although the fit of Model A is significantly better by classical standards ($p = .019$; $L^2_B - L^2_A = 67.18 - 52.03 = 15.15$; $d.f._B - d.f._A = 42 - 36 = 6$), Model B is more likely given the data ($BIC_B - BIC_A = -285.9 - [-250.6] = -35.3$). Again, I am inclined to put more weight on the BIC difference and conclude that the second condition is satisfied. Thus I am willing to pool men and women for the subsequent analysis, which effectively doubles the sample size.

Table 12.10 shows the coefficients for the saturated model (see "ch12_2.do" to see how these coefficients were computed using Stata). As we have seen, these coefficients are not readily interpreted directly. However, in the present case it may be of interest to contrast particular cells in the table. For example, we might ask about the relative chances of the child of an agricultural worker becoming an agricultural worker instead of a manual worker compared to the corresponding odds for the child of a manual worker. From Equation 12.9 it is evident that the log odds ratio can be computed as

$$\log\theta = \mu_{77}^{RC} + \mu_{66}^{RC} - \mu_{76}^{RC} - \mu_{67}^{RC} = 2.756 + 1.567 - 1.088 - .80 = 2.434$$

which implies that the relative odds are 11.40 ($= e^{2.434}$); that is, the children of agricultural workers are more than eleven times more likely to become agricultural workers themselves, rather than becoming manual workers, than are the children of manual workers. Similarly, the odds that the child of a professional will become a professional instead of becoming a cadre, compared to the corresponding odds for a child of a cadre, are

$$\log\theta = \mu_{11}^{RC} + \mu_{22}^{RC} - \mu_{12}^{RC} - \mu_{21}^{RC} = 0 + .627 - 0 - 0 = .627$$

which implies that the relative odds are 1.87 ($= e^{0.627}$). Clearly, in China (as elsewhere) the "inheritance" of farm occupations relative to inflow from the children of manual workers is much stronger than the inheritance of professional occupations relative to inflow from the children of cadres.
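The arithmetic in the preceding paragraphs (the p-value for the Model A versus Model B contrast and the two odds ratios) can be checked with a short Python sketch using only the standard library. The function names are hypothetical; the chi-square tail probability is computed from a standard series for the incomplete gamma function rather than taken from any statistics package, and the input values are the ones reported in the text.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with df degrees of
    freedom, via the series for the regularized lower incomplete gamma."""
    a = df / 2.0
    half_x = x / 2.0
    if half_x <= 0:
        return 1.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= half_x / (a + n)
        total += term
    lower = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return 1.0 - lower


def odds_ratio(mu_ij, mu_i2j2, mu_ij2, mu_i2j):
    """Odds ratio implied by four interaction parameters (Equation 12.9)."""
    return math.exp(mu_ij + mu_i2j2 - mu_ij2 - mu_i2j)


# Contrast of Model B with Model A: difference in L2 and in degrees of freedom.
p_value = chi2_sf(67.18 - 52.03, 42 - 36)

# Professional versus cadre: three of the four parameters are reference cells.
odds_prof = odds_ratio(0.0, 0.627, 0.0, 0.0)

# Agricultural versus manual: exponentiate the reported log odds ratio.
odds_agri = math.exp(2.434)
```

The p-value comes out at about .019 and the two odds ratios at about 1.87 and 11.40, in line with the figures quoted above.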


Topological, or Levels, Models

Having shown how to interpret the interaction coefficients, I next address whether the table can be simplified. In particular, given the lack of differentiation between sales and service workers in the Chinese economy, I suspect that these two categories might reasonably be collapsed.

Table 12.10. Interaction Parameters for the Saturated Model Applied to Table 12.9. (Rows: father's occupation when respondent was age fourteen. Columns: respondent's occupation in 1996.)

[...]

[Design matrix ss_dm for the seven-by-seven table.]

[...] 16.06, with 11 d.f., because only twenty-five of the [...] as a seven-by-seven table. [...] subsequent analysis.

[...] you should keep in mind this [...] you are trying to decide [...] categories of a table. [...] particular cells of a table as having identical [...]

Quasi-Independence Models



[...] if people are able to free themselves from the social [...] (on the hypothesis [...]) [...] diagonal cells of the table but otherwise forces all interaction parameters to be [...]

2 1 1 1 1 1
1 3 1 1 1 1
1 1 4 1 1 1
1 1 1 5 1 1
1 1 1 1 6 1
1 1 1 1 1 7
= diag_dm

As we can see from the first row of the second panel of Table 12.11, this model is a huge improvement over the independence model, which is the baseline model in Table 12.11. Although it does not fit by classical standards, it is more likely than the saturated model and misclassifies about 2 percent of the cases. Still, other models might fit even better.
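The quasi-independence design matrix can be generated rather than typed by hand. The sketch below assumes the level-coding convention suggested by the legible rows of diag_dm: every off-diagonal cell shares level 1 and each diagonal cell receives its own level, starting at 2. The function name is a hypothetical helper, not a Stata command.

```python
def quasi_independence_levels(n):
    """Return an n-by-n level matrix for a quasi-independence model.

    Off-diagonal cells all share level 1 (no interaction); diagonal cell
    (i, i), counting from 0, gets its own level i + 2.
    """
    return [[(i + 2 if i == j else 1) for j in range(n)] for i in range(n)]

diag_dm = quasi_independence_levels(6)
for row in diag_dm:
    print(" ".join(str(level) for level in row))
```

With six categories this yields seven distinct levels: one shared by all off-diagonal cells and six for the diagonal cells that are fitted exactly.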



Table 12.11. Goodness-of-Fit Statistics for Alternative Models of Intergenerational Occupational Mobility in China (Six-by-Six Table). (Rows include the linear-by-linear models scored by ISEI, by urban hukou, and by ISEI + urban hukou, and the row-and-column-effects II (RC) model, each shown with and without the diagonal cells fitted exactly.)



An important issue in social mobility research is whether, net of any shift in the marginals, the relative odds of upward and downward mobility between corresponding categories are symmetrical. The following design matrix specifies this model for the six-by-six table:

2  1  1  1  1  1
1  3  8  9 10 11
1  8  4 12 13 14
1  9 12  5 15 16
1 10 13 15  6 17
1 11 14 16 17  7
= qs_dm

As we see in Table 12.11, this model fits slightly better than the quasi-independence model by the likelihood ratio standard but not nearly so well by the BIC standard.
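The defining feature of the symmetry design matrix is that cells (i, j) and (j, i) share a level. A minimal Python sketch follows, assuming a coding in which the first row and column are reference cells (level 1), the diagonal cells take levels 2 through 7, and each remaining symmetric pair takes the next unused level; the function name is illustrative only.

```python
def quasi_symmetry_levels(n):
    """Return an n-by-n level matrix in which cells (i, j) and (j, i) match.

    Assumed coding: first row and column are reference cells (level 1),
    diagonal cells get levels 2..n+1, and each remaining symmetric pair
    of off-diagonal cells gets the next unused level.
    """
    levels = [[1] * n for _ in range(n)]
    for i in range(n):
        levels[i][i] = i + 2
    next_level = n + 2
    for i in range(1, n):
        for j in range(i + 1, n):
            levels[i][j] = levels[j][i] = next_level
            next_level += 1
    return levels

qs_dm = quasi_symmetry_levels(6)
for row in qs_dm:
    print(" ".join(f"{level:2d}" for level in row))
```

For a six-by-six table this uses seventeen levels in all, which is why the model spends more degrees of freedom than quasi-independence and fares worse on BIC despite its slightly better likelihood-ratio fit.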

Crossings Models

Suppose we were to take the occupational categories in our six-by-six table as representing social classes, with boundaries that constitute barriers to mobility. Suppose further that, in an analogy to movement across physical space, it is necessary to "cross" each barrier between adjacent classes to achieve mobility between nonadjacent classes. We can represent this model (following Powers and Xie 2000, 117) as

$$F_{ij} = \tau\,\tau_i^{R}\,\tau_j^{C}\,\nu_{ij}, \qquad \nu_{ij} = \begin{cases} \prod_{u=j}^{i-1} \nu_u & \text{for } i > j \\ \prod_{u=i}^{j-1} \nu_u & \text{for } i < j \end{cases}$$

This specification implies the following interaction parameters for the cells of the six-by-six table (with the diagonal cells fitted exactly): [...]








These parameters can be estimated by summing six design matrices, one for each crossing parameter plus one for the diagonal design matrix (diag_dm), and taking antilogs. The corresponding model that does not fit the diagonal exactly is estimated in the same way except that the diagonal design matrix is omitted. Here are the five design matrices for the crossings parameters:

cr1_dm   cr2_dm   cr3_dm   cr4_dm   cr5_dm
011111   001111   000111   000011   000001
100000   001111   000111   000011   000001
100000   110000   000111   000011   000001
100000   110000   111000   000011   000001
100000   110000   111000   111100   000001
100000   110000   111000   111100   111110
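Each crossings matrix simply flags the cells whose origin and destination lie on opposite sides of one barrier, so the log interaction for a cell is the sum of the parameters for every barrier crossed. The sketch below rebuilds the five matrices and applies the crossings estimates reported in this section for the simpler model. It assumes the estimates are listed in barrier order and that category 1 is professional and category 6 is farm; the helper names are hypothetical.

```python
def crossing_matrix(n, u):
    """0/1 matrix for barrier u (between categories u and u + 1, 1-based).

    A cell is 1 when moving between row i and column j crosses the barrier.
    Indices i and j are 0-based, so the barrier lies between index u - 1
    and index u.
    """
    return [[1 if min(i, j) < u <= max(i, j) else 0 for j in range(n)]
            for i in range(n)]

n = 6
cr = [crossing_matrix(n, u) for u in range(1, n)]

# Crossings estimates quoted in the text (assumed to be in barrier order).
nu = [-0.138, 0.002, -0.203, -0.228, -1.033]

# Log interaction for each cell: sum of the parameters for barriers crossed.
log_interaction = [
    [sum(nu[u] * cr[u][i][j] for u in range(n - 1)) for j in range(n)]
    for i in range(n)
]

# Moving between the first and last categories crosses all five barriers.
print(round(log_interaction[0][5], 3))
```

Under these assumptions the farthest move accumulates all five parameters, and nearly two-thirds of that total comes from the last barrier alone.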


As we see in Table 12.11, the crossings model fits better than any of the other models we have reviewed so far. Interestingly, fitting the diagonal cells exactly degrades the fit slightly by the BIC standard, presumably because the diagonal is captured well by the crossings parameters, and the additional degrees of freedom used by fitting the diagonal exactly are penalized by BIC. The crossings parameters for the simpler crossings model are

-0.138   0.002   -0.203   -0.228   -1.033

Clearly, by far the most difficult transition (crossing) is between farm and nonfarm occupations (specifically, manual occupations); this is true everywhere, and China is no exception. Interestingly, the least difficult transition is between cadre and clerical occupations. Again, this is no particular surprise, because in China no sharp distinction is made between clerical and administrative tasks and the best and the brightest of the clerical staff are often tapped to become cadres. The known intragenerational mobility pattern may well carry over to intergenerational mobility, with clerical positions seen as reasonable starting points for the children of administrative cadres and cadre positions as attainable upward mobility goals for the children of clerical workers. Finally, this result could be due to the fact that the analysis here combines males and females. It could well be that the daughters of cadres disproportionately tend to become clerical workers.

Uniform Association Models




When the categories of a table are ordered, it is possible to estimate more parsimonious models than are available for nominal categories. The simplest such model assumes that the difference between each pair of adjacent categories is equal, so that the scale for each variable can be represented by consecutive integers. That is,

$$\log F_{ij} = \mu + \mu_i^{R} + \mu_j^{C} + \beta ij \qquad (12.13)$$

where the strength of the association between the row level and the column level is measured by $\beta$. From this it follows that the log odds ratio between row categories $i$ and $i'$ and column categories $j$ and $j'$ is just

$$\log\theta = \beta (i - i')(j - j') \qquad (12.14)$$


Table 12.11 shows goodness-of-fit statistics for the uniform association model with and without the main diagonal fitted exactly. As you can see, when the diagonal cells are not fit exactly, the uniform association model fits very badly. The reason for this is simple: people disproportionately tend to remain in the same occupational category as their fathers. This tendency is captured by fitting [...] diagonal cells are estimated exactly. When the diagonal cells are estimated exactly, however, the uniform association model fits well. It yields $\beta$ = .046. From Equation 12.14 we can see that this implies, for example, that the log odds that the child of a professional will become a professional rather than a farmer, relative to the corresponding odds for the child of a farmer, is .046(1 - 6)(1 - 6) = 1.150; $e^{1.150}$ = 3.158. That is, the odds of becoming a professional rather than a farmer are more than three times as great for the child of a professional as for the child of a farmer. These are rather low odds ratios, which is consistent with the general sense that intergenerational mobility in China is easier than in most other nations (see [...] 2007, for a fuller argument).

Linear-by-Linear Association Models

Now suppose we have more information than simply a rank order of categories, for example, socioeconomic status scores. We can then estimate a linear-by-linear association model, where the scale scores are substituted for the category indexes. That is, instead of Equation 12.13, we have

$$\log F_{ij} = \mu + \mu_i^{R} + \mu_j^{C} + \beta x_i y_j \qquad (12.15)$$

with the log odds ratio given by

$$\log\theta = \beta (x_i - x_{i'})(y_j - y_{j'}) \qquad (12.16)$$

Estimating this model for the Chinese data, with occupation categories scored by their mean occupational status (ISEI; see Ganzeboom, De Graaf, and Treiman 1992), we achieve a model that fits marginally better than the uniform association model, by the BIC criterion. For this model, $\beta$ = .000483. Thus for the same categories as in the uniform association example, we have .000483(16.2 - 63.7)(16.2 - 63.7) = 1.090; $e^{1.090}$ = 2.974. We are hereby led to a qualitatively similar conclusion: the odds that the child of a professional will become a professional rather than a farmer are about three times as great as the corresponding odds for the child of a farmer.

Note that it is possible to include more than one scaling of the categories of a table to represent different concepts. Table 12.11 shows goodness-of-fit statistics for two additional linear-by-linear models, one of which scales occupations by the proportion of incumbents who have permanent urban registration (urban hukou) and the other of which uses both the ISEI and urban registration measures. As it happens, neither fits as well as the ISEI and uniform association models. However, if we wished to assess the log odds ratio using, say, the model that includes both measures, we would simply apply Equation 12.16 to both variables and compute the sum. (For a well-known application of this kind of model, see Hout 1984.)
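The uniform association and linear-by-linear calculations differ only in the scores plugged in: consecutive integers in the first case, ISEI means in the second. A short Python sketch of both computations follows, using the coefficients and scores quoted in the text; the function name is an illustrative assumption.

```python
import math

def association_odds_ratio(beta, x_i, x_i2, y_j, y_j2):
    """Odds ratio implied by a uniform or linear-by-linear association model.

    beta is the association parameter; (x_i, x_i2) are the scores of the two
    row categories and (y_j, y_j2) the scores of the two column categories.
    """
    return math.exp(beta * (x_i - x_i2) * (y_j - y_j2))

# Uniform association: integer scores, professional = 1 and farmer = 6.
uniform = association_odds_ratio(0.046, 1, 6, 1, 6)

# Linear-by-linear: the same two categories scored by the ISEI values quoted.
isei = association_odds_ratio(0.000483, 16.2, 63.7, 16.2, 63.7)

print(round(uniform, 3), round(isei, 3))
```

For a model with two scalings, the same function would be applied once per scaling and the exponents summed before taking the antilog, which is equivalent to multiplying the two odds ratios.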

Row-Effects (and Column-Effects) Models

Sometimes we are confident that one variable can be scored with an integer scale (that is, that the difference between each pair of adjacent categories is the same) but we are uncertain about how to order the other variable. In such cases we can estimate the unknown scores. In this model the expected frequencies are given by

$$\log F_{ij} = \mu + \mu_i^{R} + \mu_j^{C} + j\phi_i$$

where the $j$ index the categories of one variable and the $\phi_i$ are the estimated scale scores for the other variable. The log odds ratio is given by

$$\log\theta = (\phi_i - \phi_{i'})(j - j')$$
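To make the row-effects formula concrete, here is a small Python sketch. The scores are hypothetical values chosen only for illustration (the text does not list the estimated scores), and the function name is likewise an assumption.

```python
import math

def row_effects_log_odds_ratio(phi_i, phi_i2, j, j2):
    """Log odds ratio under a row-effects model.

    phi_i and phi_i2 are estimated scores for two row categories; j and j2
    are the integer positions of two column categories.
    """
    return (phi_i - phi_i2) * (j - j2)

# Hypothetical row scores, for illustration only (not estimates from the text).
phi = {"row_a": 0.0, "row_b": 0.4}

# Contrast row_b with row_a across columns two steps apart (columns 3 and 1).
lor = row_effects_log_odds_ratio(phi["row_b"], phi["row_a"], 3, 1)
print(round(lor, 2), round(math.exp(lor), 2))
```

Because the column scores are fixed integers, the odds ratio grows by the same factor for each additional column step, while the size of that factor depends entirely on the gap between the two estimated row scores.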


As an example of a situation in which these conditions might hold, consider the relationship between size of place of origin and educational attainment, for the 1996 Chinese survey we have been using. Table 12.12 shows the bivariate frequency distribution for adults not currently attending school. In constructing this table, I have collapsed education so that the categories represent approximate three-year intervals in median schooling. The size-of-place categories are from the official administrative hierarchy of China, which strongly affects the flow of resources to places. Thus, in addition to the general advantage of urban residence for educational attainment (greater exposure to the written word and such), we would expect educational attainment to be greater for places high in the administrative hierarchy because such places are the beneficiaries of more resources from the central government. The row-effects model fits well (BIC = -135, $\Delta$ = 2.96) although not by classical inference (p < .000). But contrary to my expectation, the estimated scores

Table 12.12. Frequency Distribution of Educational Attainment by Size of Place of Residence at Age Fourteen, Chinese Adults Not Enrolled [...]

[...] to make too much of this because the confidence intervals overlap (the 95 percent confidence interval is 0.71 to 1.01 for county-level cities and 0.63 to 0.84 for prefecture-level cities).



Column-effects models are formally identical to row-effects models, but with the role of rows and columns reversed. A column-effects model of the relationship between size of place at age fourteen and educational attainment does not fit as well as the corresponding row-effects model (BIC = -108, $\Delta$ = 2.98, and p < .000), which suggests that the assumption of equal scale differences between adjacent size-of-place categories is probably incorrect. This is hardly surprising given the deviation from equal differences in the estimated coefficients for size-of-place categories in the row-effects model and, especially, the non-monotonicity of the scores relative to my a priori ordering.

Row-and-Column-Effects Model I

Another analytic possibility is to treat both the row and column effects scores as unknown quantities to be estimated. However, in this case it is important to have the correct ordering of both the row and column categories because the results are not invariant under different orderings. For the Chinese example we have been exploring (the relationship between the size of the place of origin and educational attainment) this creates a bit of a dilemma. Is it better to reorder the size-of-place categories according to the scale scores derived from the row effects model or to retain the a priori ordering derived from the Chinese administrative hierarchy? One possibility [...]

[...] 1999, 7-8); and as noted, BIC is not available. The optimal solution may be to treat clustered samples in a multilevel context, estimating either fixed- or random-effects models (Mason 2001), which can be done in Stata using the -xt- or -gee- command; these procedures go beyond what can be covered in this book, but see Chapter Sixteen for a brief introduction to multilevel analysis. Although even now much that is published, even in leading journals, simply ignores complex sample designs and treats data as if they were generated by random sampling procedures, this is generally inappropriate and can lead to incorrect inferences. For the present, I suggest for logistic regression in its various forms that when you have data that are weighted or clustered you carry out your estimation using Stata's survey estimation commands and rely on adjusted Wald tests for model selection. Be cautious, however, in your interpretation and explore alternative specifications. Only where you have a true, unweighted, random sample should you use the -logistic- command and likelihood ratio test (-lrtest-). Further, whenever possible, eschew weighting in favor of including the variables used to create the weights in the model.


[...] techniques for making the general shape of a distribution clear by removing "noise," that is, deviations from the underlying trend that result from sampling error or idiosyncratic factors. Perhaps the simplest smoother is a moving average. A moving average is the average value of several consecutive data points. Consider the worked example in this section. A three-year moving average of the expected probability of marriage at each age would be constructed by first taking the average of the expected probabilities for ages fifteen, sixteen, and seventeen; then the average of the expected probabilities for ages sixteen, seventeen, and eighteen; and so on. At the time the age-at-first-marriage example was created, the Stata subcommand -ma- ("moving average") was available within the -egen- command. However, this subcommand is no longer documented in Stata 10 (although it still works), and has been replaced by -smooth-, which generates medians of the included points rather than means. Another smoother available in Stata is -lowess-.
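The three-year moving average described above can be sketched in a few lines of Python. The probabilities below are hypothetical values for ages fifteen through nineteen, supplied only to show the mechanics, and the function is an illustrative stand-in rather than a reproduction of Stata's -ma- subcommand.

```python
def moving_average(values, window=3):
    """Means of each run of `window` consecutive points.

    No padding is applied, so the result is shorter than the input by
    window - 1 points.
    """
    return [sum(values[k:k + window]) / window
            for k in range(len(values) - window + 1)]

# Hypothetical expected probabilities of first marriage at ages 15 to 19.
probs = [0.02, 0.05, 0.08, 0.06, 0.10]

smoothed = moving_average(probs)
print([round(p, 4) for p in smoothed])
```

The first smoothed value averages ages fifteen through seventeen, the second ages sixteen through eighteen, and so on, which damps the year-to-year bounce at the cost of losing the two end points.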

[...] non-Blacks (precisely, 0.591 = 0.190*3.108). Among 30-year-old never-married people, the odds of marrying in that year among those whose mothers are college graduates are about 10 percent higher than the odds for those of the same race and sex whose mothers are high school graduates (precisely, 1.094 = $(0.918*1.114)^4$).

Despite the usefulness of Table 13.6 for making specific contrasts, the overall pattern implied by the coefficients is difficult to discern. Again, graphs help. Figures 13.4 and 13.5 show three-year moving averages of the expected probability of first marriage by age at risk, separately for Blacks and non-Blacks. In each graph, separate lines are shown for males and females whose mothers had twelve and sixteen years of schooling (as a convenient way of visually representing the effect of mother's education). Moving averages are shown because there is a great deal of "float and bounce" for individual years, which is evident from inspection of the coefficients in Table 13.6. (See the downloadable file "ch13_2.do" for details on how the moving averages were constructed.)

Inspecting Figures 13.4 and 13.5, we see that marriage rates for Blacks differ substantially from those for non-Blacks, with Blacks much less likely than non-Blacks to marry at all. Moreover, non-Black females (especially those whose mothers have only a high school education) marry at disproportionately high rates at ages nineteen through twenty-five; non-Black males marry a bit later and with less concentration in a short period. Black marriage rates, by contrast, are spread out over a much longer period, but with an upsurge in marriage rates for males in their thirties, especially those whose mothers are high school educated. For both Blacks and non-Blacks, males tend to marry later than females, with male rates exceeding those of females beginning around age thirty. Finally, among all race-by-sex groups, those whose mothers are high school graduates are more likely to marry than are those whose mothers are college graduates.

Figure 13.4. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Non-Black U.S. Adults, 1994. (Lines: females and males whose mothers had twelve and sixteen years of schooling. Horizontal axis: age at risk.)

Figure 13.5. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Black U.S. Adults, 1994. (Lines: females and males whose mothers had twelve and sixteen years of schooling. Horizontal axis: age at risk.)

If I were preparing these results for publication, I would present only a subset of the rather large set of tables and graphs we have just marched through. The intent here, of course, is to provide alternatives for you to consider when presenting your own analyses. Other examples of the application of discrete-time hazard-rate models include Astone and others (2000), Dawson (2000), Lewis and Oppenheimer (2000), and Sweeney (2002).

FOURTH WORKED EXAMPLE (CASE-CONTROL MODELS): WHO WAS APPOINTED TO A NOMENKLATURA POSITION IN RUSSIA?

When a dependent variable is a rare event, it is inefficient to draw a representative sample of the population at risk for the event, because the sample size would have to be extremely large to obtain enough "positive" cases to analyze. This is a frequent occurrence in epidemiological research, where the events of interest are diseases, but it also occurs in the social sciences. For example, if we are interested in studying what determines who gets elected to Congress, we could hardly do this by drawing a representative sample of the population and looking for the congressmen in it. We have similar problems in studying crime victimization, homosexuality, and various other relatively uncommon phenomena. One solution to this problem is to sample on the dependent variable (that is, to draw a sample of congressmen, criminals, or homosexuals), collect information on that sample, collect corresponding information on a representative sample of the population that has not experienced the rare event (becoming congressmen, criminals, or homosexuals), combine the two samples, and model the odds of experiencing the rare event. This is known as case-control sampling in the epidemiological literature (for an excellent review of the statistical procedures involved, see Breslow [1996]).

Case-control sampling exploits the fact that odds ratios are invariant under shifts in the distribution of the data. This extremely important feature of odds ratios makes it possible to combine samples with very different distributions on the independent and dependent variable in order to model rare events. This capability is not possible with OLS regression because OLS coefficients are affected by the distributions of the variables in the model.

To see how case-control procedures work in practice, let us consider what factors affect the odds of becoming a member of the Russian political elite at the end of the Communist era. From Social Stratification in Eastern Europe after 1989 (Treiman and Szelényi 1993), we have two representative samples from Russia: a probability sample of the general population (N = 5,002) and a random sample of persons who were in nomenklatura positions as of January 1988 (N = 850). (See Appendix A for a description of the data and information on how to obtain them.) Nomenklatura positions were those that required the approval of the Central Committee of the Communist Party. They ranged from very high government officials (for example, members of the Politburo) down to heads of sensitive organizations, for example, rectors of universities, editors in chief of newspapers, and heads of large industrial enterprises.

The general population sample departs in two ways from compliance with the assumptions underlying case-control sampling, but neither deviation is important from a practical standpoint. First, it is a probability sample of the 1993 population rather than the 1988 population. However, the sampling frame is based on the 1989 census, and the sample therefore probably represents the 1988 population nearly as well as it does




Before turning to interpretation of the results, we should note the one difference between case-control analysis and ordinary binomial logistic regression: in case-control analysis the intercept is not meaningful. This should be obvious from the fact that the intercept in logistic regression indicates the proportion of the sample that is "positive" with respect to the dependent variable. However, in case-control designs this proportion is fixed by the sample design, and thus the coefficient adds no information.

Inspecting the coefficients in Table 13.7, we see very large effects and few surprises. Each year of schooling increases the odds of becoming a member of the nomenklatura by more than 70 percent. Thus, all else equal, university graduates (who typically have 15 years of schooling in Russia) are more than 15 times as likely as high school graduates (with 10 years of schooling) to be appointed to nomenklatura positions (precisely, 15.32 = $1.726^{(15-10)}$). The effect of gender is astronomical: males are more than 17 times as likely as females to be appointed to nomenklatura posts. The effect of age is also extremely strong: all else equal, the odds of being appointed to a nomenklatura position increase about 14 percent per year. Thus, for example, a 50-year-old is more than 7 times as likely to secure a nomenklatura position as is a 35-year-old (precisely, 7.23 = $1.141^{(50-35)}$).

Perhaps more interesting, the effect of social origins, even among those equally well educated, is far from trivial. Coming from a family in which one's father was a member of the Communist Party improves one's chances of a nomenklatura appointment by about half, all else equal. Also, each year of father's schooling increases the odds of nomenklatura appointment by about 11 percent (this in the worker's paradise!), so that the offspring of the university-educated intelligentsia (15 years of school) are about three times as likely as the offspring of those with only a primary education to secure nomenklatura appointments, irrespective of their own educational achievement (precisely, 2.94 = $1.114^{(15-5)}$). Alone among the variables we have considered, father's occupational status has no impact on the odds of appointment to a nomenklatura post.
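The "precisely" figures above all come from the same operation: raising a per-unit odds multiplier to the power of the difference between the two groups being compared. A minimal Python sketch using the multipliers quoted in the text follows; the function name is an illustrative assumption.

```python
def compounded_odds(multiplier, units):
    """Odds multiplier implied by a difference of `units` on a predictor
    whose per-unit odds multiplier is `multiplier`."""
    return multiplier ** units

# Own schooling: 15 versus 10 years, per-year multiplier 1.726.
own_schooling = compounded_odds(1.726, 15 - 10)

# Age: 50 versus 35 years old, per-year multiplier 1.141.
age = compounded_odds(1.141, 50 - 35)

# Father's schooling: 15 versus 5 years, per-year multiplier 1.114.
father_schooling = compounded_odds(1.114, 15 - 5)

print(round(own_schooling, 2), round(age, 2), round(father_schooling, 2))
```

Compounding is multiplicative because the model is linear in the log odds: each additional unit adds the same amount to the log odds and therefore multiplies the odds by the same factor.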

WHAT THIS CHAPTER HAS SHOWN

In this chapter we have seen how to estimate and interpret binary logistic regression models, which are widely used to model dichotomous outcomes such as whether people vote, are employed, or are members of a particular organization. We have seen that although the estimation procedures are quite different, the interpretation of the coefficients of such models is similar to that of OLS regression, except that the coefficients represent net effects of each independent variable on the log odds of an outcome.

Because log odds are not intuitive quantities, we have considered two nonlinear transformations to more readily interpretable coefficients (odds and expected probabilities) and have also seen how to graph net relationships, a form of regression standardization for logistic regression. Finally, we considered three extensions of the basic logistic regression model: education progression ratios, discrete-time hazard-rate models, and case-control models. A notable feature of logistic regression models is that they are invariant with respect to the distributions of variables in the sample, which is what makes case-control procedures legitimate in the logistic regression context but not in the OLS regression context.
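The invariance property that legitimizes case-control sampling can be checked with a small numerical experiment. The sketch below uses a hypothetical population table (all counts are invented for illustration): it keeps every case, keeps only 1 percent of non-cases, and compares the odds ratio and the difference in proportions before and after.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table: a, b = exposed with and without the event;
    c, d = unexposed with and without the event."""
    return (a * d) / (b * c)

def risk_difference(a, b, c, d):
    """Difference in the proportion experiencing the event, exposed minus unexposed."""
    return a / (a + b) - c / (c + d)

# Hypothetical population counts for a rare event.
population = (40, 9960, 10, 89990)

# Case-control sample: keep all cases, keep 1 percent of the non-cases.
a, b, c, d = population
case_control = (a, b * 0.01, c, d * 0.01)

print(round(odds_ratio(*population), 3), round(odds_ratio(*case_control), 3))
print(round(risk_difference(*population), 4), round(risk_difference(*case_control), 4))
```

The odds ratio is unchanged by the sampling scheme, while the difference in proportions, the kind of quantity a linear probability model would estimate, shifts dramatically. This is the sense in which the procedure is sound for logistic regression but not for OLS.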



APPENDIX 13.A SOME ALGEBRA FOR LOGS AND EXPONENTS

For those who have forgotten their school algebra, here are some useful relationships involving natural logarithms and antilogs (exponents):

$$e^{\ln(X)} = X$$

$$\ln(X \cdot Y) = \ln(X) + \ln(Y)$$

$$\ln(X / Y) = \ln(X) - \ln(Y)$$

$$X \cdot Y = e^{\ln(X)}\, e^{\ln(Y)} = e^{(\ln(X) + \ln(Y))}$$
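These identities are easy to confirm numerically. The following Python sketch checks each one for an arbitrary pair of positive numbers; the values of `x` and `y` are illustrative choices, not figures from the text.

```python
import math

x, y = 3.0, 7.0  # any positive values will do

# The antilog undoes the log.
identity_1 = math.isclose(math.exp(math.log(x)), x)

# The log of a product is the sum of the logs.
identity_2 = math.isclose(math.log(x * y), math.log(x) + math.log(y))

# The log of a ratio is the difference of the logs.
identity_3 = math.isclose(math.log(x / y), math.log(x) - math.log(y))

# A product equals the antilog of the summed logs.
identity_4 = math.isclose(x * y, math.exp(math.log(x) + math.log(y)))

print(identity_1, identity_2, identity_3, identity_4)
```

The last identity is the one used throughout the chapter: effects that add on the log odds scale multiply on the odds scale.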

Table [...]. Effect Parameters for a Model of the Determinants of English and Russian Language Competence in the Czech Republic, 1993 (N = 3,945). (Standard errors in parentheses; p-values in italic.) (Rows include the intercepts, years of school completed, ever a Communist Party member, professional in 1988, and other manager in 1988; the table reports coefficients (b) and effect multipliers.)

Professional in 1988 (b): 0.9943 (.1548) .000; 1.124 (.2990) .000; 1.3856 (.3577) .000




[...] about a third higher than the odds that they spoke neither language, whereas the odds that Communist Party members spoke English but not Russian are only about 40 percent as great as the odds that they spoke neither language. Thus the odds that Communist Party members spoke Russian but not English are more than three times as great as the odds that they spoke English but not Russian (because the ratio of the corresponding odds multipliers is 1.354/0.408 = 3.316). The same is true of service as a government or Communist Party official. Here, as expected, officials were nearly five times as likely to speak Russian (in contrast to speaking neither Russian nor English) as were those who were neither managers nor professionals or technicians (recall that the reference category is all other occupations). The odds that government officials spoke English or both Russian and English are effectively zero, which they should be because not one of the sixteen officials in the sample spoke English.

Finally, we see that being a professional or technician in 1988 roughly triples the odds of speaking Russian only or English only, and quadruples the odds of speaking both English and Russian, relative to speaking neither English nor Russian. By contrast, being a manager in 1988 triples the odds of speaking Russian only, relative to speaking neither. But the effects of being a manager on the odds of speaking English or of speaking both English and Russian were both somewhat smaller than the effect of being a manager on the odds of speaking Russian. Also, the coefficients are only marginally significant, at about the 0.1 level.

Although for this example I settled on a single model in advance, model selection for multinomial logit models is carried out in exactly the same way as for binomial logit models: by evaluating the difference in $L^2$ (model $\chi^2$) relative to the difference in the degrees of freedom for any two models, to determine whether one model fits the data significantly better than the other model (but recall that this procedure is not appropriate when robust estimation is used, that is, when the data are weighted or clustered; rather, a Wald test should be used to compare models).
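The model-comparison step can be sketched as follows. To stay self-contained the example uses a closed-form chi-square tail probability that is valid only for even degrees of freedom; this restriction is an assumption of the sketch, and in practice a statistics library routine such as SciPy's `scipy.stats.chi2.sf` would normally be used. The input values are the Model A versus Model B contrast from the log-linear discussion earlier in the text ($L^2$ of 67.18 and 52.03; 42 and 36 degrees of freedom).

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability for a positive, even df.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!, which
    holds only when df is even.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k       # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total

# Difference in L-squared and in degrees of freedom between two nested models.
delta_l2 = 67.18 - 52.03
delta_df = 42 - 36

p = chi2_sf_even_df(delta_l2, delta_df)
print(round(p, 3))
```

A small p-value indicates that the less constrained model fits significantly better by classical standards; as the text stresses, this comparison is appropriate only for unweighted, unclustered data.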

Independence of Irrelevant Alternatives

In the multinomial logit model, the relative odds of being in two categories are assumed to be independent of the other alternatives included in the model. This follows from Equation 14.1, from which we can derive the difference in log odds for two categories, d and c, as

$$\ln\left[\frac{P(Y = d)}{P(Y = c)}\right] = \sum_{k} (b_{dk} - b_{ck})\,x_k$$


Note that only the two categories being compared enter the equation. If, however, the relative odds do depend on what the alternatives are, the model produces misleading estimates. To see this clearly, consider McFadden's (1974) well-known example of transportation choice. Suppose people can travel to work by bus or by car and that half choose to go by car and half by bus. Now suppose a competing bus company establishes buses with the same routes and schedule, so we no longer have, say, only blue buses but also red buses. Presumably, the half that traveled by car would continue to do so, but the half that traveled by bus would divide equally between the red and blue buses, taking whichever bus showed up first at the bus stop. Thus the odds for car versus blue-bus ridership would change from 1:1 to 2:1, violating the assumption of the model.

Now consider another example. Suppose there are two restaurants in a neighborhood, a Mexican and an Italian restaurant, and that the Mexican restaurant gets 60 percent of the total business. Then a new Chinese restaurant opens in the neighborhood and draws off 20 percent of the business of the Mexican restaurant and 20 percent of the business of the Italian restaurant. The Mexican restaurant's share of the total is now 48 percent, and the Italian restaurant's share of the total is 32 percent. Here the independence-of-irrelevant-alternatives (IIA) assumption holds because 60/40 = 48/32 = 3/2.

Because the multinomial model is misleading when the IIA assumption is violated, McFadden suggests that multinomial (and conditional) logistic regression models should be estimated only when the outcome categories "can plausibly be assumed to be distinct and weighed independently in the eyes of each decision maker" (1974, 113).

A formal test of the IIA property is available, implemented in Stata 10.0 as -suest- ("seemingly unrelated estimation," a generalization of an earlier command, -hausman-). The -suest- test compares models that do and do not include presumably irrelevant outcomes. If the resulting parameters for the restricted and unrestricted models are similar, the additional outcomes can be assumed to be irrelevant. Applying these ideas to our current example, we might ask whether the odds that people speak English are affected by including "Russian" as an alternative in the model. In this case the test strongly suggests that the IIA condition is not satisfied. Thus we might consider estimating a sequential logit model in which we successively consider two questions: whether a respondent speaks either Russian or English versus speaking neither language, and for each of the two subsets of respondents (those speaking Russian and those speaking English) whether they speak the other language as well.

For further discussion of the IIA assumption and its consequences, see McFadden (1974), Hausman and McFadden (1984), Hoffman and Duncan (1988), Zhang and Hoffman (1993), Long (1997, 182-184), Powers and Xie (2000, 215-247), Long and Freese (2006), and the -hausman- and -suest- entries in StataCorp (2007). Additional examples of the application of multinomial logit models include Aly and Shields (1991), Haynes and Jacobs (1994), Tomaskovic-Devey and Skaggs (1999), and Breen and Jonsson (2000).
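The two IIA illustrations in this section (buses and restaurants) reduce to checking whether the odds between two existing alternatives survive the arrival of a third. A brief Python sketch using the shares given in the text follows; the function and variable names are illustrative.

```python
import math

def odds(share_a, share_b):
    """Odds of choosing alternative a rather than alternative b."""
    return share_a / share_b

# Restaurants: the newcomer takes 20 percent of each incumbent's business.
mexican_vs_italian_before = odds(60, 40)
mexican_vs_italian_after = odds(60 * 0.8, 40 * 0.8)

# Buses: the red bus takes half of the blue-bus riders and none of the drivers.
car_vs_blue_before = odds(50, 50)
car_vs_blue_after = odds(50, 25)

print(math.isclose(mexican_vs_italian_before, mexican_vs_italian_after))  # IIA holds
print(car_vs_blue_before, car_vs_blue_after)                              # IIA violated
```

In the restaurant case the newcomer draws proportionally from both incumbents, so their relative odds are untouched; in the bus case the newcomer draws only from its close substitute, which is exactly the situation in which multinomial logit estimates become misleading.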



ORDINAL LOGISTIC REGRESSION

Often in the social sciences we have ordinal dependent variables, where the response categories can be ordered on some dimension but where the distance between categories is unknown. Most attitude variables are of this sort. For example, if people are asked to say how happy they are, and the response categories include "very happy," "pretty happy," and "not too happy," there is no ambiguity in assuming that those who say they are "pretty happy" are less happy than those who say they are "very happy" and are more happy than those who say they are "not too happy." However, there is no basis for assuming that the distance between "not too happy" and "pretty happy" is the same as the distance between "pretty happy" and "very happy." Many other attitude scales have similar properties. In such cases we could predict the scale score using ordinary least-squares regression. However, to do so would be tantamount to assuming that the distance between response categories is uniform. (For a fuller discussion of this and other points, see Winship and Mare [1984].)

An alternative is to estimate an ordinal logit equation, which makes use of the ordinal property of the response categories on the dependent variable but makes no assumptions at all about the relative distances between categories. The basic assumption of the ordinal logit model is that there is an unobserved continuous dependent variable, $Y^*$, which is a linear function of a set of independent variables:

$$Y^* = \sum_j b_j x_j + \mu$$

However, what is observed is a set of ordered categories, $Y = 1, 2, \ldots$, such that $Y = 1$ if $-\infty \le Y^* < k_1$; $Y = 2$ if $k_1$