Handbook of the Economics of Marketing: Marketing and Economics 0444637591, 9780444637598

Handbook of the Economics of Marketing, Volume One: Marketing and Economics mixes empirical work in industrial organization


Language: English. Pages: 632 [619]. Year: 2019.




Table of contents:
Handbook of the Economics of Marketing, Volume 1
1 Microeconometric models of consumer demand
1 Introduction
2 Empirical regularities in shopping behavior: The CPG laboratory
3 The neoclassical derivation of an empirical model of individual consumer demand
3.1 The neoclassical model of demand with binding, non-negativity constraints
3.1.1 Estimation challenges with the neoclassical model
3.1.2 Example: Quadratic utility
3.1.3 Example: Linear expenditure system (LES)
3.1.4 Example: Translated CES utility
3.1.5 Virtual prices and the dual approach
3.1.6 Example: Indirect translog utility
3.2 The discrete/continuous product choice restriction in the neoclassical model
3.2.1 The primal problem
3.2.2 Example: Translated CES utility
3.2.3 Example: The dual problem with indirect translog utility
3.2.4 Promotion response: Empirical findings using the discrete/continuous demand model
3.3 Indivisibility and the pure discrete choice restriction in the neoclassical model
3.3.1 A neoclassical derivation of the pure discrete choice model of demand
3.3.2 The standard pure discrete choice model of demand
4 Some extensions to the typical neoclassical specifications
4.1 Income effects
4.1.1 A non-homothetic discrete choice model
4.2 Complementary goods
4.2.1 Complementarity between products within a commodity group
4.2.2 Complementarity between commodity groups (multi-category models)
Example: Perfect substitutes within a commodity group
4.3 Discrete package sizes and non-linear pricing
4.3.1 Expand the choice set
4.3.2 Models of pack size choice
5 Moving beyond the basic neoclassical framework
5.1 Stock-piling, purchase incidence, and dynamic behavior
5.1.1 Stock-piling and exogenous consumption
5.1.2 Stock-piling and endogenous consumption
5.1.3 Empirical findings with stock-piling models
5.2 The endogeneity of marketing variables
5.2.1 Incorporating the supply side: A structural approach
5.2.2 Incorporating the supply side: A reduced-form approach
5.3 Behavioral economics
5.3.1 The fungibility of income
5.3.2 Social preferences
6 Conclusions
2 Inference for marketing decisions
1 Introduction
2 Frameworks for inference
2.1 A brief review of statistical properties of estimators
2.2 Distributional assumptions
2.3 Likelihood and the MLE
2.4 Bayesian approaches
2.4.1 The prior
2.4.2 Bayesian computation
2.5 Inference based on stochastic search vs. gradient-based optimization
2.6 Decision theory
2.6.1 Firm profits as a loss function
2.6.2 Valuation of information sets
2.7 Non-likelihood-based approaches
2.7.1 Method of moments approaches
2.7.2 Ad hoc approaches
2.8 Evaluating models
3 Heterogeneity
3.1 Fixed and random effects
Mixed logit models
3.2 Bayesian approach and hierarchical models
3.2.1 A generic hierarchical approach
3.2.2 Adaptive shrinkage
3.2.3 MCMC schemes
3.2.4 Fixed vs. random effects
3.2.5 First stage priors
Normal prior
Mixture of normals prior
3.2.6 Dirichlet process priors
3.2.7 Discrete first stage priors
3.2.8 Conclusions
3.3 Big data and hierarchical models
3.4 ML and hierarchical models
4 Causal inference and experimentation
4.1 The problem of observational data
4.2 The fundamental problem of causal inference
4.3 Randomized experimentation
4.4 Further limitations of randomized experiments
4.4.1 Compliance in marketing applications of RCTs
4.4.2 The Behrens-Fisher problem
4.5 Other control methods
4.5.1 Propensity scores
4.5.2 Panel data and selection on unobservables
4.5.3 Geographically based controls
4.6 Regression discontinuity designs
4.7 Randomized experimentation vs. control strategies
4.8 Moving beyond average effects
5 Instruments and endogeneity
5.1 The omitted variables interpretation of "endogeneity" bias
5.2 Endogeneity and omitted variable bias
5.3 IV methods
5.3.1 The linear case
5.3.2 Method of moments and 2SLS
5.4 Control functions as a general approach
5.5 Sampling distributions
5.6 Instrument validity
5.7 The weak instruments problem
5.7.1 Linear models
5.7.2 Choice models
5.8 Conclusions regarding the statistical properties of IV estimators
5.9 Endogeneity in models of consumer demand
5.9.1 Price endogeneity
5.9.2 Conclusions regarding price endogeneity
5.10 Advertising, promotion, and other non-price variables
5.11 Model evaluation
6 Conclusions
3 Economic foundations of conjoint analysis
1 Introduction
2 Conjoint analysis
2.1 Discrete choices
2.2 Volumetric choices
2.3 Computing expected demand
2.4 Heterogeneity
2.5 Market-level predictions
2.6 Indirect utility function
3 Measures of economic value
3.1 Willingness to pay (WTP)
3.1.1 WTP for discrete choice
3.1.2 WTP for volumetric choice
3.2 Willingness to buy (WTB)
3.2.1 WTB for discrete choice
3.2.2 WTB for volumetric choice
3.3 Economic price premium (EPP)
4 Considerations in conjoint study design
4.1 Demographic and screening questions
4.2 Behavioral correlates
4.3 Establishing representativeness
4.4 Glossary
4.5 Choice tasks
4.6 Timing data
4.7 Sample size
5 Practices that compromise statistical and economic validity
5.1 Statistical validity
5.1.1 Consistency
5.1.2 Using improper procedures to impose constraints on partworths
5.2 Economic validity
5.2.1 Non-economic conjoint specifications
5.2.2 Self-explicated conjoint
5.2.3 Comparing raw part-worths across respondents
5.2.4 Combining conjoint with other data
6 Comparing conjoint and transaction data
6.1 Preference estimates
6.2 Marketplace predictions
6.3 Comparison of willingness-to-pay (WTP)
7 Concluding remarks
Technical appendix: Computing expected demand for volumetric conjoint
4 Empirical search and consideration sets
1 Introduction
2 Theoretical framework
2.1 Set-up
2.2 Search method
2.2.1 Simultaneous search
2.2.2 Sequential search
2.2.3 Discussion
3 Early empirical literature
3.1 Consideration set literature
3.1.1 Early 1990s
3.1.2 Late 1990s and 2000s
3.1.3 2010s - present
3.1.4 Identification of unobserved consideration sets
3.2 Consumer search literature
3.2.1 Estimation of search costs for homogeneous products
3.2.2 Estimation of search costs for vertically differentiated products
4 Recent advances: Search and consideration sets
4.1 Searching for prices
4.1.1 Mehta et al. (2003)
4.1.2 Honka (2014)
4.1.3 Discussion
4.1.4 De los Santos et al. (2012)
4.1.5 Discussion
4.1.6 Honka and Chintagunta (2017)
4.2 Searching for match values
4.2.1 Kim et al. (2010) and Kim et al. (2017)
4.2.2 Moraga-González et al. (2018)
4.2.3 Other papers
5 Testing between search methods
5.1 De los Santos et al. (2012)
5.2 Honka and Chintagunta (2017)
6 Current directions
6.1 Search and learning
6.2 Search for multiple attributes
6.3 Advertising and search
6.4 Search and rankings
6.5 Information provision
6.6 Granular search data
6.7 Search duration
6.8 Dynamic search
7 Conclusions
5 Digital marketing
1 Reduction in consumer search costs and marketing
1.1 Pricing: Are prices and price dispersion lower online?
1.2 Placement: How do low search costs affect channel relationships?
1.3 Product: How do low search costs affect product assortment?
1.4 Promotion: How do low search costs affect advertising?
2 The replication cost of digital goods is zero
2.1 Pricing: How can non-rival digital goods be priced profitably?
2.2 Placement: How do digital channels - some of which are illegal - affect the ability of information good producers to distribute profitably?
2.3 Product: What are the motivations for providing digital products given their non-excludability?
2.4 Promotion: What is the role of aggregators in promoting digital goods?
3 Lower transportation costs
3.1 Placement: Does channel structure still matter if transportation costs are near zero?
3.2 Product: How do low transportation costs affect product variety?
3.3 Pricing: Does pricing flexibility increase because transportation costs are near zero?
3.4 Promotion: What is the role of location in online promotion?
4 Lower tracking costs
4.1 Promotion: How do low tracking costs affect advertising?
4.2 Pricing: Do lower tracking costs enable novel forms of price discrimination?
4.3 Product: How do markets where the customer's data is the 'product' lead to privacy concerns?
4.4 Placement: How do lower tracking costs affect channel management?
5 Reduction in verification costs
5.1 Pricing: How willingness to pay is bolstered by reputation mechanisms
5.2 Product: Is a product's 'rating' now an integral product feature?
5.3 Placement: How can channels reduce reputation system failures?
5.4 Promotion: Can verification lead to discrimination in how goods are promoted?
6 Conclusions
6 The economics of brands and branding
1 Introduction
2 Brand equity and consumer demand
2.1 Consumer brand equity as a product characteristic
2.2 Brand awareness, consideration, and consumer search
2.2.1 The consumer psychology view on awareness, consideration, and brand choice
2.2.2 Integrating awareness and consideration into the demand model
2.2.3 An econometric specification
2.2.4 Consideration and brand valuation
3 Consumer brand loyalty
3.1 A general model of brand loyalty
3.2 Evidence of brand choice inertia
3.3 Brand choice inertia, switching costs, and loyalty
3.4 Learning from experience
3.5 Brand advertising goodwill
4 Brand value to firms
4.1 Brands and market structure
4.2 Measuring brand value
4.2.1 Reduced-form approaches using price and revenue premia
4.2.2 Structural models
5 Branding and firm strategy
5.1 Brand as a product characteristic
5.2 Brands and reputation
5.3 Branding as a signal
5.4 Umbrella branding
5.4.1 Empirical evidence
5.4.2 Umbrella branding and reputation
5.4.3 Umbrella branding and product quality signaling
5.5 Brand loyalty and equilibrium pricing
5.6 Brand loyalty and early-mover advantage
6 Conclusions
7 Diffusion and pricing over the product life cycle
1 Introduction
Three waves
Implication for the PLC: A new perspective
An agenda for further research
2 The first wave: Models of new product diffusion as way to capture the PLC
2.1 Models of "external" influence
2.2 Models of "internal" influence
2.3 Bass's model
2.4 What was missing in the first wave?
3 The second wave: Life cycle pricing with diffusion models
3.1 Price paths under separable diffusion specifications
3.2 Price paths under market potential specifications
3.3 Extensions to individual-level models
3.4 Discussion
3.5 What was missing in the second wave?
Consumer expectations
Open vs. closed-loop strategies
4 The third wave: Life cycle pricing from micro-foundations of dynamic demand
4.1 Dynamic life-cycle pricing problem overview
4.2 Monopoly problem
Consumer's inter-temporal choice problem
Evolution of states
Flow of profits and value function
4.3 Oligopoly problem
Consumer's inter-temporal choice problem
Evolution of states
Putting both together
Flow of profits and value function
4.4 Discussion
Inferring demand and cost parameters
Handling expectations
Discount factors
Large state spaces
4.5 Additional considerations related to durability
4.5.1 Commitments via binding contracts
4.5.2 Availability and deadlines
4.5.3 Second-hand markets
4.5.4 Renting and leasing
4.5.5 Complementary goods and network effects
4.6 Summary
5 Goods with repeat purchase
5.1 Theoretical motivations
5.2 Empirical dynamic pricing
5.2.1 State dependent utility
5.2.2 Storable goods
5.2.3 Consumer learning
5.3 Summary
6 Open areas where more work will be welcome
6.1 Life-cycle pricing while learning an unknown demand curve
6.2 Joint price and advertising over the life-cycle
6.3 Product introduction and exit
6.4 Long term impact of marketing strategies on behavior
6.5 Linking to micro-foundations
8 Selling and sales management
1 Selling, marketing, and economics
1.1 Selling and the economy
1.2 What exactly is selling?
1.3 Isn't selling the same as advertising?
1.4 The role of selling in economic models
1.5 What this chapter is and is not
1.6 Organization of the chapter
2 Selling effort
2.1 Characterizing selling effort
2.1.1 Selling effort is a decision variable
2.1.2 Selling effort is unobserved
2.1.3 Selling effort is multidimensional
2.1.4 Selling effort has dynamic implications
2.1.5 Selling effort interacts with other firm decisions
3 Estimating demand using proxies for effort
3.1 Salesforce size as effort
3.1.1 Recruitment as selling: Prospecting for customers
3.1.2 Discussion: Salesforce size and effort
3.2 Calls, visits, and detailing as selling effort
3.2.1 Does detailing work?
3.2.2 How does detailing work?
3.2.3 Is detailing = effort?
4 Models of effort
4.1 Effort and compensation
4.2 Effort and nonlinear contracts
4.3 Structural models
4.3.1 Effort and demand
4.3.2 The supply of effort
4.4 Remarks
5 Selling and marketing
5.1 Product
5.2 Pricing
5.3 Advertising and promotions
6 Topics in salesforce management
6.1 Understanding salespeople
6.2 Organizing the salesforce
6.2.1 Territory decisions
6.2.2 Salesforce structure
6.2.3 Decision rights
6.3 Compensating and motivating the salesforce
6.3.1 Contract elements
6.3.2 Contract shape and form
6.3.3 Dynamics
6.3.4 Other issues
7 Some other thoughts
7.1 Regulation and selling
7.2 Selling in the new world
7.3 Concluding remarks
9 How price promotions work: A review of practice and theory
1 Introduction
2 Theories of price promotion
2.1 Macroeconomics
2.2 Price discrimination
2.2.1 Inter-temporal price discrimination
2.2.2 Retail competition and inter-store price discrimination
2.2.3 Manufacturer (brand) competition and inter-brand price discrimination
2.3 Demand uncertainty and price promotions
2.4 Consumer stockpiling of inventory
2.5 Habit formation: Buying on promotion
2.6 Retail market power
2.7 Discussion
3 The practice of price promotion
3.1 Overview of trade promotion process
3.2 Empirical example of trade rates
3.3 Forms of trade spend
3.3.1 Off-invoice allowances
3.3.2 Bill backs
3.3.3 Scan backs
3.3.4 Advertising and display allowances
3.3.5 Markdown funds
3.3.6 Bracket pricing, or volume discounts
3.3.7 Payment terms
3.3.8 Unsaleables allowance
3.3.9 Efficiency programs
3.3.10 Slotting allowances
3.3.11 Rack share
3.3.12 Price protection
3.4 Some implications of trade promotions
3.5 Trade promotion trends
3.6 Planning and tracking: Trade promotion management systems
4 Empirical literature on price promotions
4.1 Empirical research - an update
4.1.1 Promotional pass-through
4.1.2 Long-term effects of promotion
4.1.3 Asymmetric cross-promotional effects
4.1.4 Decomposition of promotional sales
4.1.5 Advertised promotions result in increased store traffic
4.1.6 Trough after the deal
4.2 Empirical research - newer topics
4.2.1 Price promotions and category-demand
4.2.2 Cross-category effects and market baskets
4.2.3 Effectiveness of price promotion with display
4.2.4 Coupon promotions
4.2.5 Stockpiling and the timing of promotions
4.2.6 Search and price promotions
4.2.7 Targeted price promotions
4.3 Macroeconomics and price promotions
4.4 Promotion profitability
5 Getting practical
5.1 Budgets and trade promotion adjustments
5.2 Retailer vs. manufacturer goals and issues
5.3 When decisions happen: Promotion timing and adjustments
5.4 Promoted price: Pass-through
5.5 Durable goods price promotion
5.6 Private label price promotions
5.7 Price pass through
6 Summary
10 Marketing and public policy
1 Introduction
2 The impact of academic research on policy
3 Competition policy
3.1 Market definition and structural analysis
3.2 Economic analysis of competitive effects
3.3 A few recent examples
3.3.1 The Aetna-Humana proposed merger
3.3.2 The AT&T-DirecTV merger
3.3.3 Mergers that increase bargaining leverage
3.4 Looking forward
4 Nutrition policy
4.1 Objectives of nutrition policy
4.2 Nutrient taxes
4.2.1 The effects of taxes
4.2.2 Estimating pass-through
4.3 Restrictions to advertising
4.3.1 The mechanisms by which advertising might affect demand
4.3.2 Empirically estimating the impact of advertising
4.4 Labeling
4.5 Looking forward
5 Concluding comments
Back Cover


Handbook of the Economics of Marketing, Volume 1

Edited by

Jean-Pierre Dubé
Sigmund E. Edelstone Professor of Marketing
University of Chicago Booth School of Business and N.B.E.R.
Chicago, IL, United States

Peter E. Rossi
Anderson School of Management
University of California, Los Angeles
Los Angeles, CA, United States

North-Holland is an imprint of Elsevier
Radarweg 29, PO Box 211, 1000 AE Amsterdam, Netherlands
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

Copyright © 2019 Elsevier B.V. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions. This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility. To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-444-63759-8

For information on all North-Holland publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Zoe Kruze
Acquisition Editor: Jason Mitchell
Editorial Project Manager: Shellie Bryant
Production Project Manager: James Selvam
Designer: Greg Harris
Typeset by VTeX

Contributors

Greg M. Allenby, Fisher College of Business, Ohio State University, Columbus, OH, United States
Eric T. Anderson, Kellogg School of Management, Northwestern University, Evanston, IL, United States
Bart J. Bronnenberg, Tilburg School of Economics and Management, Tilburg University, Tilburg, The Netherlands; CEPR, London, United Kingdom
Jean-Pierre Dubé, Booth School of Business, University of Chicago, Chicago, IL, United States; NBER, Cambridge, MA, United States
Edward J. Fox, Cox School of Business, Southern Methodist University, Dallas, TX, United States
Avi Goldfarb, Rotman School of Management, University of Toronto, Toronto, ON, Canada; NBER, Cambridge, MA, United States
Rachel Griffith, Institute for Fiscal Studies and University of Manchester, Manchester, United Kingdom
Nino Hardt, Fisher College of Business, Ohio State University, Columbus, OH, United States
Elisabeth Honka, UCLA Anderson School of Management, Los Angeles, CA, United States
Ali Hortaçsu, University of Chicago, Chicago, IL, United States; NBER, Cambridge, MA, United States
Sanjog Misra, University of Chicago Booth School of Business, Chicago, IL, United States
Sridhar Moorthy, Rotman School of Management, University of Toronto, Toronto, ON, Canada
Harikesh S. Nair, Stanford Graduate School of Business, Stanford, CA, United States
Aviv Nevo, University of Pennsylvania, Philadelphia, PA, United States
Peter E. Rossi, Anderson School of Management, University of California at Los Angeles, Los Angeles, CA, United States
Catherine Tucker, MIT Sloan School of Management, Cambridge, MA, United States; NBER, Cambridge, MA, United States
Matthijs Wildenbeest, Kelley School of Business, Indiana University, Bloomington, IN, United States

Preface

This volume is the first in a new Handbook of the Economics of Marketing series. Quantitative marketing is a much younger field than either economics or statistics. While substantial parts of our understanding of consumer welfare and demand theory were laid out in the late 19th and early 20th centuries, serious inquiries into models of consumer behavior in marketing started only in the late 1960s. However, it was really during the past 25–30 years that access to remarkably detailed, granular customer- and seller-level databases generated a take-off in the quantitative marketing literature. The increasing focus by the fields of empirical industrial organization (I/O) and macroeconomics on several of the key themes in marketing has highlighted the central role of marketing institutions in economic outcomes. The purpose of this handbook is both to chronicle the progress in marketing research and to introduce researchers in economics to the role of marketing in our understanding of consumer and firm behavior, and to inform public policy. While marketing and economics researchers share many of the common tools of micro-economics and econometrics, there is a fundamental distinction between the aims of the two disciplines. Most research in economics should be viewed as positive economics, namely the pursuit of explanations for given market phenomena such as the determinants of market structure or the pricing equilibrium prevailing in a given market. On the other hand, marketing is primarily concerned with the evaluation of firm policies and, as such, is much more of a normative field. For example, a marketing researcher may use the tools of micro-economics to develop models of consumer behavior (demand) but does not impose the restriction that firms, necessarily, behave optimally with respect to a given set of marketing instruments and information.
For example, as detailed customer-level data became available, marketing research focused on how to use these data to develop customized advertising and promotion of products. Marketing researchers are loath to assume that firms behave optimally with respect to the use of a new source of information. In the first chapter of this volume, Dubé considers the micro-foundations of demand with an emphasis on the challenges created by the much richer and more dis-aggregate data that are typically available in marketing applications. Researchers in economics have long fit models of aggregate demand, and the modern I/O literature emphasizes consistent models of aggregation of individual demands. In marketing, the demand literature has focused on individual, consumer-level drivers of demand, starting with access to consumption diary panels in the late 1950s. With the advent of detailed household purchase panels in the early 1980s, there was a take-off in microeconometric studies using the recent developments in random utility-based choice models. However, dis-aggregate demand data present many challenges in demand modeling that stem from the discreteness of these data. A substantial component of the literature seeks to accommodate aspects of discreteness which cannot be
accommodated by the multinomial models that have been popular in I/O. These aspects include corner solutions with a mixture of discrete and continuous outcomes and non-mutually-exclusive discrete outcomes. Marketing researchers were also the first to document the importance of unobservable consumer heterogeneity and to point out that this heterogeneity is pervasive, affecting more than just the subset of variables typically assumed in the empirical I/O literature. Finally, the marketing literature pioneered the study of dynamic consumer choice behavior and its implications for differences between short-run and long-run elasticities of demand. The demands that a world with a high volume of disaggregate data places on inference are discussed by Allenby and Rossi in Chapter 2. In addition, this chapter considers the demands that a normative orientation imposes on inference. Marketing researchers were early to embrace Bayesian approaches to inference, to a degree still not matched in economics. The current strong interest in machine learning methods in economics is a partial endorsement of Bayesian methods, since these highly over-parameterized models are often fit with Bayesian or approximate Bayesian methods because of their superior estimation properties. Bayesian methods have been adopted in marketing primarily because of the practical orientation of marketing researchers. Researchers in marketing are looking for methods that work well rather than simply debating the value of an inference paradigm in the abstract. While discrete data on demand have proliferated, challenges to valid causal inference have also arisen or become accentuated. The classic example of this is a sponsored search advertisement. Here, information regarding preferences or interest in a product category is used to trigger the sponsored search ad.
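The selection problem created by triggered ads can be made concrete with a small, purely illustrative simulation (not drawn from the handbook; the interest rate, purchase probabilities, and the 2-point true ad effect are all invented numbers). When exposure is targeted on unobserved interest, the naive exposed-vs-unexposed comparison vastly overstates the ad's causal effect; randomizing exposure recovers it.

```python
import random

random.seed(0)

def simulate(n, targeted):
    """Return the exposed-minus-unexposed difference in purchase rates."""
    AD_EFFECT = 0.02  # true causal effect of an impression (invented)
    exposed = exposed_buys = unexposed = unexposed_buys = 0
    for _ in range(n):
        interested = random.random() < 0.2          # unobserved preference
        if targeted:
            ad = interested                          # advertiser targets on interest
        else:
            ad = random.random() < 0.5               # randomized exposure (RCT)
        base = 0.15 if interested else 0.01          # interest drives purchases
        buy = random.random() < base + (AD_EFFECT if ad else 0.0)
        if ad:
            exposed += 1
            exposed_buys += buy
        else:
            unexposed += 1
            unexposed_buys += buy
    return exposed_buys / exposed - unexposed_buys / unexposed

naive = simulate(200_000, targeted=True)   # observational: exposure tracks interest
rct = simulate(200_000, targeted=False)    # randomized experiment

# The naive contrast conflates interest with the ad effect; the RCT
# contrast lands near the true 0.02.
print(round(naive, 3), round(rct, 3))
```

The point is only qualitative: conditioning exposure on an unobservable that also drives purchases biases the observational comparison by roughly the gap in baseline purchase rates, which is why valid causal inference, not raw lift, is needed to evaluate advertising policies.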
Clearly, observational data suffer from a severe endogeneity bias, as advertisers are selecting explicitly on unobservable preferences (typically, interest in a product category). This poses a fundamental inference challenge, as any optimization or evaluation of advertising policies requires valid causal inference. Experimentation provides one approach to obtaining valid causal estimates but comes with other limitations and challenges, as discussed in the chapter. Most economists take the point of view that revealed preference data are to be preferred to stated preference data. While certainly a reasonable point of view, it is somewhat narrow. If we can approximate the purchase environment in a survey context, it is possible to generate data that may rise to the level of revealed preference data. In addition, it is often not possible to find revealed preference data that are sufficiently informative to estimate preferences. For example, many observational datasets have very limited price variation and are subject to legitimate concerns about the endogeneity of marketing variables, such as prices. Clearly, prospective analyses of new products or new product features lack revealed preference data. Unique to marketing is the survey method of preference measurement called conjoint analysis. In Chapter 3, Allenby, Rossi, and Hardt discuss the economic foundations of conjoint analysis as well as extensions that consider both discrete and continuous outcomes. In much of the economics and marketing literatures, demand is formulated under some sort of full-information assumption. A classic example is the assumption in nearly all demand applications that prices are known with certainty. If, instead,
consumers undertake costly search for price or quality information, then these search frictions must be accommodated in the demand model. Here, economics and marketing intersect very closely, as marketers have always recognized that consumers make choices based on their “consideration sets,” which typically include only a small subset of the available products in any given market or product category. While the theoretical implications of sequential and simultaneous search have been worked out in the economics literature, the empirical search literature has only recently taken flight due to a lack of data on the search process. In the online context, the data problem has been removed, as browsing data give us our first comprehensive measure of search. In the off-line context, data are still hard to come by. In Chapter 4, Honka, Hortaçsu, and Wildenbeest discuss recent developments in this area of mutual interest for empirical research in both marketing and economics. In many markets, digital media and technologies have reduced the cost of search dramatically. For example, consumers can obtain a wealth of information regarding car purchases from internet sources without visiting any dealer. In parallel, digital technologies threaten the value of stores as a logistical intermediary, serving as a fundamental source of change in many industries. In Chapter 5, Goldfarb and Tucker discuss these trends and provide insight into their implications for marketing practice and research. In addition, digital media have fundamentally changed the way in which much advertising is conducted. Advertisers now have the ability to trigger ads as well as to track, at the individual level, the response to these ads. This opens many new possibilities for marketing policy. Much of modern I/O has concentrated on the economic mechanisms that sustain high levels of industry concentration and supra-normal oligopoly profits.
One of the leading puzzles in this literature has been the persistence of concentration and dominance in markets where the leading products are differentiated primarily by their brands. Surprisingly, even the literature on pure characteristics models has typically ignored the important role of brands and branding as a source of product differentiation. In Chapter 6, Bronnenberg, Dubé, and Moorthy discuss branding, which can be viewed as one of the more important sources of product differentiation. Marketers have long recognized the importance of brands and considered various mechanisms through which brand preferences are developed on the demand side. The primary source of economic value to many firms is their brand equity and, accordingly, the authors also consider the returns on branding to firm value on the supply side. At least since the 1960s, quantitative marketing has been interested in the life-cycle regularities of products as they diffuse from launch to maturity. The diffusion literature typically focuses on the dynamics of consumer adoption. While the early literature worked mostly with descriptive models, the recent literature has adopted a structural approach that starts with microeconomic foundations to analyze the consumer-driven mechanisms that shape a product’s diffusion over time. In Chapter 7, Nair considers the intersection of the economics and marketing literatures regarding the dynamics of consumer adoption decisions on the demand side. The chapter also studies the corresponding supply-side dynamics of firms’ marketing decisions (e.g., entry, pricing, advertising) to control the diffusion of their products over time. Here, the combination
of data and empirical methods for the estimation of dynamic models has provided a new empirical companion to the theoretical results in the economics literature. In many important markets, the primary interface between the firm and the customer is the salesforce. The activities of the salesforce represent the single largest marketing expenditure for many firms. In markets for drugs and medical devices, these activities have come under public scrutiny as potentially wasteful or even distorting of demand. The role of the salesforce is considered in Chapter 8. Misra reviews the contracting problem, which is the primary contribution of the economics literature to this area. Recent empirical work on salesforce compensation has emphasized dynamic considerations not present in the classical economics contracting literature. Over the last few decades, US manufacturers have increasingly shifted their marketing budgets away from traditional advertising toward price promotions, typically paid to downstream trade partners who handle the re-sale to end-user consumers. In consumer goods alone, annual trade promotion spending is estimated to be over $500 billion. Trade promotion funds are intended to induce temporary discounts on shelf prices, or sales, that represent one of the key sources of price variation over time in many consumer goods markets. Surprisingly, many theoretical models of pricing ignore the institutional structure of the distribution channel that leads to temporary, promotional price cuts, thereby ignoring a key source of price changes facing consumers. In Chapter 9, Anderson and Fox offer a broad overview of the managerial and institutional trade promotion practices of manufacturers and retailers. They also survey the large literature on pricing, illustrating the gaps in the theory and the opportunities for future research.
The discussion ties together the key factors driving promotions, including the vertical channel itself, price discrimination, vertical and horizontal competition in the channel, and consumers' ability to stockpile goods.

Marketing considerations have become vital in many analyses of public policy issues. For example, there are debates about various public policy initiatives to promote the consumption of healthier foods. One possibility endorsed by many public policy makers is to provide information on the nutritional content, especially of processed and fast foods. Another method advocated by some is to impose "vice" taxes on food products which are deemed undesirable. In Chapter 10, Nevo and Griffith point out that the evaluation of these policies involves demand modeling and consideration of the response of firms to these policies. Nutrition policy is only one example of policy evaluation where marketing considerations are important. Analysis of mergers is now based on models of demand for differentiated products pioneered in both the economics and marketing literatures. Many other public policy questions hinge on the promotion, sales, pricing, and advertising decisions of firms and, therefore, provide a strong motivation for continued progress in research at the intersection of economics and marketing.

Jean-Pierre Dubé
Sigmund E. Edelstone Professor of Marketing
University of Chicago Booth School of Business and N.B.E.R.
Chicago, IL, United States


Peter E. Rossi
Anderson School of Management
University of California, Los Angeles
Los Angeles, CA, United States




Microeconometric models of consumer demand✩

Jean-Pierre Dubé a,b

a Booth School of Business, University of Chicago, Chicago, IL, United States
b NBER, Cambridge, MA, United States
e-mail address: [email protected]

Contents
1 Introduction
2 Empirical regularities in shopping behavior: The CPG laboratory
3 The neoclassical derivation of an empirical model of individual consumer demand
3.1 The neoclassical model of demand with binding, non-negativity constraints
3.1.1 Estimation challenges with the neoclassical model
3.1.2 Example: Quadratic utility
3.1.3 Example: Linear expenditure system (LES)
3.1.4 Example: Translated CES utility
3.1.5 Virtual prices and the dual approach
3.1.6 Example: Indirect translog utility
3.2 The discrete/continuous product choice restriction in the neoclassical model
3.2.1 The primal problem
3.2.2 Example: Translated CES utility
3.2.3 Example: The dual problem with indirect translog utility
3.2.4 Promotion response: Empirical findings using the discrete/continuous demand model
3.3 Indivisibility and the pure discrete choice restriction in the neoclassical model
3.3.1 A neoclassical derivation of the pure discrete choice model of demand
3.3.2 The standard pure discrete choice model of demand
4 Some extensions to the typical neoclassical specifications
4.1 Income effects
4.1.1 A non-homothetic discrete choice model
4.2 Complementary goods


✩ Dubé acknowledges the support of the Kilts Center for Marketing and the Charles E. Merrill faculty research fund for research support. I would like to thank Greg Allenby, Shirsho Biswas, Oeystein Daljord, Stefan Hoderlein, Joonhwi Joo, Kyeongbae Kim, Yewon Kim, Nitin Mehta, Olivia Natan, Peter E. Rossi, and Robert Sanders for helpful comments and suggestions.

Handbook of the Economics of Marketing, Volume 1, ISSN 2452-2619, https://doi.org/10.1016/bs.hem.2019.04.001
Copyright © 2019 Elsevier B.V. All rights reserved.



CHAPTER 1 Microeconometric models of consumer demand

4.2.1 Complementarity between products within a commodity group
4.2.2 Complementarity between commodity groups (multi-category models)
4.3 Discrete package sizes and non-linear pricing
4.3.1 Expand the choice set
4.3.2 Models of pack size choice
5 Moving beyond the basic neoclassical framework
5.1 Stock-piling, purchase incidence, and dynamic behavior
5.1.1 Stock-piling and exogenous consumption
5.1.2 Stock-piling and endogenous consumption
5.1.3 Empirical findings with stock-piling models
5.2 The endogeneity of marketing variables
5.2.1 Incorporating the supply side: A structural approach
5.2.2 Incorporating the supply side: A reduced-form approach
5.3 Behavioral economics
5.3.1 The fungibility of income
5.3.2 Social preferences
6 Conclusions
References


1 Introduction

A long literature in quantitative marketing has used the structural form of microeconometric models of demand to analyze consumer-level purchase data and conduct inference on consumer behavior. These models have played a central role in the study of some of the key marketing questions, including the measurement of brand preferences and consumer tastes for variety, the quantification of promotional response, the analysis of the launch of new products, and the design of targeted marketing strategies. In sum, the structural form of the model is critical for measuring unobserved economic aspects of consumer preferences and for simulating counter-factual marketing policies that are not observed in the data (e.g., demand for a new product that has yet to be launched).

The empirical analysis of consumer behavior is perhaps one of the key areas of overlap between the economics and marketing literatures. The application of empirical models of aggregate demand using restrictions from microeconomics dates back at least to the mid-20th century (e.g., Stone, 1954). Demand estimation plays a central role in marketing decision-making. Marketing-mix models, or models of demand that account for the causal effects of marketing decision variables, such as price, promotions, and other marketing tools, are fundamental for the quantification of different marketing decisions. Examples include the measurement of market power, the measurement of sales response to advertising, the analysis of new product introductions, and the measurement of consumer welfare.

Historically, the data used for demand estimation typically consisted of market-level, aggregate sales quantities under different marketing conditions. In the digital age, access to transaction-level data at the point of sale has become nearly ubiquitous. In many settings, firms can now assemble detailed, longitudinal databases tracking individual customers' purchase behavior over time and across channels. Simply aggregating these data for the purposes of applying traditional aggregate demand estimation techniques creates several problems. First, aggregation destroys potentially valuable information about customer behavior. Besides the loss of potential statistical efficiency, aggregation eliminates the potential for individualized demand analysis. Innovative selling technologies have facilitated a more segmented and even individualized approach to marketing, requiring a more intimate understanding of the differences in demand behavior between customer segments or even between individuals. Second, aggregating individual demands across customers facing heterogeneous marketing conditions can create biases that could have adverse effects on marketing decision-making (e.g., Gupta et al., 1996).1

Our discussion herein focuses on microeconometric models of demand designed for the analysis of individual consumer-level data. The microeconomic foundations of a demand model allow the analyst to assign a structural interpretation to the model's parameters, which can be beneficial for assessing "consumer value creation" and for conducting counter-factual analyses. In addition, as we discuss herein, the cross-equation restrictions derived from consumer theory can facilitate more parsimonious empirical specifications of demand. Finally, the structural treatment of econometric uncertainty as a model primitive provides a direct correspondence between the likelihood function and the underlying microeconomic theory.

Some of the earliest applications of microeconometric models to marketing data analyzed the decomposition of consumer responses to temporary promotions at the point of purchase (e.g., Guadagni and Little, 1983; Chiang, 1991; Chintagunta, 1993).
Of particular interest was the relative extent to which temporary price discounts caused consumers to switch brands, increase consumption, or strategically forward-buy to stockpile during periods of low prices. The microeconometric approach provided a parsimonious, integrated framework with which to understand the inter-relationship between these decisions and consumer preferences. At the same time, the cross-equation restrictions from consumer theory can reduce the degrees of freedom in an underlying statistical model used to predict these various components of demand. Most of the foundational work derives from the consumption literature (see Deaton and Muellbauer, 1980b, for an extensive overview). The consumption literature often emphasizes the use of cost functions and duality concepts to simplify the implementation of the restrictions from economic theory. In this survey, we mostly focus on the more familiar direct utility maximization problem. The use of a parametric utility function facilitates the application of demand estimates to broader topics than the analysis of price and income effects, such as product quality choice, consumption indivisibilities, product positioning, and product design.

1 Blundell et al. (1993) find that models fit to aggregate data generate systematic biases relative to models fit to household-level data, especially in the measurement of income effects.




In addition, our discussion focuses on a very granular, product-level analysis within a product category.2 Unlike the macro-consumption literature, which focuses on budget allocations across broad commodity groups like food, leisure, and transportation, we focus on consumers' brand choices within a narrow commodity group, such as the brand variants and quantities of specific laundry detergents or breakfast cereals purchased on a given shopping trip. The role of brands and branding has been shown to be central to the formation of industrial market structure (e.g., Bronnenberg et al., 2005; Bronnenberg and Dubé, 2017). To highlight some of the differences between a broad commodity-group focus versus a granular brand-level focus, we begin the chapter with a short descriptive exercise laying out several key stylized facts for households' shopping behavior in consumer packaged goods (hereafter CPG) product categories using the Nielsen-Kilts Homescan database. We find that the typical consumer goods category offers a wide array of differentiated product alternatives available for sale to consumers, often at different prices and under different marketing conditions. Therefore, consumer behavior involves a complex trade-off between the prices of different goods and their respective perceived qualities. Moreover, an individual household typically purchases only a limited scope of the variety available. This purchase behavior leads to the well-known "corner solutions" problem whereby expenditure on most goods is typically zero. Therefore, a satisfactory microeconometric model needs to be able to accommodate a demand system over a wide array of differentiated offerings and a high incidence of corner solutions.
The remainder of the chapter surveys the neoclassical, microeconomic foundations of models of individual demand that allow for corner solutions.3 From an econometric perspective, non-purchase behavior contains valuable information about consumers' preferences, and the application of econometric models that impose strictly interior solutions would likely produce biased and inconsistent estimates of demand – a selection bias. However, models with corner solutions introduce a number of complicated computational challenges, including high-dimensional integration over truncated distributions and the evaluation of potentially complicated Jacobian matrices. The challenges associated with corner solutions have been recognized at least since Houthakker (1953) and Houthakker (1961), who discuss them as a special case of quantity rationing. We also discuss the role of discreteness both in the brand variants and quantities purchased. In particular, we explore the relationship between the popular discrete choice models of demand (e.g., logit) and the more general neoclassical models.
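The selection bias from imposing interior solutions can be seen in a short simulation. This is a toy censored-demand example (the linear latent demand, slope of -1, and parameter values are illustrative assumptions, not a model from the chapter): a naive regression that drops the zero-purchase observations recovers a price slope badly attenuated toward zero.

```python
import random

def ols_slope(xs, ys):
    # slope of a simple least-squares fit of y on x
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

random.seed(0)
prices, qty = [], []
for _ in range(20000):
    p = random.uniform(0.0, 2.0)
    q_latent = 1.0 - 1.0 * p + random.gauss(0.0, 1.0)  # true price slope: -1
    prices.append(p)
    qty.append(max(q_latent, 0.0))  # corner solution: demand censored at zero

# naive approach: drop the zero-consumption trips and run OLS on the rest
kept = [(p, q) for p, q in zip(prices, qty) if q > 0]
slope_dropped = ols_slope([p for p, _ in kept], [q for _, q in kept])
print(slope_dropped)  # attenuated toward zero relative to the true -1
```

The trips that survive the cut are precisely those with favorable demand shocks, so the shocks are correlated with price in the selected sample, which is the source of the inconsistency.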

2 We refer readers interested in a discussion of aggregation and the derivation of models designed for estimation with market-level data to the surveys by Deaton and Muellbauer (1980b), Nevo (2011), and Pakes (2014).
3 See also the following for surveys of microeconometric models: Nair and Chintagunta (2011) for marketing, Phaneuf and Smith (2005) for environmental economics, and Deaton and Muellbauer (1980b) for the consumption literature.


In a follow-up section, we discuss several important extensions of the empirical specifications used in practice. We discuss the role of income effects. For analytic tractability, many popular specifications impose homotheticity and quasi-linearity conditions that limit or eliminate income effects. We discuss non-homothetic versions of the classic discrete choice models that allow for more realistic asymmetric substitution patterns between vertically-differentiated goods. Another common restriction used in the literature is additive separability both across commodity groups and across the specific products available within a commodity group. This additivity implies that all products are gross substitutes, eliminating any scope for complementarity across goods. We discuss recent research that has analyzed settings with complementary goods. In many consumer goods categories, firms use complex non-linear pricing strategies that restrict the quantities a consumer can purchase to a small set of pre-packaged commodity bundles. We do not discuss the price discrimination itself, focusing instead on the indivisibility the commodity bundling imposes on demand behavior. In the final section of the survey, we discuss several important departures from the standard neoclassical framework. While most of the literature has focused on static models of brand choices, the timing of purchases can play an important role in understanding the impact of price promotions on demand. We discuss dynamic extensions that allow consumers to stock-pile storable goods based on their price expectations. The accommodation of purchase timing can lead to very different inferences about the price elasticity of demand. We also discuss the potential role of the supply side of the market and the resulting endogeneity biases associated with the strategic manner in which point-of-purchase marketing variables are determined by firms. 
Most of the literature on microeconometric models of demand has ignored such potential endogeneity in marketing variables. Finally, we address the emerging area of structural models of behavioral economics that challenge some of the basic elements of the neoclassical framework. We discuss recent evidence of mental accounting in the income effect that creates a surprising non-fungibility across different sources of purchasing power. We also discuss the role of social preferences and models of consumer-response to cause marketing campaigns. Several important additional extensions are covered in later chapters of this volume, including the role of consumer search, information acquisition and the formation of consideration sets (Chapter 4), the role of brands and branding (Chapter 6), and the role of durable goods and the timing of consumer adoption throughout the product life cycle (Chapter 7). Perhaps the most crucial omission herein is the discussion of taste heterogeneity, which is covered in depth in Chapter 2 of this volume. Consumer heterogeneity plays a central role in the literature on targeted marketing.




2 Empirical regularities in shopping behavior: The CPG laboratory

In this section, we document broad patterns of purchase behavior across US households in the consumer packaged goods (CPG) industry. We will use these shopping patterns in Section 3 as the basis for a microeconometric demand estimation framework derived from neoclassical consumer theory. The CPG industry represents a valuable laboratory in which to study consumer behavior. CPG brands are widely available across store formats including grocery stores, supermarkets, discount and club stores, drug stores, and convenience stores. They are also purchased at a relatively high frequency. The average US household conducted 1.7 grocery trips per week in 2017.4 Most importantly, CPG spending represents a sizable portion of household budgets. In 2014, the global CPG sector was valued at $8 trillion and was predicted to grow to $14 trillion by 2025.5 In 2016, US households spent $407 billion on CPGs.6

A long literature in brand choice has used household panel data in CPG categories not only due to the economic relevance, but also due to the high quality of the data. CPG categories exhibit high-frequency price promotions that can be exploited for demand estimation purposes. We use the Nielsen Homescan panel housed by the Kilts Center for Marketing at the University of Chicago Booth School of Business to document CPG purchase patterns. The Homescan panelists are nationally representative.7 The database tracks purchases in 1,011 CPG product categories (denoted by Nielsen's product module codes) for over 132,000 households between 2004 and 2012, representing over 88 million shopping trips. Nielsen classifies product categories using module codes. Examples of product modules include Carbonated Soft Drinks, Ready-to-Eat Cereals, Laundry Detergents, and Tooth Paste. We retain the 2012 transaction data to document several empirical regularities in shopping behavior.
In 2012, we observe 52,093 households making over 6.57 million shopping trips during which they purchase over 46 million products. The typical CPG category offers a wide amount of variety to consumers. Focusing only on the products actually purchased by Homescan panelists in 2012, the average category offers 402.8 unique products as indexed by a universal product code (UPC) and 64.4 unique brands. For instance, a brand might be any UPC-coded product with the brand name Coca Cola, whereas a UPC might be a 6-pack of 12-oz cans of Coca Cola. While the subset of available brands and sizes varies across stores and regions, these numbers reveal the extent of variety available to consumers. In addition, CPG products are sold in pre-packaged, indivisible pack sizes. The average category offers 31.9 different pack size choices. The average brand within a category is sold in 5.4 different pack sizes. Therefore, consumers face an interesting indivisibility constraint, especially if they are determined to buy a specific brand variant. Moreover, CPG firms' widespread pre-commitment to specific sizes is suggestive of extensive use of non-linear pricing.

For the average category, we observe 39,787 trips involving at least one purchase. Households purchase a single brand and a single pack during 94.3% and 67.3% of the category-trip combinations, respectively. On average, households purchase 1.07 brands per category-trip. In sum, the discrete brand choice assumption, and to a lesser extent the discrete quantity choice assumption, seems broadly appropriate across trips at the category level. However, we do observe categories in which the contemporaneous purchase of assortments is more commonplace, and many of these categories are economically large. In the Ready-to-Eat Cereals category, which ranks third overall in total household expenditures among all CPG categories, consumers purchase a single brand during only 72.6% of trips. Similarly, for Carbonated Soft Drinks and Refrigerated Yogurt, which rank fourth and tenth overall respectively, consumers purchase a single brand during only 81.5% and 86.6% of trips respectively. Therefore, case studies of some of the largest CPG categories may need to consider demand models that allow for the purchase of variety, even though only a small number of the variants is chosen on any given trip.

4 Source: "Consumers' weekly grocery shopping trips in the United States from 2006 to 2017," Statista, 2017, accessed at https://www.statista.com/statistics/251728/weekly-number-of-us-grocery-shoppingtrips-per-household/ on 11/13/2017.
5 Source: "Three myths about growth in consumer packaged goods," by Rogerio Hirose, Renata Maia, Anne Martinez, and Alexander Thiel, McKinsey, June 2015, accessed at https://www.mckinsey.com/industries/consumer-packaged-goods/our-insights/three-myths-about-growth-in-consumer-packagedgoods on 11/13/2017.
6 Source: "Consumer packaged goods (CPG) expenditure of U.S. consumers from 2014 to 2020," Statista, 2017, accessed at https://www.statista.com/statistics/318087/consumer-packaged-goods-spending-of-usconsumers/ on 11/13/2017.
7 See Einav et al. (2010) for a validation study of the Homescan data.
Similarly, we observe many categories where consumers occasionally purchase multiple packs of a product, even when only a single brand is chosen. In these cases, a demand model that accounts for the intensive margin of quantity purchased may be necessary. We also observe brand switching across time within a category, especially in some of the larger categories. For instance, during the course of the year, households purchased 7.5 brands of Ready-to-Eat Cereals (ranked 3rd), 5.9 brands of Cookies (ranked 11th), 4.7 brands of Bakery Bread (ranked 7th), and 4.6 brands of Carbonated Soft Drinks (ranked 4th). In many of the categories with more than an average of 3 brands purchased per household-year, we typically observe only one brand being chosen during an individual trip. In summary, a snapshot of a single year of CPG shopping behavior by a representative sample of consumers indicates some striking patterns. In spite of the availability of a large amount of variety, an individual consumer purchases only a very small number of variants during the course of a year, let alone on any given trip. From a modeling perspective, we observe a high incidence of corner solutions. In some of the largest product categories, consumers routinely purchase assortments, leading to complex patterns of corner solutions. In most categories, the corner solutions degenerate to a pure discrete choice scenario where a single unit of a single product is purchased. In these cases, the standard discrete choice models may be sufficient.




However, the single unit is typically one of several pre-determined pack sizes available, suggesting an important role for indivisibility on the demand side, and non-linear pricing on the supply side.
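Summary statistics like the single-brand shares above are simple aggregations of a transaction log. The following minimal sketch uses a toy, hypothetical log (the field layout and records are illustrative, not the actual Homescan schema) to compute the share of category-trips involving exactly one brand:

```python
from collections import defaultdict

# toy records: (household, trip, category, brand, packs); values are made up
rows = [
    ("h1", 1, "cereal", "BrandA", 1),
    ("h1", 1, "cereal", "BrandB", 1),  # an assortment: two brands on one trip
    ("h1", 2, "cereal", "BrandA", 2),
    ("h2", 1, "soda",   "BrandC", 1),
]

brands_per_trip = defaultdict(set)
for hh, trip, cat, brand, packs in rows:
    brands_per_trip[(hh, trip, cat)].add(brand)

single = sum(1 for b in brands_per_trip.values() if len(b) == 1)
share_single_brand = single / len(brands_per_trip)
print(share_single_brand)  # 2 of the 3 category-trips involve a single brand
```

The same grouping logic extends directly to pack counts per trip or distinct brands per household-year.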

3 The neoclassical derivation of an empirical model of individual consumer demand

The empirical regularities in Section 2 show that household-level demand for consumer goods exhibits a high incidence of corner solutions: purchase occasions with zero expenditure on most items in the choice set. The methods developed in the traditional literature on demand estimation (e.g., Deaton and Muellbauer, 1980b) do not accommodate zero consumption. In this section, we review the formulation of the neoclassical consumer demand problem and the corresponding challenges with the accommodation of corner solutions into an empirical framework. Our theoretical starting point is the usual static model of utility maximization whereby the consumer spends a fixed budget on a set of competing goods. Utility theory plays a particularly crucial role in accommodating the empirical prominence of corner solutions in individual-level data.

3.1 The neoclassical model of demand with binding, non-negativity constraints

We start with the premise that the analyst has access to marketing data comprising individual-level transactions. The analyst's data include the exact vector of quantities purchased by a customer on a given shopping trip, $\hat{x} = (\hat{x}_1, \ldots, \hat{x}_{J+1})'$. An individual transaction database typically has a panel format with time-series observations (trips) for a cross section of customers. We assume that the point-of-purchase causal environment consists of prices, but the database could also include other marketing promotional variables. Our objective consists of deriving a likelihood for this observed vector of purchases from microeconomic primitives. Suppose WLOG that the consumer does not consume the first $l$ goods: $\hat{x}_j = 0$ $(j = 1, \ldots, l)$, and $\hat{x}_j > 0$ $(j = l+1, \ldots, J+1)$.

We use the neoclassical approach to deriving consumer demand from the assumption that each consumer maximizes a utility function $U(x; \theta, \varepsilon)$ defined over the quantity of goods consumed, $x = (x_1, \ldots, x_{J+1})'$. Since most marketing studies focus on demand behavior within a specific "product category," we adopt the terminology of Becker (1965) and distinguish between the "commodity" (e.g., the consumption benefit of the category, such as laundry detergent), and the $j = 1, \ldots, J$ "market goods" to which we will refer as "products" (e.g., the various brands sold within the product category, such as Tide and Wisk laundry detergents). The quantities in $x$ are non-negative ($x_j \geq 0$ $\forall j$) and satisfy the consumer's budget constraint $x'p \leq y$, where $p = (p_1, \ldots, p_{J+1})'$ is a vector of strictly positive prices and $y$ is the consumer's budget. The vector $\theta$ consists of unknown (to the researcher) parameters describing the consumer's underlying preferences and the vector $\varepsilon$ captures unobserved (to the researcher), mean-zero, consumer-specific utility disturbances.8 Typically $\varepsilon$ is assumed to be known to the consumer prior to decision-making. Formally, the utility maximization problem can be written as follows

$$V(p, y; \theta, \varepsilon) \equiv \max_{x \in \mathbb{R}^{J+1}} \left\{ U(x; \theta, \varepsilon) : x'p \leq y,\ x \geq 0 \right\} \tag{1}$$

where we assume $U(\cdot; \theta, \varepsilon)$ is a continuously-differentiable, quasi-concave, and increasing function.9 We can define the corresponding Lagrangian function $\mathcal{L} = U(x; \theta, \varepsilon) + \lambda_y \left(y - p'x\right) + \lambda_x' x$, where $\lambda_y$ and the vector $\lambda_x$ are Lagrange multipliers for the budget and non-negativity constraints respectively. A solution to (1) exists as long as the following necessary and sufficient Karush-Kuhn-Tucker (KKT) conditions hold

$$\begin{aligned}
&\frac{\partial U\left(x^*; \theta, \varepsilon\right)}{\partial x_j} - \lambda_y p_j + \lambda_{x,j} = 0, \quad j = 1, \ldots, J+1 \\
&y - p'x^* \geq 0, \quad \left(y - p'x^*\right)\lambda_y = 0, \quad \lambda_y > 0 \\
&x_j^* \geq 0, \quad x_j^* \lambda_{x,j} = 0, \quad \lambda_{x,j} \geq 0, \quad j = 1, \ldots, J+1.
\end{aligned} \tag{2}$$

Since $U(\cdot; \theta, \varepsilon)$ is increasing, the consumer spends her entire budget (the "adding-up" condition) and at least one good will always be consumed. We define the $J+1$ good as an "essential" numeraire with corresponding price $p_{J+1} = 1$ and with preferences that are separable from those over the commodity group.10 We assume additional regularity conditions on $U(\cdot; \theta, \varepsilon)$ to ensure that an interior quantity of $J+1$ is always consumed: $\frac{\partial U\left(x^*; \theta, \varepsilon\right)}{\partial x_{J+1}} = \lambda_y$ and $\lambda_{x,J+1} = 0$. Therefore, the model can accommodate the case where only the outside good is purchased and none of the inside goods are chosen. We can now re-write the KKT conditions as follows

$$\begin{aligned}
&\frac{\partial U\left(x^*; \theta, \varepsilon\right)}{\partial x_j} - \frac{\partial U\left(x^*; \theta, \varepsilon\right)}{\partial x_{J+1}} p_j + \lambda_{x,j} = 0, \quad j = 1, \ldots, J \\
&y - p'x^* = 0 \\
&x_j^* \geq 0, \quad x_j^* \lambda_{x,j} = 0, \quad \lambda_{x,j} \geq 0, \quad j = 1, \ldots, J.
\end{aligned} \tag{3}$$

8 It is straightforward to allow for additional persistent, unobserved taste heterogeneity by indexing the parameters themselves by consumer (see Chapter 2 of this volume).
9 These sufficient conditions ensure the existence of a demand function with a unique consumption level that maximizes utility at a given set of prices (e.g., Mas-Colell et al., 1995, Chapter 3).
10 The essential numeraire is typically interpreted as expenditures outside of the commodity group(s) of interest.



Demand estimation consists of devising an estimator for the parameters $\theta$ based on the solution to the system (3), $x^*(p, y; \theta, \varepsilon)$. For our observed consumer, recall that $\hat{x}_j = 0$ $(j = 1, \ldots, l)$ and $\hat{x}_j > 0$ $(j = l+1, \ldots, J+1)$. We can now re-write the KKT conditions to account for the corner solutions (i.e., non-consumption)

$$\begin{aligned}
&\frac{\partial U\left(x^*; \theta, \varepsilon\right)}{\partial x_j} - \frac{\partial U\left(x^*; \theta, \varepsilon\right)}{\partial x_{J+1}} p_j \leq 0, \quad j = 1, \ldots, l \\
&\frac{\partial U\left(x^*; \theta, \varepsilon\right)}{\partial x_j} - \frac{\partial U\left(x^*; \theta, \varepsilon\right)}{\partial x_{J+1}} p_j = 0, \quad j = l+1, \ldots, J.
\end{aligned} \tag{4}$$

It is instructive to consider how the KKT conditions (4) influence demand estimation. The $l+1$ to $J$ equality conditions in (4) implicitly characterize the conditional demand equations for the purchased goods. The $l$ inequality conditions in (4) give rise to the following demand regime-switching conditions, or "selection" conditions

$$\frac{\partial U\left(x^*; \theta, \varepsilon\right) / \partial x_j}{\partial U\left(x^*; \theta, \varepsilon\right) / \partial x_{J+1}} \leq p_j, \quad j = 1, \ldots, l \tag{5}$$

which determine whether a given product's price is above the consumer's reservation value, $\frac{\partial U\left(x^*; \theta, \varepsilon\right) / \partial x_j}{\partial U\left(x^*; \theta, \varepsilon\right) / \partial x_{J+1}}$ (see Lee and Pitt, 1986; Ransom, 1987, for a discussion of the switching regression interpretation). We can now see how dropping the observations with zero consumption will likely result in selection bias due to the correlation between the switching probabilities and the utility shocks, $\varepsilon$.

To complete the model, we need to allow for some separability of the utility disturbances. For instance, we can assume an additive, stochastic log-marginal utility: $\ln\left(\frac{\partial U\left(x^*; \theta, \varepsilon\right)}{\partial x_j}\right) = \ln\left(\bar{U}_j\left(x^*; \theta\right)\right) + \varepsilon_j$ for each $j$, where $\bar{U}_j\left(x^*; \theta\right)$ is deterministic. We also assume that $\varepsilon$ are random variables with known distribution and density, $F_\varepsilon(\varepsilon)$ and $f_\varepsilon(\varepsilon)$, respectively. We can now write the KKT conditions more compactly:

$$\begin{aligned}
&\tilde{\varepsilon}_j \equiv \varepsilon_j - \varepsilon_{J+1} \leq h_j\left(x^*; \theta\right), \quad j = 1, \ldots, l \\
&\tilde{\varepsilon}_j \equiv \varepsilon_j - \varepsilon_{J+1} = h_j\left(x^*; \theta\right), \quad j = l+1, \ldots, J
\end{aligned} \tag{6}$$

where $h_j\left(x^*; \theta\right) = \ln\left(\bar{U}_{J+1}\left(x^*; \theta\right)\right) - \ln\left(\bar{U}_j\left(x^*; \theta\right)\right) + \ln\left(p_j\right)$.

We can now derive the likelihood function associated with the observed consumption vector, $\hat{x}$. In the case where all the goods are consumed, the density of $\hat{x}$ is

$$f_x\left(\hat{x}; \theta\right) = f_{\tilde{\varepsilon}}\left(\tilde{\varepsilon}\right) \left|J\left(\hat{x}\right)\right| \tag{7}$$


3 The neoclassical derivation of an empirical model

where J(x̂) is the Jacobian of the transformation from ε̃ to x. If only the J + 1 numeraire good is consumed, the density of x̂ = (0, ..., 0) is

$$f_x(\hat x;\theta)=\int_{-\infty}^{h_1(\hat x;\theta)}\cdots\int_{-\infty}^{h_J(\hat x;\theta)}f_{\tilde\varepsilon}(\tilde\varepsilon)\,d\tilde\varepsilon_1\cdots d\tilde\varepsilon_J. \qquad(8)$$

For the more general case in which the first l goods are not consumed, the density of x̂ = (0, ..., 0, x̂_{l+1}, ..., x̂_J) is

$$f_x(\hat x;\theta)=\int_{-\infty}^{h_1(\hat x;\theta)}\cdots\int_{-\infty}^{h_l(\hat x;\theta)}f_{\tilde\varepsilon}\big(\tilde\varepsilon_1,...,\tilde\varepsilon_l,h_{l+1}(\hat x;\theta),...,h_J(\hat x;\theta)\big)\,|J(\hat x)|\,d\tilde\varepsilon_1\cdots d\tilde\varepsilon_l \qquad(9)$$

where J(x̂) is the Jacobian of the transformation from ε̃ to (x_{l+1}, ..., x_J) when (x_1, ..., x_l) = 0. Suppose the researcher has a data sample with i = 1, ..., N independent consumer purchase observations. The sample likelihood is

$$L(\theta|\hat x)=\prod_{i=1}^{N}f_x(\hat x_i). \qquad(10)$$

A maximum likelihood estimate of θ based on (10) is consistent and asymptotically efficient.
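To make the structure of (9) concrete, the following sketch evaluates a single likelihood contribution for a hypothetical two-good example (good 1 at a corner, good 2 purchased) with bivariate normal ε̃, using the fact that the joint density factors into a marginal density and a conditional normal c.d.f. The thresholds, covariance entries, and Jacobian value are illustrative placeholders, not estimates from any model.

```python
import math

# Sketch: likelihood contribution (9) for J = 2 goods (plus numeraire)
# when good 1 is at a corner (x1 = 0) and good 2 is purchased.
# All numeric values below are hypothetical placeholders.
def npdf(x, sd):
    # Normal density with mean zero
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def ncdf(x, mu, sd):
    # Normal c.d.f.
    return 0.5 * (1 + math.erf((x - mu) / (sd * math.sqrt(2))))

s11, s12, s22 = 1.0, 0.3, 1.5   # Cov(eps_tilde), hypothetical
h1, h2 = -0.2, 0.4              # hypothetical KKT thresholds h_j(x_hat; theta)
jac = 0.8                       # hypothetical |J(x_hat)|

f2 = npdf(h2, math.sqrt(s22))             # density component at the equality
mu_1g2 = s12 / s22 * h2                   # conditional mean of eps1 | eps2 = h2
sd_1g2 = math.sqrt(s11 - s12 ** 2 / s22)  # conditional std. dev.
lik = f2 * ncdf(h1, mu_1g2, sd_1g2) * jac # density x mass x Jacobian
print(lik)
```

The mass component is the conditional probability that ε̃_1 falls below its threshold; with more corner goods it becomes a multivariate rectangle probability that typically requires simulation.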

3.1.1 Estimation challenges with the neoclassical model
van Soest et al. (1993) have shown that the choice of functional form used to approximate utility, U(x), can influence the consistency of the maximum likelihood estimator based on (10). In particular, the KKT conditions in (2) generate a unique vector x*(p, y; θ, ε) at a given (p, y) for all possible θ and ε as long as U(x) is monotonic and strictly quasi-concave. When these conditions fail to hold, the system of KKT conditions (2) may not generate a unique solution, x*(p, y; θ, ε). This non-uniqueness leads to the well-known coherency problem with maximum likelihood estimation (Heckman, 1978),11 which can lead to inconsistent estimates. Note that the term coherency is used slightly differently in the more recent literature on empirical games with multiple equilibria. Tamer (2003) uses the term coherency in reference to the sufficient conditions for the existence of a solution x*(p, y; θ, ε) to the model (in this case, x* satisfies the KKT conditions). He uses the term model completeness in reference to the case where these sufficient conditions ensure that the statistical model has a well-defined likelihood. For our neoclassical model of demand, the econometric model would be termed "incomplete" if demand were a correspondence and, hence, there were multiple values of x* that satisfy the KKT conditions at a given (p, y; θ, ε).

11 Coherency pertains to the case where there is a unique vector x* generated by the KKT conditions corresponding to each possible value of ε, and there is a unique value of ε that generates each possible vector x* generated by the KKT conditions.

CHAPTER 1 Microeconometric models of consumer demand

van Soest et al. (1993) propose a set of parameter restrictions that are sufficient for coherency. For many specifications, these conditions will only ensure that the regularity of U(x) holds over the set of prices and quantities observed in the data. While these conditions may suffice for estimation, failure of the global regularity condition could be problematic for policy simulations that use the demand estimates to predict outcomes outside the range of values observed in the sample. For many specifications, the parameter restrictions may not have an analytic form and may require numerical tools to impose. As we will see in the examples below, the literature has often relied on special functional forms, with properties like additivity and homotheticity, to ensure global regularity and to satisfy the coherency conditions. However, these specifications come at the cost of less flexible substitution patterns.

In addition to coherency concerns, maximum likelihood estimation based on Eq. (9) also involves several computational challenges. If the system of KKT conditions does not generate a closed-form expression for the conditional demand equations, it may be difficult to impose coherency conditions. In addition, the likelihood comprises a density component for the goods with non-zero consumption and a mass component for the corners at which some of the goods have an optimal demand of zero. The mass component in (9) requires evaluating an l-dimensional integral over a region defined implicitly by the solution to the KKT conditions (4). When consumers purchase l of the alternatives, there are $\binom{J}{l}$ potential shopping baskets, and the model would need to be solved for each of the observed combinations. The likelihood also involves two changes of variables, from ε to ε̃ and from ε̃ to x̂, each requiring the computation of a Jacobian matrix. Estimation methods are beyond the scope of this discussion. However, a number of papers have proposed methods to accommodate several of the computational challenges above, including simulated maximum likelihood (Kao et al., 2001), hierarchical Bayesian algorithms that use MCMC methods based on Gibbs sampling (Millimet and Tchernis, 2008), hybrid methods that combine Gibbs sampling with Metropolis-Hastings (Kim et al., 2002), and GMM estimation (Thomassen et al., 2017). In the remainder of this section, we discuss several examples of functional forms for U(x) that have been implemented in practice.

3.1.2 Example: Quadratic utility
Due to its tractability, the quadratic utility function,

$$U(x;\theta,\varepsilon)=\sum_{j=1}^{J+1}\big(\beta_{j0}+\varepsilon_j\big)x_j+\frac{1}{2}\sum_{j=1}^{J+1}\sum_{k=1}^{J+1}\beta_{jk}x_jx_k, \qquad(11)$$

has been a popular functional form for empirical work (e.g., Wales and Woodland, 1983; Ransom, 1987; Lambrecht et al., 2007; Mehta, 2015; Yao et al., 2012; Thomassen et al., 2017).12 The random utility shocks in (11) are "random coefficients" capturing heterogeneity across consumers in the linear utility components for the various products. Assume WLOG that the consumer foregoes consumption of goods x_j = 0 (j = 1, ..., l) and chooses a positive quantity of goods x_j > 0 (j = l + 1, ..., J + 1). The corresponding KKT conditions are

$$\tilde\varepsilon_j+\beta_{j0}+\sum_{k=1}^{J+1}\beta_{jk}x_k^*-\left(\beta_{J+1,0}+\sum_{k=1}^{J+1}\beta_{J+1,k}x_k^*\right)p_j\le 0,\qquad j=1,...,l$$
$$\tilde\varepsilon_j+\beta_{j0}+\sum_{k=1}^{J+1}\beta_{jk}x_k^*-\left(\beta_{J+1,0}+\sum_{k=1}^{J+1}\beta_{J+1,k}x_k^*\right)p_j=0,\qquad j=l+1,...,J \qquad(12)$$

where ε̃_j = ε_j − p_j ε_{J+1} and, by the symmetry condition, β_{jk} = β_{kj}. Since demand from the quadratic utility function is homogeneous of degree zero in the parameters, we impose the normalization $\sum_{j=1}^{J+1}\beta_{j0}=1$. We have also re-written the estimation problem in terms of differences, ε̃, to resolve the adding-up condition.

If ε̃ ∼ N(0, Σ) and the consumer purchases x̂ = (0, ..., 0, x̂_{l+1}, ..., x̂_J), the corresponding likelihood is13

$$f_x(\hat x)=\int_{-\infty}^{h_1(\hat x;\theta)}\cdots\int_{-\infty}^{h_l(\hat x;\theta)}f_\varepsilon\big(\tilde\varepsilon_1,...,\tilde\varepsilon_l,h_{l+1}(\hat x;\theta),...,h_J(\hat x;\theta)\big)\,|J(\hat x)|\,d\tilde\varepsilon_1\cdots d\tilde\varepsilon_l \qquad(13)$$

where $h_j(\hat x;\theta)=-\beta_{j0}-\sum_{k=1}^{J+1}\beta_{jk}\hat x_k+\left(\beta_{J+1,0}+\sum_{k=1}^{J+1}\beta_{J+1,k}\hat x_k\right)p_j$, f_ε(ε) is the density corresponding to N(0, Σ), and J(x̂) is the Jacobian from ε̃ to (x_{l+1}, ..., x_J).

Ransom (1987) showed that concavity of the quadratic utility function (11) is sufficient for coherency of the maximum likelihood problem (13), even though monotonicity may not hold globally. Concavity is ensured if the matrix of cross-price effects, B, where B_{jk} = β_{jk}, is symmetric and negative definite. The advantages of the quadratic utility function include the flexibility of the substitution patterns between the goods, including a potential blend of complements and substitutes. However, the specification does not scale well in the number of products, J: the number of parameters increases quadratically with J due to the cross-price effects. Moreover, the challenge of imposing global regularity could be problematic for policy simulations using the demand parameters.

12 Thomassen et al. (2017) extend the quadratic utility model to allow for discrete store choice as well as the discrete/continuous expenditure allocation decisions across grocery product categories within a visited store.
13 The density f_ε̃(ε̃) is induced by f_ε(ε) and the fact that ε̃_j = ε_j − p_j ε_{J+1}.
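Ransom's coherency condition is straightforward to verify numerically. The sketch below checks symmetry and negative definiteness of a cross-effect matrix B via its eigenvalues; the matrix values are purely illustrative.

```python
import numpy as np

# Sketch: check Ransom's (1987) coherency condition for the quadratic
# utility model -- the cross-effect matrix B (B_jk = beta_jk) must be
# symmetric and negative definite. The matrix is a hypothetical example
# for J + 1 = 3 goods.
B = np.array([[-2.0, 0.5, 0.3],
              [0.5, -1.5, 0.2],
              [0.3, 0.2, -1.0]])

def is_coherent(B, tol=1e-10):
    symmetric = np.allclose(B, B.T)
    # eigvalsh assumes a symmetric matrix and returns real eigenvalues
    neg_definite = bool(np.all(np.linalg.eigvalsh(B) < -tol))
    return symmetric and neg_definite

print(is_coherent(B))
```

In practice this check (or a parameterization that enforces it, e.g., B = −LL' with L lower triangular) would be embedded inside the estimation routine.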




3.1.3 Example: Linear expenditure system (LES)
One of the classic utility specifications in the demand estimation literature is the Stone-Geary model:

$$U(x;\theta)=\sum_{j=1}^{J+1}\theta_j\ln\big(x_j-\theta_{j1}\big),\qquad\theta_j>0. \qquad(14)$$

Similar to the CES specification, the parameters θ_j measure the curvature of the sub-utility of each product and affect the rate of satiation. The translation parameters θ_{j1} allow for potential corner solutions. Stone-Geary preferences have been popular in the extant literature because the corresponding demand system can be solved analytically:

$$x_j^*=\theta_{j1}+\frac{\tilde\theta_j}{p_j}\left(y-\sum_k\theta_{k1}p_k\right),\qquad j=1,...,J+1 \qquad(15)$$

where x_j^* > θ_{j1}, ∀j, and where $\tilde\theta_j=\theta_j/\sum_k\theta_k$. The specification is often termed the "linear expenditure system" (LES) because the expenditure model is linear in prices:

$$p_jx_j^*=\theta_{j1}p_j+\tilde\theta_j\left(y-\sum_k\theta_{k1}p_k\right). \qquad(16)$$

Corner solutions with binding non-negativity constraints can arise when θ_{j1} ≤ 0 and, consequently, product j is "inessential" (Kao et al., 2001; Du and Kamakura, 2008). Assume WLOG that the consumer foregoes consumption of goods x_j = 0 (j = 1, ..., l) and chooses a positive quantity of goods x_j > 0 (j = l + 1, ..., J + 1). If we let $\theta_j=e^{\bar\theta_j+\varepsilon_j}$, where $\bar\theta_{J+1}=0$, then the KKT conditions are:

$$\tilde\varepsilon_j+\bar\theta_j-\ln\big(-\theta_{j1}\big)+\ln\left(y-\sum_{k=1}^{J}x_k^*p_k-\theta_{J+1,1}\right)-\ln p_j\le 0,\qquad j=1,...,l$$
$$\tilde\varepsilon_j+\bar\theta_j-\ln\big(x_j^*-\theta_{j1}\big)+\ln\left(y-\sum_{k=1}^{J}x_k^*p_k-\theta_{J+1,1}\right)-\ln p_j=0,\qquad j=l+1,...,J \qquad(17)$$

where ε̃_j = ε_j − ε_{J+1} and θ_{j1} ≤ 0 for j = 1, ..., l.

If ε̃ ∼ N(0, Σ) and the consumer purchases x̂ = (0, ..., 0, x̂_{l+1}, ..., x̂_J), the corresponding likelihood is

$$f_x(\hat x;\theta)=\int_{-\infty}^{h_1(\hat x;\theta)}\cdots\int_{-\infty}^{h_l(\hat x;\theta)}f_\varepsilon\big(\tilde\varepsilon_1,...,\tilde\varepsilon_l,h_{l+1}(\hat x;\theta),...,h_J(\hat x;\theta)\big)\,|J(\hat x)|\,d\tilde\varepsilon_1\cdots d\tilde\varepsilon_l \qquad(18)$$

where $h_j(\hat x;\theta)=-\bar\theta_j+\ln\big(-\theta_{j1}\big)-\ln\left(y-\sum_{k=1}^{J}\hat x_kp_k-\theta_{J+1,1}\right)+\ln p_j$, f_ε(ε) is the density corresponding to N(0, Σ), and J(x̂) is the Jacobian from ε̃ to (x_{l+1}, ..., x_J).

Some advantages of the LES specification include the fact that the utility function is globally concave, obviating the need for additional restrictions to ensure model coherency. In addition, the LES scales better than the quadratic utility, as the number of parameters to be estimated grows linearly with the number of products, J. However, the specification does not allow for the same degree of flexibility in the substitution patterns between goods. The additive separability of the sub-utility functions associated with each good implies that the marginal utility of one good is independent of the level of consumption of all the other goods. Therefore, the goods are assumed to be strict Hicksian substitutes, and any substitution between products arises through the budget constraint. The additive structure also rules out the possibility of inferior goods (see Deaton and Muellbauer, 1980b, p. 139).
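Because the LES demands (15) are available in closed form, they are trivial to compute. The sketch below evaluates them for hypothetical parameter values and confirms the adding-up condition (spending exhausts the budget); all numbers are illustrative, with the third good playing the role of the numeraire.

```python
import numpy as np

# Sketch: interior LES/Stone-Geary demands from Eq. (15), hypothetical values.
theta = np.array([0.2, 0.3, 0.5])      # theta_j (curvature weights)
theta1 = np.array([-1.0, 0.5, 0.0])    # translation parameters theta_j1
p = np.array([1.5, 2.0, 1.0])          # prices (numeraire price = 1)
y = 20.0                               # budget

theta_tilde = theta / theta.sum()
# x_j* = theta_j1 + theta_tilde_j * (y - sum_k theta_k1 p_k) / p_j
x = theta1 + theta_tilde * (y - theta1 @ p) / p
print(x)
print(p @ x)  # total spending; equals y by adding up
```

A corner solution for good j would replace the closed form with the inequality branch of (17) whenever the implied x_j* would violate x_j* > θ_{j1} with θ_{j1} ≤ 0.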

3.1.4 Example: Translated CES utility
Another popular specification for empirical work is the translated CES utility function (Pollak and Wales, 1992; Kim et al., 2002):

$$U(x^*;\theta,\varepsilon)=\sum_{j=1}^{J}\psi_j\big(x_j+\gamma_j\big)^{\alpha_j}+\psi_{J+1}x_{J+1}^{\alpha_{J+1}} \qquad(19)$$

where $\psi_j=\bar\psi_j\exp(\varepsilon_j)>0$ is the stochastic perceived quality of a unit of product j, γ_j ≥ 0 is a translation of the utility, α_j ∈ (0, 1] is a satiation parameter, and the collection of parameters to be estimated consists of $\theta=\{\alpha_j,\gamma_j,\bar\psi_j\}_{j=1}^{J+1}$. This specification nests several well-known models, such as the translated Cobb-Douglas or "linear expenditure system" (α_j → 0) and the translated Leontief (α_j → −∞). For any product j, setting γ_j = 0 would ensure a strictly interior quantity, x_j^* > 0. The CES specification has also been popular due to its analytic solution when quantities demanded are strictly interior. See, for instance, applications to nutrition preferences by Dubois et al. (2014) and Allcott et al. (2018).

For the more general case with corner solutions, the logarithmic form of the KKT conditions associated with the translated CES utility model is

$$\tilde\varepsilon_j\le h_j(x_j^*;\theta),\qquad j=1,...,l$$
$$\tilde\varepsilon_j=h_j(x_j^*;\theta),\qquad j=l+1,...,J \qquad(20)$$

where $h_j(x_j^*;\theta)=\ln\!\left(\bar\psi_{J+1}\alpha_{J+1}\big(x_{J+1}^*\big)^{\alpha_{J+1}-1}\right)-\ln\!\left(\bar\psi_j\alpha_j\big(x_j^*+\gamma_j\big)^{\alpha_j-1}\right)+\ln p_j$ and ε̃_j = ε_j − ε_{J+1}.

If ε̃ ∼ N(0, Σ) and the consumer purchases x̂ = (0, ..., 0, x̂_{l+1}, ..., x̂_J), the corresponding likelihood is

$$f_x(\hat x;\theta)=\int_{-\infty}^{h_1(\hat x;\theta)}\cdots\int_{-\infty}^{h_l(\hat x;\theta)}f_\varepsilon\big(\tilde\varepsilon_1,...,\tilde\varepsilon_l,h_{l+1}(\hat x;\theta),...,h_J(\hat x;\theta)\big)\,|J(\hat x)|\,d\tilde\varepsilon_1\cdots d\tilde\varepsilon_l \qquad(21)$$

where f_ε(ε) is the density corresponding to N(0, Σ) and J(x̂) is the Jacobian from ε̃ to (x_{l+1}, ..., x_J).

If instead we assume ε ∼ i.i.d. EV(0, σ), Bhat (2005) and Bhat (2008) derive a simpler, closed-form expression for the likelihood, with analytic solutions to the integrals and the Jacobian J(x̂) in (21):

$$f_x(\hat x;\theta)=\frac{1}{\sigma^{J-l}}\left[\prod_{i=l+1}^{J+1}f_i\right]\left[\sum_{i=l+1}^{J+1}\frac{p_i}{f_i}\right]\left[\frac{\prod_{i=l+1}^{J+1}e^{h_i(\hat x;\theta)/\sigma}}{\left(\sum_{j=1}^{J+1}e^{h_j(\hat x;\theta)/\sigma}\right)^{J-l+1}}\right](J-l)! \qquad(22)$$

where, changing the notation from above slightly, we define $h_j(x_j^*;\theta)=\bar\psi_j+\big(\alpha_j-1\big)\ln\big(x_j^*+\gamma_j\big)-\ln p_j$ and $f_i=\frac{1-\alpha_i}{x_i^*+\gamma_i}$.

An alternative formulation of the utility function specifies an additive model of utility over stochastic consumption needs instead of over products (Hendel, 1999; Dubé, 2004):

$$U(x^*;\theta,\psi)=\sum_{t=1}^{T}\left(\sum_{j=1}^{J}\psi_{jt}x_{jt}\right)^{\alpha}.$$

One interpretation is that the consumer shops in anticipation of T separate future consumption occasions (Walsh, 1995), where T ∼ Poisson(λ). The consumer draws the marginal utilities per unit of each product independently across the T consumption occasions, $\psi_{jt}\sim F(\bar\psi_j)$. The estimable parameters consist of $\theta=(\lambda,\bar\psi_1,...,\bar\psi_J,\alpha)$. For each of the t = 1, ..., T occasions, the consumer has perfect substitutes preferences over the products and chooses a single alternative. The purchase of variety on a given trip arises from the aggregation of the choices for each of the consumption occasions. Non-purchase is handled by imposing indivisibility on the quantities, although a translation parameter like the one in (19) above could also be used if divisibility were allowed.

Like the LES specification, the translated CES model is monotonic and quasi-concave, ensuring coherency of the maximum likelihood problem. The model also scales better than the quadratic utility, as the number of parameters to be estimated grows linearly with the number of products, J. Scalability is improved even further by projecting the perceived quality parameters, ψ_j, onto a lower-dimensional space of observed product characteristics (Hendel, 1999; Dubé, 2004; Kim et al., 2007). But the translated CES specification derived above assumes the products are strict Hicksian substitutes, which limits the substitution patterns implied by the model.14 Moreover, with a large number of goods and small budget shares, the implied cross-price elasticities will be small in these models (Mehta, 2015).
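The closed form in (22) is easy to evaluate directly. The following sketch transcribes it for a hypothetical basket in which good 1 is at a corner and goods 2, 3, and the numeraire are purchased; all parameter values are illustrative.

```python
import math

# Sketch: Bhat's closed-form MDCEV likelihood, Eq. (22), for J = 3 inside
# goods plus the numeraire (indexed last). Good 1 is not purchased (l = 1);
# goods 2, 3, and the numeraire are. All values are hypothetical.
psi_bar = [0.5, 1.0, 0.8, 0.0]   # baseline log-qualities psi_bar_j
alpha = [0.5, 0.5, 0.5, 0.7]     # satiation parameters alpha_j
gamma = [1.0, 1.0, 1.0, 0.0]     # translation parameters gamma_j
p = [1.2, 1.0, 1.5, 1.0]         # prices (numeraire price = 1)
x = [0.0, 2.0, 1.0, 5.0]         # observed quantities x_hat
sigma = 1.0
J, l = 3, 1

def h(j):
    # h_j = psi_bar_j + (alpha_j - 1) ln(x_j + gamma_j) - ln p_j
    return psi_bar[j] + (alpha[j] - 1) * math.log(x[j] + gamma[j]) - math.log(p[j])

f = [(1 - alpha[i]) / (x[i] + gamma[i]) for i in range(J + 1)]
consumed = range(l, J + 1)       # indices of the purchased goods

lik = ((1 / sigma ** (J - l))
       * math.prod(f[i] for i in consumed)
       * sum(p[i] / f[i] for i in consumed)
       * math.prod(math.exp(h(i) / sigma) for i in consumed)
       / sum(math.exp(h(j) / sigma) for j in range(J + 1)) ** (J - l + 1)
       * math.factorial(J - l))
print(lik)
```

The expression scales linearly in J, which is one reason the MDCEV form has been popular for baskets with many alternatives.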

3.1.5 Virtual prices and the dual approach
Thus far, we have used a primal approach to derive the neoclassical model of demand with binding non-negativity constraints from a parametric model of utility. Most of the functional forms used to approximate utility in practice impose restrictions motivated by technical convenience. As we saw above, these restrictions can limit the flexibility of the demand model along dimensions such as the substitution patterns between products and income effects. For instance, the additivity assumption resolves the global regularity concerns, but restricts the products to be strict substitutes. Accommodating more flexible substitution patterns becomes computationally difficult even for a simple specification like quadratic utility due to the coherency conditions.

The dual approach has been used to derive demand systems from less restrictive assumptions (Deaton and Muellbauer, 1980b). Lee and Pitt (1986) developed an approach that uses duality to derive demand while accounting for binding non-negativity constraints using virtual prices. The advantage of the dual approach is that a flexible functional form can be used to approximate the indirect utility and cost functions, which can then be used to determine the relevant restrictions ensuring that the derived demand system is consistent with microeconomic principles. The trade-off from using this dual approach is that the researcher loses the direct connection between the demand parameters and their deep structural interpretation as specific aspects of "preferences." These specifications may therefore be less suitable for marketing applications to problems such as product design, consumer quality choice, and the valuation of product features.

We begin with the consumer's indirect utility function

$$V(p,y;\theta,\varepsilon)=\max_{x\in\mathbb{R}^{J+1}}\big\{U(x;\theta,\varepsilon)\,|\,p'x=y\big\} \qquad(23)$$

where the underlying utility function U(x; θ, ε) is again assumed to be strictly quasi-concave, continuously differentiable, and increasing. Roy's Identity generates a system of notional demand equations

$$\tilde x_j(p,y;\theta,\varepsilon)=-\frac{\partial V(p,y;\theta,\varepsilon)/\partial p_j}{\partial V(p,y;\theta,\varepsilon)/\partial y},\qquad\forall j. \qquad(24)$$

14 Kim et al. (2002) apply the model to the choices between flavor variants of yogurt, where consumption complementarities are unlikely. However, this restriction would be more problematic for empirical studies of substitution between broader commodity groups.




These demand equations are notional because they do not impose non-negativity and can therefore take negative values. In essence, x̃ is a latent variable since it is negative for products that are not purchased. Note that Roy's Identity requires that prices are fixed and independent of the quantities purchased by the consumer, an assumption that fails in settings where firms use non-linear pricing such as promotional quantity discounts.15

Lee and Pitt (1986) use virtual prices to handle products with zero quantity demanded (Neary and Roberts, 1980). Suppose the consumer's optimal consumption vector is x* = (0, ..., 0, x*_{l+1}, ..., x*_{J+1}) where, as before, she does not purchase the first l goods. We can define virtual prices based on Roy's Identity in Eq. (24) that exactly set the notional demands to zero for the non-purchased goods:

$$0=\frac{\partial V\big(\pi(\bar p,y;\theta,\varepsilon),\bar p,y;\theta,\varepsilon\big)}{\partial p_j},\qquad j=1,...,l$$

where $\pi(\bar p,y;\theta,\varepsilon)=\big(\pi_1(\bar p,y;\theta,\varepsilon),...,\pi_l(\bar p,y;\theta,\varepsilon)\big)$ is the l-vector of virtual prices and p̄ = (p_{l+1}, ..., p_J). These virtual prices act like reservation prices for the non-purchased goods. We can derive the positive demands for goods j = l + 1, ..., J + 1 by substituting the virtual prices into Roy's Identity:

$$x_j^*(\bar p,y;\theta,\varepsilon)=-\frac{\partial V\big(\pi(\bar p,y;\theta,\varepsilon),\bar p,y;\theta,\varepsilon\big)/\partial p_j}{\partial V\big(\pi(\bar p,y;\theta,\varepsilon),\bar p,y;\theta,\varepsilon\big)/\partial y},\qquad j=l+1,...,J+1. \qquad(25)$$

The regime-switching conditions under which products j = 1, ..., l are not purchased consist of comparing virtual prices and observed prices:

$$\pi_j(\bar p,y;\theta,\varepsilon)\le p_j,\qquad j=1,...,l. \qquad(26)$$

Lee and Pitt (1986) demonstrate the parallel between the switching conditions based on virtual prices in (26) and the binding non-negativity constraints in the KKT conditions, (4). The demand parameters θ can then be estimated by combining the conditional demand system, (25), and the regime-switching conditions, (26). If the consumer purchases x̂ = (0, ..., 0, x̂_{l+1}, ..., x̂_{J+1}), the corresponding likelihood is

$$f_x(\hat x;\theta)=\int_{\pi_1^{-1}(\bar p,y,p_1;\theta)}^{\infty}\cdots\int_{\pi_l^{-1}(\bar p,y,p_l;\theta)}^{\infty}f_\varepsilon\big(\varepsilon_1,...,\varepsilon_l,x_{l+1}^{*-1}(\hat x,\bar p,y;\theta),...,x_J^{*-1}(\hat x,\bar p,y;\theta)\big)\,|J(\hat x)|\,d\varepsilon_1\cdots d\varepsilon_l \qquad(27)$$

15 Howell et al. (2016) show how a primal approach with a parametric utility specification can be used in the presence of non-linear pricing.


where f_ε(ε) is the density corresponding to N(0, Σ) and J(x̂) is the Jacobian from ε to (x_{l+1}, ..., x_J). The inverse functions in (27) reflect the fact that $\pi_j^{-1}(\bar p,y,p_j;\theta)\le\varepsilon_j$ for j = 1, ..., l and $x_j^{*-1}(\hat x,\bar p,y;\theta)=\varepsilon_j$ for j = l + 1, ..., J.

As with the primal problem, the choice of functional form for the indirect utility, V(p, y; θ, ε), can influence the coherency of the maximum likelihood estimator for θ in (27). van Soest et al. (1993) show that the uniqueness of the demand function defined by Roy's Identity in (24) will hold if the indirect utility function V(p, y; θ, ε) satisfies the following three regularity conditions:
1. V(p, y; θ, ε) is homogeneous of degree zero.
2. V(p, y; θ, ε) is twice continuously differentiable in p and y.
3. V(p, y; θ, ε) is regular, meaning that the Slutsky matrix is negative semi-definite.
For many popular and convenient flexible functional forms, the Slutsky matrix may fail to satisfy negativity, leading to the coherency problem. In many of these cases, such as the AIDS model, the virtual prices may need to be derived numerically, making it difficult to derive analytic restrictions that would ensure these regularity conditions hold. For these reasons, the homothetic translog specification discussed below has been extremely popular in practice.

3.1.6 Example: Indirect translog utility
One of the most popular implementations of the dual approach described in Section 3.1.5 uses the translog approximation of the indirect utility function (e.g., Lee and Pitt, 1986; Millimet and Tchernis, 2008; Mehta, 2015):

$$V(p,y;\theta,\varepsilon)=\sum_{j=1}^{J+1}\theta_j\ln\!\left(\frac{p_j}{y}\right)+\frac{1}{2}\sum_{j=1}^{J+1}\sum_{k=1}^{J+1}\theta_{jk}\ln\!\left(\frac{p_j}{y}\right)\ln\!\left(\frac{p_k}{y}\right). \qquad(28)$$

The econometric error is typically introduced by assuming θ_j = θ̄_j + ε_j, where ε_j ∼ F(ε). Roy's Identity gives us the notional expenditure share for product j:

$$\tilde s_j=\frac{-\theta_j-\sum_{k=1}^{J+1}\theta_{jk}\ln\!\left(\frac{p_k}{y}\right)}{1-\sum_k\sum_l\theta_{kl}\ln\!\left(\frac{p_l}{y}\right)}. \qquad(29)$$

van Soest and Kooreman (1990) derived slightly weaker sufficient conditions for coherency of the translog approach than van Soest et al. (1993). Following van Soest and Kooreman (1990), we impose the following additional restrictions, which are sufficient for the concavity of the underlying utility function and, hence, the uniqueness of the demand system (29) for a given realization of ε:

$$\sum_{j=1}^{J+1}\theta_j=-1,\qquad\sum_k\theta_{jk}=0,\ \forall j,\qquad\theta_{jk}=\theta_{kj},\ \forall j,k. \qquad(30)$$

We can then re-write the expenditure share for product j as

$$s_j=-\theta_j-\sum_{k=1}^{J+1}\theta_{jk}\ln p_k. \qquad(31)$$

We can see from (31) that an implication of the restrictions in (30) is that they also impose homotheticity on preferences. For the translog specification, Mehta (2015) derived necessary and sufficient conditions for global regularity that are even weaker than the conditions in van Soest and Kooreman (1990). These conditions allow for more flexible income effects (normal and inferior goods) and for more flexible substitution patterns (substitutes and complements), mainly by relaxing homotheticity.16
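A quick numerical check of the restrictions in (30): with the θ_j summing to −1 and the rows of the symmetric matrix of θ_jk summing to zero, the homothetic shares in (31) automatically sum to one at any price vector. The parameter values below are illustrative.

```python
import numpy as np

# Sketch: homothetic translog budget shares, Eq. (31), under the
# restrictions (30). Hypothetical values for J + 1 = 3 goods.
theta = np.array([-0.3, -0.5, -0.2])     # sums to -1
Theta = np.array([[-0.10, 0.06, 0.04],   # symmetric, rows sum to zero
                  [0.06, -0.08, 0.02],
                  [0.04, 0.02, -0.06]])
p = np.array([1.5, 2.0, 1.0])

s = -theta - Theta @ np.log(p)
print(s)
print(s.sum())  # adds to one by construction
```

Symmetry plus zero row sums implies zero column sums, which is what makes the price terms cancel in the adding-up condition.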

3.2 The discrete/continuous product choice restriction in the neoclassical model
Perhaps due to their computational complexity, the application of the microeconometric models of variety discussed in Section 3 has been limited. However, the empirical regularities documented in Section 2 suggest that simpler models of discrete choice, with only a single product being chosen, could be used in many settings. Recall from Section 2 that, in the average category, a single product is chosen during 97% of trips. We now examine how our demand framework simplifies under discrete product choice. The discussion herein follows Hanemann (1984), Chiang and Lee (1992), Chiang (1991), and Chintagunta (1993).

3.2.1 The primal problem
The model in this section closely follows the simple re-packaging model with varieties from Deaton and Muellbauer (1980b). In these models, the consumption utility for a given product is based on its effective quantity consumed, which scales the quantity by the product's quality. As before, we assume the commodity group of interest comprises j = 1, ..., J substitute products. Products are treated as perfect substitutes so that, at most, a single variant is chosen. We also assume there is an additional essential numeraire good, indexed as product J + 1.

16 In an empirical application to consumer purchases over several CPG product categories, Mehta (2015) finds the proposed model fits the data better than the homothetic translog specification. However, when J = 2, the globally regular translog will exhibit the restrictive strict Hicksian substitutes property.


To capture discrete product choice within the commodity group, we assume the following bivariate utility over the commodity group and the essential numeraire:

$$U(x^*;\theta,\psi)=\tilde U\!\left(\sum_{j=1}^{J}\psi_jx_j,\ \psi_{J+1}x_{J+1}\right). \qquad(32)$$

The parameter vector ψ = (ψ_1, ..., ψ_{J+1}), ψ_j ≥ 0, measures the constant marginal utility of each of the products. In the literature, we often refer to ψ_j as the "perceived quality" of product j. Specifying the perceived qualities as random variables, ψ ∼ F(ψ), introduces random utility as a potential source of heterogeneity across consumers in their perceptions of product quality. We also assume regularity conditions on U(x*; θ, ψ) to ensure that a positive quantity of x_{J+1} is always chosen. To simplify the notation, let the total quality-weighted commodity consumption be $z_1=\sum_{j=1}^{J}\psi_jx_j$ and let $z_2=\psi_{J+1}x_{J+1}$, so that we can re-write utility as Ũ(z_1, z_2). The KKT conditions are

$$\frac{\partial\tilde U\big(\psi'x,\psi_{J+1}x_{J+1}\big)}{\partial z_1}\psi_j-\frac{\partial\tilde U\big(\psi'x,\psi_{J+1}x_{J+1}\big)}{\partial z_2}\psi_{J+1}p_j\le 0,\qquad j=1,...,J \qquad(33)$$

where ∂Ũ/∂z_1 is the marginal utility of total quality-weighted consumption within the commodity group. Because of the perfect substitutes specification, if a product within the commodity group is chosen, it will be product k if

$$\frac{p_k}{\psi_k}\le\frac{p_j}{\psi_j},\qquad j=1,...,J$$

and, hence, k exhibits the lowest price-to-quality ratio. As with the general model in Section 3, demand estimation will need to handle the regime-switching, or demand-selection, conditions. If

$$\frac{p_k}{\psi_k}>\frac{1}{\psi_{J+1}}\,\frac{\partial\tilde U\big(z_1^*,z_2^*;\theta,\psi\big)/\partial z_1}{\partial\tilde U\big(z_1^*,z_2^*;\theta,\psi\big)/\partial z_2},$$

then the consumer spends her entire budget on the numeraire: x*_{J+1} = y. Otherwise, the consumer allocates her budget between x_k* and x*_{J+1} to equate the marginal utility per dollar across the two:

$$\frac{\psi_k}{p_k}\,\frac{\partial\tilde U\big(z_1^*,z_2^*;\theta,\psi\big)}{\partial z_1}=\psi_{J+1}\,\frac{\partial\tilde U\big(z_1^*,z_2^*;\theta,\psi\big)}{\partial z_2}.$$

We define $h_j(x^*;\theta,\psi_{J+1})=p_j\psi_{J+1}\,\frac{\partial\tilde U(z_1^*,z_2^*;\theta,\psi)/\partial z_2}{\partial\tilde U(z_1^*,z_2^*;\theta,\psi)/\partial z_1}$, the reservation quality below which product j is not chosen. When none of the products are chosen, we can write the likelihood of x̂ = (0, ..., 0) as

$$f_x(\hat x;\theta)=\int_{-\infty}^{\infty}\int_{-\infty}^{h_1(\hat x;\theta,\psi_{J+1})}\cdots\int_{-\infty}^{h_J(\hat x;\theta,\psi_{J+1})}f_\psi(\psi)\,d\psi_1\cdots d\psi_J\,d\psi_{J+1}. \qquad(34)$$





When product 1 (WLOG) is chosen, we can write the likelihood of x̂ = (x̂_1, 0, ..., 0) as

$$f_x(\hat x;\theta)=\int_{-\infty}^{\infty}\int_{-\infty}^{h_2(\hat x;\theta,\psi_{J+1})}\cdots\int_{-\infty}^{h_J(\hat x;\theta,\psi_{J+1})}f_\psi\big(h_1(\hat x;\theta,\psi),\psi_2,...,\psi_{J+1}\big)\,|J(\hat x)|\,d\psi_2\cdots d\psi_J\,d\psi_{J+1} \qquad(35)$$

where J(x̂) is the Jacobian from ψ_1 to x̂_1. The likelihood now comprises a density component for the chosen alternative, j = 1, and a mass component for the remaining goods.
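The selection logic of this section reduces to a simple rule: find the product with the lowest price-to-quality ratio, then compare that ratio to a reservation value. The sketch below implements the rule with simulated qualities; the reservation ratio r is a hypothetical stand-in for the marginal-utility ratio at the corner, which would depend on the utility function and budget.

```python
import numpy as np

# Sketch: the perfect-substitutes choice rule of Section 3.2.1.
rng = np.random.default_rng(0)
p = np.array([2.5, 3.0, 1.8])                    # prices, hypothetical
psi = np.exp(rng.normal([0.2, 0.6, 0.1], 0.5))   # random perceived qualities

ratios = p / psi
k = int(np.argmin(ratios))   # candidate product: lowest p_j / psi_j
r = 2.0                      # hypothetical reservation ratio
# Buy the candidate only if its ratio is below the reservation value;
# otherwise spend the whole budget on the numeraire (choice = None).
choice = k if ratios[k] <= r else None
print(k, choice)
```

Repeating this over draws of ψ traces out the purchase-incidence and brand-choice probabilities that the likelihood in (34)–(35) integrates analytically.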

3.2.2 Example: Translated CES utility
Recall the translated CES utility function presented in Section 3.1.4 (Bhat, 2005; Kim et al., 2002):

$$U(x^*;\theta,\varepsilon)=\sum_{j=1}^{J}\psi_j\big(x_j+\gamma_j\big)^{\alpha_j}+\psi_{J+1}x_{J+1}^{\alpha_{J+1}}.$$

We can impose discrete product choice with the restrictions α_j = 1 and γ_j = 0 for j = 1, ..., J, which gives us perfect substitutes utility over the brands:

$$U(x^*;\theta,\varepsilon)=\sum_{j=1}^{J}\psi_jx_j+\psi_{J+1}x_{J+1}^{\alpha}.$$

Let $\psi_j=\exp(\bar\psi_j+\varepsilon_j)$, j = 1, ..., J, and $\psi_{J+1}=\exp(\varepsilon_{J+1})$, where ε_j ∼ i.i.d. EV(0, σ) (Deaton and Muellbauer, 1980a; Bhat, 2008). When none of the products are chosen, we can write the likelihood of x̂ = (0, ..., 0) as

$$f_x(\hat x)=\frac{1}{1+\sum_{j=1}^{J}\exp\!\left(\frac{\bar\psi_j-\ln p_j-\ln\left(\alpha y^{\alpha-1}\right)}{\sigma}\right)},$$

which is the multinomial logit model. If, WLOG, alternative 1 is chosen in the commodity group, the likelihood of x̂ = (x̂_1, 0, ..., 0) is

$$f(\hat x;\theta)=\frac{1-\alpha}{y-\hat x_1}\int_{-\infty}^{\infty}f_\varepsilon\big(h_1(\psi_{J+1};\hat x_1,p,\theta)\big)\left[\prod_{j=2}^{J}F_\varepsilon\big(h_1(\psi_{J+1};\hat x_1,p,\theta)+V_1-V_j\big)\right]f_\varepsilon(\varepsilon_{J+1})\,d\varepsilon_{J+1}$$

where $f_\varepsilon(\varepsilon)=\frac{1}{\sigma}e^{-\varepsilon/\sigma}\exp\!\left(-e^{-\varepsilon/\sigma}\right)$ and $F_\varepsilon(\varepsilon)=\exp\!\left(-e^{-\varepsilon/\sigma}\right)$ are the EV(0, σ) density and c.d.f., $V_j\equiv\bar\psi_j-\ln p_j$, and

$$h_j(\psi_{J+1};\hat x_1,p,\theta)=\ln\!\left(\psi_{J+1}\,\alpha\big(y-\hat x_1\big)^{\alpha-1}\right)+\ln p_j-\bar\psi_j$$

is the value of ε_j implied by the first-order condition for the quantity, with $\psi_{J+1}=\exp(\varepsilon_{J+1})$. Averaging the choice condition over the errors yields the brand-choice probability

$$P_k\equiv\Pr\!\big(\varepsilon_k+\bar\psi_k-\ln p_k\ge\varepsilon_j+\bar\psi_j-\ln p_j,\ j=1,...,J\big)=\frac{\exp\!\left(\frac{\bar\psi_k-\ln p_k}{\sigma}\right)}{\sum_{j=1}^{J}\exp\!\left(\frac{\bar\psi_j-\ln p_j}{\sigma}\right)}.$$
3.2.3 Example: The dual problem with indirect translog utility
Muellbauer (1974) has shown that maximizing the simple re-packaging utility function in (32) generates a bivariate indirect utility function of the form $V\!\left(\frac{p_k}{\psi_ky},\frac{1}{\psi_{J+1}y}\right)$ when product k is the preferred product in the commodity group of interest. Following the template in Hanemann (1984), several empirical studies of discrete/continuous brand choice have been derived based on the indirect utility function and the dual virtual prices. For instance, Chiang (1991) and Arora et al. (1998) use a second-order flexible translog approximation of the indirect utility function17:

$$V(p,y;\theta,\varepsilon)=\theta_1\ln\!\left(\frac{p_k}{\psi_ky}\right)+\theta_2\ln\!\left(\frac{1}{\psi_{J+1}y}\right)+\frac{1}{2}\theta_{11}\left[\ln\!\left(\frac{p_k}{\psi_ky}\right)\right]^2+\theta_{12}\ln\!\left(\frac{p_k}{\psi_ky}\right)\ln\!\left(\frac{1}{\psi_{J+1}y}\right)+\frac{1}{2}\theta_{22}\left[\ln\!\left(\frac{1}{\psi_{J+1}y}\right)\right]^2 \qquad(36)$$

where $\frac{p_k}{\psi_k}=\min_j\left\{\frac{p_j}{\psi_j}\right\}$, $\psi_j=\exp(\bar\psi_j+\varepsilon_j)$ for j = 1, ..., J, $\psi_{J+1}=\exp(\varepsilon_{J+1})$, and ε_j ∼ i.i.d. EV(0, σ). To facilitate the exposition, we impose the following restrictions to ensure coherency. The restrictions lead to the homothetic translog specification, which eliminates potentially interesting income effects in the substitution between products (see the concerns discussed earlier in Section 3.1.5):

$$\theta_1+\theta_2=-1,\qquad\theta_{11}+\theta_{12}=0,\qquad\theta_{12}+\theta_{22}=0. \qquad(37)$$

Roy's Identity gives us the notional expenditure share for product k:

$$\tilde s_k=-\theta_1-\theta_{11}\ln\!\left(\frac{p_k}{\psi_k}\right)-\theta_{11}\ln\psi_{J+1}. \qquad(38)$$

17 See Hanemann (1984) for other specifications, including LES and PIGLOG preferences. See also Chintagunta (1993) for the linear expenditure system or "Stone-Geary" specification.

From (38), we see that $\varepsilon_1=\frac{\hat s_1+\theta_1+\theta_{11}\ln p_1-\theta_{11}\bar\psi_1+\theta_{11}\varepsilon_{J+1}}{\theta_{11}}$. We can now compute the quality-adjusted virtual price (or reservation price) for purchase by setting (38) to zero: $R(\varepsilon_{J+1};\hat s)=\exp\!\left(-\frac{\theta_1+\theta_{11}\varepsilon_{J+1}}{\theta_{11}}\right)$.

If none of the products are chosen, then $\frac{p_k}{\psi_k}>R(\varepsilon_{J+1};\hat s)$ for all k, and the likelihood of ŝ = (0, ..., 0) is

$$f(\hat s;\theta)=\frac{\exp\!\left(\frac{\theta_1}{\sigma\theta_{11}}\right)}{\exp\!\left(\frac{\theta_1}{\sigma\theta_{11}}\right)+\sum_{j=1}^{J}\exp\!\left(\frac{\bar\psi_j-\ln p_j}{\sigma}\right)}. \qquad(39)$$

If, WLOG, product 1 is chosen, the likelihood of ŝ = (ŝ_1, 0, ..., 0) is

$$f(\hat s)=\int_{-\infty}^{\infty}\frac{1}{\sigma\theta_{11}}P_1\exp\!\left(-\frac{\hat s_1+\theta_1+\theta_{11}\ln p_1-\theta_{11}\bar\psi_1+\theta_{11}\varepsilon_{J+1}}{\theta_{11}\sigma}\right)\exp\!\left(-\exp\!\left(-\frac{\hat s_1+\theta_1+\theta_{11}\ln p_1-\theta_{11}\bar\psi_1+\theta_{11}\varepsilon_{J+1}}{\theta_{11}\sigma}\right)\right)f_\varepsilon(\varepsilon_{J+1})\,d\varepsilon_{J+1}$$

where

$$P_k\equiv\Pr\!\big(\varepsilon_k+\bar\psi_k-\ln p_k\ge\varepsilon_j+\bar\psi_j-\ln p_j,\ j=1,...,J\big)=\frac{\exp\!\left(\frac{\bar\psi_k-\ln p_k}{\sigma}\right)}{\sum_{j=1}^{J}\exp\!\left(\frac{\bar\psi_j-\ln p_j}{\sigma}\right)}.$$

As discussed in Mehta et al. (2010), the distributional assumption ε_j ∼ i.i.d. EV(0, σ) imposes a strong restriction on the price elasticity of the quantity purchased (conditional on purchase and brand choice), setting it very close to −1. This property can be relaxed by using a more flexible error distribution, such as multivariate normal errors. Alternatively, allowing for unobserved heterogeneity in the parameters of the conditional budget share expression (38) would alleviate this restriction at the population level.
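The reservation price R(ε_{J+1}; ŝ) and the no-purchase probability (39) can likewise be evaluated directly; the sketch below does so at ε_{J+1} = 0 with illustrative parameter values (θ_11 > 0 is assumed).

```python
import math

# Sketch: reservation price and no-purchase probability in the homothetic
# translog discrete/continuous model, Eqs. (38)-(39). Hypothetical values.
theta1, theta11, sigma = -0.4, 0.3, 1.0
psi_bar = [0.4, 0.9, 0.2]
p = [2.5, 3.0, 1.8]

# Quality-adjusted reservation price R(eps_{J+1}) evaluated at eps_{J+1} = 0
R = math.exp(-theta1 / theta11)
print(R)

# No-purchase probability, Eq. (39): logit over the J brands plus the
# "outside" term exp(theta1 / (sigma * theta11))
num = math.exp(theta1 / (sigma * theta11))
p_none = num / (num + sum(math.exp((pb - math.log(pj)) / sigma)
                          for pb, pj in zip(psi_bar, p)))
print(p_none)
```

In an estimation routine, the one-dimensional integral for the chosen-product case would typically be evaluated by Gaussian quadrature or simulation over ε_{J+1}.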

3.2.4 Promotion response: Empirical findings using the discrete/continuous demand model
An empirical literature has used the discrete/continuous specification of demand to decompose the total price elasticity of demand into three components: (1) purchase incidence, (2) brand choice, and (3) quantity choice. This literature seeks to understand the underlying consumer choice mechanism that drives the observation of a large increase in quantities sold in response to a temporary price cut. In particular, the research assesses the extent to which a pure discrete brand choice analysis, focusing only on component (2) (see Section 3.3 below), might miss part of the price elasticity and misinform the researcher or the retailer. Early work typically found that brand-switching elasticities accounted for most of the total price elasticity of demand in CPG product categories (e.g., Chiang, 1991; Chintagunta, 1993), though the unconditional brand choice elasticities were found to be larger than choice elasticities that condition on purchase. More recently, Bell et al.'s (1999) empirical generalizations indicate that the relative role of the brand-switching elasticity varies across product categories. On average, they find that the quantity decision accounts for 25% of the total price elasticity of demand, suggesting that purchase acceleration effects may be larger than previously assumed. These results are based on static models in which any purchase acceleration would be associated with an increase in consumption. In Section 5.1, we extend this discussion to models that allow forward-looking consumers to stockpile storable consumer goods in anticipation of higher future prices.

3.3 Indivisibility and the pure discrete choice restriction in the neoclassical model

The pure discrete choice behavior documented in the empirical stylized facts in Section 2 suggests a useful restriction for our demand models. In many product categories, the consumer purchases at most one unit of a single product on a given trip. Discrete choice behavior also broadly applies to other non-CPG product domains such as automobiles, computers, and electronic devices. The combination of pure discrete choice and indivisibility simplifies the discrete product choice model in Section 3.2 by eliminating the intensive margin of quantity choice, reducing the model to one of pure product choice. Not surprisingly, pure discrete choice models have become extremely popular for modeling demand, both in the context of micro data on consumer-level choices and with more macro data on aggregate market shares. We now present the classic pure discrete choice models of demand estimated in practice (e.g., multinomial logit and probit) and contrast them with the pure discrete choice model derived from the neoclassical framework above.

3.3.1 A neoclassical derivation of the pure discrete choice model of demand

Recall from Section 3 where we defined the neoclassical economic model of consumer choice based on the following utility maximization problem:
$$V(p, y; \theta, \varepsilon) \equiv \max_{x}\left\{U(x; \theta, \varepsilon) : x'p \leq y,\ x \geq 0\right\}$$
where we assume U(•; θ, ε) is a continuously-differentiable, quasi-concave, and increasing function. In that problem, we assumed non-negativity and perfect divisibility, xj ≥ 0, for each of the j = 1, ..., J products and the J + 1 essential numeraire. We now consider the case of indivisibility of the j = 1, ..., J products by adding the restriction xj ∈ {0, 1} for j = 1, ..., J. We also assume strong separability (i.e., additivity) of xJ+1 and perfect substitutes preferences over the j = 1, ..., J products such that:
$$U\left(\sum_{j=1}^{J}\psi_j x_j,\ \psi_{J+1}x_{J+1};\ \theta\right) = \sum_{j=1}^{J}\psi_j x_j + \tilde{u}\left(x_{J+1}; \psi_{J+1}\right)$$
where ψj = ψ̄j + εj and ũ(xJ+1; ψ̄J+1) = u(xJ+1; ψ̄J+1) + εJ+1. The KKT conditions will no longer hold under indivisible quantities. The consumer's choice problem consists of making a discrete choice among the following J + 1 choice-specific indirect utilities:
$$\begin{aligned}
v_j &= \bar{\psi}_j + u\left(y - p_j; \bar{\psi}_{J+1}\right) + \varepsilon_j + \varepsilon_{J+1} = \bar{v}_j + \varepsilon_j + \varepsilon_{J+1}, && x_j = 1 \\
v_{J+1} &= u\left(y; \bar{\psi}_{J+1}\right) + \varepsilon_{J+1} = \bar{v}_{J+1} + \varepsilon_{J+1}, && x_{J+1} = y.
\end{aligned} \tag{42}$$
The probability that the consumer chooses alternative 1 ∈ {1, ..., J} is (WLOG)
$$\begin{aligned}
\Pr(x_1 = 1) &= \Pr\left(v_1 \geq v_k,\ k \neq 1\right) = \Pr\left(\varepsilon_k \leq \bar{v}_1 - \bar{v}_k + \varepsilon_1,\ \forall k \neq 1,\ \varepsilon_1 \geq \bar{v}_{J+1} - \bar{v}_1\right) \\
&= \int_{\bar{v}_{J+1} - \bar{v}_1}^{\infty}\int_{-\infty}^{\bar{v}_1 - \bar{v}_2 + x} \cdots \int_{-\infty}^{\bar{v}_1 - \bar{v}_J + x} f\left(x, \varepsilon_2, \dots, \varepsilon_J\right)\, d\varepsilon_J \cdots d\varepsilon_2\, dx
\end{aligned} \tag{43}$$

where f(ε1, ..., εJ) is the density of (ε1, ..., εJ) and the probability of allocating the entire budget to the essential numeraire is simply $\Pr(x_{J+1} = y) = 1 - \sum_{j=1}^{J}\Pr(x_j = 1)$. If we assume (ε1, ..., εJ) ∼ i.i.d. EV(0, 1), the choice probabilities in (43) become
$$\begin{aligned}
\Pr(x_1 = 1) &= \frac{\exp\left(\bar{\psi}_1 + \tilde{u}\left(y - p_1; \bar{\psi}_{J+1}\right)\right)}{\sum_{k=1}^{J}\exp\left(\bar{\psi}_k + \tilde{u}\left(y - p_k; \bar{\psi}_{J+1}\right)\right)}\left(1 - e^{-e^{-\tilde{u}\left(y; \bar{\psi}_{J+1}\right)}\sum_{k=1}^{J}e^{\bar{\psi}_k + \tilde{u}\left(y - p_k; \bar{\psi}_{J+1}\right)}}\right) \\
\Pr(x_{J+1} = y) &= e^{-e^{-\tilde{u}\left(y; \bar{\psi}_{J+1}\right)}\sum_{k=1}^{J}e^{\bar{\psi}_k + \tilde{u}\left(y - p_k; \bar{\psi}_{J+1}\right)}}.
\end{aligned} \tag{44}$$
Suppose the researcher has a data sample with i = 1, ..., N independent consumer purchase observations. A maximum likelihood estimator of the model parameters can be constructed as follows:
$$L(\theta|y) = \prod_{i=1}^{N}\prod_{j=1}^{J}\Pr\left(x_j = 1\right)^{y_{ij}}\Pr\left(x_{J+1} = y\right)^{y_{iJ+1}} \tag{45}$$
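The probabilities in (44) are straightforward to compute. The sketch below assumes, purely for illustration, a log specification for the deterministic numeraire utility, ũ(z) = ψ_{J+1} ln z; all parameter values are hypothetical:

```python
import math

def discrete_choice_probs(psi_bar, prices, y, psi_out=1.0):
    """Choice probabilities of the form (44) under a hypothetical log numeraire utility."""
    u = lambda z: psi_out * math.log(z)          # assumed u-tilde(z) = psi_out * ln(z)
    v = [pb + u(y - p) for pb, p in zip(psi_bar, prices)]
    inclusive = sum(math.exp(vk) for vk in v)
    # Probability of allocating the entire budget to the numeraire, Pr(x_{J+1} = y)
    stay_out = math.exp(-math.exp(-u(y)) * inclusive)
    # Inside probabilities: logit share times the overall purchase probability
    inside = [math.exp(vk) / inclusive * (1.0 - stay_out) for vk in v]
    return inside, stay_out

inside, outside = discrete_choice_probs([0.5, 0.2], prices=[3.0, 2.0], y=10.0)
```

Note that the inside probabilities and the no-purchase probability sum to one, and that raising any inside price shifts mass toward the numeraire.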



where yij indicates whether observation i resulted in choice alternative j, and θ = (ψ̄1, ..., ψ̄J+1). While the probabilities (44) generate a tractable maximum likelihood estimator based on (45), the functional forms do not correspond to the familiar multinomial logit specification used throughout the literature on discrete choice demand (McFadden, 1981).18 To understand why the neoclassical models from earlier sections do not nest the usual discrete choice models, note that the random utilities εJ+1 "difference out" in (43) and the model is equivalent to a deterministic utility for the decision to allocate the entire budget to the numeraire. This result arises because of the adding-up condition associated with the budget constraint, which we resolved by assuming x*J+1 > 0, just as we did in Section 3 above.

Lee and Allenby (2014) extend the pure discrete choice model to allow for multiple discrete choices and indivisible quantities for each product. As before, assume there are j = 1, ..., J products and a J + 1 essential numeraire. To address the indivisibility of the j = 1, ..., J products, assume xj ∈ {0, 1, ...} for j = 1, ..., J. If utility is concave, increasing, and additive,19 $U(x) = \sum_{j=1}^{J}u_j(x_j) + \alpha_{J+1}(x_{J+1})$, the consumer's decision problem consists of selecting an optimal quantity for each of the products and the essential numeraire, subject to her budget constraint. Let $\Omega = \left\{(x_1, \dots, x_J)\ |\ y - \sum_j x_j p_j \geq 0,\ x_j \in \{0, 1, \dots\}\right\}$ be the set of feasible quantities that satisfy the consumer's budget constraint, where $x_{J+1} = y - \sum_j x_j p_j$. The consumer picks an optimal quantity vector x* ∈ Ω such that U(x*) ≥ U(x) for all x ∈ Ω. To derive a tractable demand solution, Lee and Allenby (2014) assume that utility has the following form20:
$$u_j(x) = \frac{\alpha_j \exp\left(\varepsilon_j\right)}{\gamma_j}\ln\left(\gamma_j x + 1\right).$$

The additive separability assumption is critical since it allows the optimal quantity of each brand to be determined separately. In particular, for each j = 1, ..., J, the optimality of x*j is ensured if
$$U\left(x_1^*, \dots, x_j^*, \dots, x_J^*\right) \geq \max\left\{U\left(x_1^*, \dots, x_j^* + \Delta, \dots, x_J^*\right)\ |\ x^* \in \Omega,\ \Delta \in \{-1, 1\}\right\}.$$
The limits of integration of the utility shocks, ε, can therefore be derived in closed form:
$$f_x\left(\hat{x}; \theta\right) = \prod_{j=1}^{J}\int_{lb_j}^{ub_j} f_{\varepsilon}\left(\varepsilon_j\right) d\varepsilon_j$$

18 Besanko et al. (1990) study the monopolistic equilibrium pricing and variety of brands supplied in a market with discrete choice demand of the form (44).

19 The tractability of this problem also requires assuming linearity in the essential numeraire, u(xJ+1) = α xJ+1, so that the derivation of the likelihood can be computed separately for each product alternative.

20 This specification is a special case of the translated CES model described earlier when the satiation parameter asymptotes to 0 (e.g., Bhat, 2008).




where
$$lb_j = \ln\left(\frac{\alpha_{J+1}\, p_j\, \gamma_j}{\alpha_j}\right) - \ln\ln\left(\frac{\gamma_j \hat{x}_j + 1}{\gamma_j\left(\hat{x}_j - 1\right) + 1}\right) \quad \text{and} \quad ub_j = \ln\left(\frac{\alpha_{J+1}\, p_j\, \gamma_j}{\alpha_j}\right) - \ln\ln\left(\frac{\gamma_j\left(\hat{x}_j + 1\right) + 1}{\gamma_j \hat{x}_j + 1}\right).$$
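For a given observed quantity x̂_j, these bounds can be computed directly. The sketch below uses hypothetical parameter values; a higher price shifts both bounds upward, since a larger utility shock is then needed to rationalize the same quantity:

```python
import math

def integration_limits(x_hat, alpha_j, gamma_j, alpha_out, p_j):
    """Closed-form limits of integration lb_j, ub_j for an observed quantity x_hat >= 1."""
    base = math.log(alpha_out * p_j * gamma_j / alpha_j)
    # x_hat preferred to x_hat - 1 pins down the lower bound on epsilon_j
    lb = base - math.log(math.log((gamma_j * x_hat + 1) /
                                  (gamma_j * (x_hat - 1) + 1)))
    # x_hat preferred to x_hat + 1 pins down the upper bound
    ub = base - math.log(math.log((gamma_j * (x_hat + 1) + 1) /
                                  (gamma_j * x_hat + 1)))
    return lb, ub

lb, ub = integration_limits(x_hat=2, alpha_j=1.0, gamma_j=0.5, alpha_out=0.8, p_j=1.5)
```

Concavity of ln(γx + 1) guarantees lb < ub, so the interval is always well defined.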

3.3.2 The standard pure discrete choice model of demand

Suppose as before that the consumer makes a discrete choice between each of the products in the commodity group. We again assume a bivariate utility over an essential numeraire and a commodity group, with perfect substitutes over the j = 1, ..., J products in the commodity group: $U\left(\sum_{j=1}^{J}\psi_j x_j,\ x_{J+1}\right)$. If we impose indivisibility on the product quantities such that xj ∈ {0, 1}, the choice problem once again becomes a discrete choice among the j = 1, ..., J + 1 alternatives
$$\begin{aligned}
v_j &= U\left(\psi_j,\ y - p_j\right) + \varepsilon_j, \quad j = 1, \dots, J \\
v_{J+1} &= U(0, y) + \varepsilon_{J+1}.
\end{aligned} \tag{46}$$
In this case, the random utility εJ+1 does not "difference out" and hence we will end up with a different system of choice probabilities. If we again assume that (ε1, ..., εJ+1) ∼ i.i.d. EV(0, 1), the corresponding choice probabilities have the familiar multinomial logit (MNL) form:
$$\Pr(j) \equiv \Pr\left(v_j \geq v_k, \text{ for } k \neq j\right) = \frac{\exp\left(U\left(\psi_j,\ y - p_j\right)\right)}{\exp\left(U(0, y)\right) + \sum_{k=1}^{J}\exp\left(U\left(\psi_k,\ y - p_k\right)\right)}. \tag{47}$$


Similarly, assuming (ε1, ..., εJ+1) ∼ N(0, Σ) would give rise to the standard multinomial probit. We now examine why we did not obtain the same system of choice probabilities as in the previous section. Unlike the derivation in the previous section, the random utilities in (46) were not specified as primitive assumptions on the underlying utility function $U\left(\sum_{j=1}^{J}\psi_j x_j,\ x_{J+1}\right)$. Instead, they were added on to the choice-specific values. An advantage of this approach is that it allows the researcher to be more agnostic about the exact interpretation of the errors. In the econometrics literature, the ε are interpreted as unobserved product characteristics, unobserved utility or tastes, measurement error, or specification error. However, the probabilistic choice model has also been derived by mathematical psychologists (e.g., Luce, 1977) who interpret the shocks as psychological states, leading to potentially non-rational forms of behavior. Whereas econometricians interpret the probabilistic choice rules in (47) as the outcome of utility maximization with random utility components, mathematical psychologists interpret (47) as stochastic choice behavior (see the discussion in Anderson et al., 1992, Chapters 2.4 and 2.5). A more recent literature has derived the multinomial logit from a theory of "rational inattention." Under rational inattention,


the stochastic component of the model captures a consumer's product uncertainty and the costs of endogenously reducing uncertainty through search (e.g., Matejka and McKay, 2015; Joo, 2018).

One approach to rationalize the system (47) is to define the J + 1 alternative as an additional non-market good with price pJ+1 = 0, usually defined as "home production" (e.g., Anderson and de Palma, 1992). We assume the consumer always chooses one of the J + 1 alternatives. In addition, we introduce a divisible, essential numeraire good, z, with price pz = 1, so that the consumer has bivariate utility over the total consumption of the J + 1 goods and over the essential numeraire: $U\left(\sum_{j=1}^{J+1}\psi_j x_j,\ z\right)$. The choice-specific values correspond exactly to (46) and the shock εJ+1 is now interpreted as the random utility from home production. This model differs from the neoclassical models discussed in Sections 3 and 3.2 because we have now included an additional non-market good representing household production. For example, suppose a consumer has utility:
$$U\left(\sum_{j=1}^{J+1}\psi_j x_j,\ z\right) = \left(\sum_{j=1}^{J+1}\psi_j x_j\right)\exp\left(\alpha z\right)$$
where goods j = 1, ..., J + 1 are indivisible, perfect substitutes each with perceived qualities ψj = exp(ψ̄j + εj), where we normalize ψ̄J+1 = 0, and where α is the preference for the numeraire good. In this case, the choice-specific indirect utilities would be (in logs)
$$\begin{aligned}
v_j &= \bar{\psi}_j + \alpha\left(y - p_j\right) + \varepsilon_j, \quad j = 1, \dots, J \\
v_{J+1} &= \alpha y + \varepsilon_{J+1}.
\end{aligned}$$
The MNL was first applied to marketing panel data for individual consumers by Guadagni and Little (1983). The linearity of the conditional indirect utility function explicitly rules out income effects in the substitution patterns between the inside goods. We discuss tractable specifications that allow for income effects in Section 4.1 below. If income is observed, then income effects in the substitution between the commodity group and the essential numeraire can be incorporated by allowing for non-linearity in the utility of the numeraire. For instance, if Ũ(xJ+1) = ψJ+1 ln(xJ+1), then we get the choice probabilities21
$$\Pr(k; \theta) = \frac{\exp\left(\psi_k + \psi_{J+1}\ln\left(y - p_k\right)\right)}{\exp\left(\psi_{J+1}\ln(y)\right) + \sum_{j=1}^{J}\exp\left(\psi_j + \psi_{J+1}\ln\left(y - p_j\right)\right)}.$$


This specification also imposes an affordability condition by excluding any alternative for which pj > y. 
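A minimal sketch of these probabilities, including the affordability condition that drops any unaffordable alternative (the function name and all parameter values are hypothetical):

```python
import math

def mnl_income_effect(psi, prices, y, psi_out=1.0):
    """MNL probabilities with log numeraire utility: Pr(k) ∝ exp(psi_k + psi_out*ln(y - p_k))."""
    # Outside option: allocate the entire budget y to the numeraire (keyed by None)
    weights = {None: math.exp(psi_out * math.log(y))}
    for j, (ps, p) in enumerate(zip(psi, prices)):
        if p < y:  # affordability condition: exclude alternatives the budget cannot cover
            weights[j] = math.exp(ps + psi_out * math.log(y - p))
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# Brand 1's price exceeds the budget, so it drops out of the choice set.
probs = mnl_income_effect(psi=[1.0, 0.6], prices=[2.0, 50.0], y=10.0)
```

Because ln(y − p_k) is concave in the budget, the same price gap matters more to a consumer with little residual budget, which is how income enters the substitution with the numeraire.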

21 We can derive this specification from the assumption of Cobb-Douglas utility: $U\left(x_1, \dots, x_{J+1}; \theta\right) = \exp\left(\sum_{j=1}^{J}\psi_j x_j\right) x_{J+1}^{\psi_{J+1}}$.




The appeal of the MNL's closed-form specification comes at a cost for demand analysis. If Ũ(xJ+1) = ψJ+1 xJ+1, as is often assumed in the literature, the model exhibits the well-known Independence of Irrelevant Alternatives (IIA) property. The IIA property can impose unrealistic substitution patterns in demand analysis. At the individual consumer level, the cross-price elasticity of demand is constant:
$$\frac{\partial \Pr(j)}{\partial p_k}\frac{p_k}{\Pr(j)} = \psi_{J+1}\Pr(k)\, p_k$$
so that substitution patterns between products will be driven by their prices and purchase frequencies, regardless of attributes. Moreover, a given product competes uniformly on price with all other products. One solution is to use a non-IIA specification. For instance, error-components variants of the extreme value distribution, like the nested logit and the generalized extreme value distribution, can relax the IIA property within pre-determined groups of products (e.g., McFadden, 1981; Cardell, 1997).22 If we instead assume that ε ∼ N(0, Σ) with an appropriately scaled covariance matrix Σ, we obtain the multinomial probit (e.g., McCulloch and Rossi, 1994; Goolsbee and Petrin, 2004). Dotson et al. (2018) parameterize the covariance matrix, Σ, using product characteristics to allow for a scalable model with correlated utility errors and, hence, stronger substitution between similar products. When consumer panel data are available, another solution is to use a random coefficients specification that allows for more flexible aggregate substitution patterns (see Chapter 2 of this volume). In their seminal application of the multinomial logit to consumer-level scanner data, Guadagni and Little (1983) estimated demand for the ground coffee category using 78 weeks of transaction data for 2,000 households shopping in four Kansas City supermarkets. Interestingly, they found that brand and pack size were the most predictive attributes for consumer choices.
They also included the promotional variables “feature ad” and “in-aisle display” as additive utility shifters. These variables have routinely been found to be predictive of consumer choices. However, the structural interpretation of a marginal utility from a feature ad or a display is ambiguous. While it is possible that consumers obtain direct consumption value from a newspaper ad or a display, it seems more likely that these effects are the reduced-form of some other process such as information search. Exploring the structural foundations of the “promotion effects” remains a fruitful area for future research.
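The constant cross-price elasticity implied by IIA is easy to verify numerically. Under linear numeraire utility, the elasticity of Pr(j) with respect to p_k equals ψ_{J+1} p_k Pr(k) for every j ≠ k; the sketch below checks this by finite differences, with hypothetical parameter values:

```python
import math

def mnl(psi, prices, y, psi_out):
    """MNL probabilities under linear numeraire utility: v_j = psi_j + psi_out*(y - p_j)."""
    v = [ps + psi_out * (y - p) for ps, p in zip(psi, prices)]
    w = [math.exp(x) for x in v]
    s = sum(w)
    return [x / s for x in w]

psi, prices, y, psi_out = [1.0, 0.5, 0.2], [2.0, 1.8, 1.5], 10.0, 0.4

def cross_elasticity(j, k, h=1e-6):
    """Finite-difference elasticity of Pr(j) with respect to p_k."""
    base = mnl(psi, prices, y, psi_out)
    bumped_prices = list(prices)
    bumped_prices[k] += h
    bumped = mnl(psi, bumped_prices, y, psi_out)
    return (bumped[j] - base[j]) / h * prices[k] / base[j]

# IIA: the cross-elasticity with respect to p_2 is identical for products 0 and 1.
e0 = cross_elasticity(0, 2)
e1 = cross_elasticity(1, 2)
```

Both elasticities match the analytic value ψ_{J+1} p_k Pr(k), independent of which product j is considered.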

22 Misra (2005) shows that the disutility minimization formulation of the multinomial logit (or "reverse logit") leads to a different functional form of the choice probabilities that does not exhibit the IIA property.


4 Some extensions to the typical neoclassical specifications

4.1 Income effects

Most of the empirical specifications discussed earlier imposed regularity conditions that, as a byproduct, impose strong restrictions on the income effects on demand. Since the seminal work by Engel (1857), the income elasticity of demand has been used to classify goods based on consumption behavior. Goods with a positive income elasticity are classified as Normal goods, for which consumers increase their consumption as income increases. Goods with a negative income elasticity are classified as Inferior goods, for which consumers decrease their consumption as income increases. Engel's law is based on the empirical observation that households tend to allocate a higher proportion of their income to food as they become poorer (e.g., Engel, 1857). Accordingly, we define necessity goods and luxury goods based on whether the income elasticity of demand is less than or greater than one. Homothetic preferences restrict all products to be strict Normal goods with an income elasticity of one, thereby limiting the policy implications one can study with the model. Quasilinear preferences over the composite "outside" good restrict the income elasticity to zero, eliminating income effects entirely. When the empirical focus is on a specific product category for a low-priced item like a CPG product, it may be convenient to assume that income effects are likely to be small and inconsequential.23 This assumption is particularly convenient when a household's income or shopping budget is not observed. However, overly restrictive income effects can limit a model's predicted substitution patterns, leading to potentially adverse policy implications (see McFadden's foreword to Anderson et al. (1992) for a discussion). Even when a household's income is static, large changes in relative prices could nevertheless create purchasing-power effects.
Consider the bivariate utility function specification with perfect substitutes in the focal commodity group from Section 3.2: $\tilde{U}\left(\sum_{j=1}^{J}\psi_j x_j,\ x_{J+1}\right)$. For the products in the commodity group, consumers will select the product k where $\frac{p_k}{\psi_k} \leq \frac{p_j}{\psi_j}$ for all j ≠ k. When consumers face the same prices and have homogeneous quality perceptions, they would all be predicted to choose the same product. Changes in a consumer's income would change the relative proportion of income spent on the commodity group and the essential numeraire, but the income change would not affect her choice of product. Therefore, homotheticity may be particularly problematic in vertically differentiated product categories where observed substitution patterns may be asymmetric between products in different quality tiers. For instance, the cross-elasticity of demand for lower-quality products with respect to premium products' prices may be higher than the cross-elasticity of demand for higher-quality products with respect to the lower-quality products' prices (e.g., Blattberg and Wisniewski, 1989; Pauwels et al., 2007).
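Under perfect substitutes, every consumer facing the same prices and quality perceptions picks the product that minimizes p_j/ψ_j, regardless of income. A few lines make the restriction concrete (all values hypothetical):

```python
def chosen_product(prices, psi):
    """Return the index of the product minimizing the quality-adjusted price p_j / psi_j."""
    ratios = [p / q for p, q in zip(prices, psi)]
    return min(range(len(prices)), key=lambda j: ratios[j])

# Income never enters: only the split between commodity group and numeraire
# responds to the budget, not the identity of the chosen product.
choice = chosen_product(prices=[2.0, 3.0, 1.0], psi=[1.0, 2.0, 0.4])
```

Scaling all prices proportionally leaves the chosen product unchanged, which is precisely the homotheticity restriction criticized in the text.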

23 Income effects are typically incorporated into demand analyses of high-priced, durable consumption goods like automobiles (e.g., Berry et al., 1995).




Similarly, Deaton and Muellbauer (1980b, p. 262) observed a cross-sectional income effect: “richer households systematically tend to buy different qualities than do poorer ones.” Gicheva et al. (2010) found a cross-time income effect by showing that lower-income households responded to higher gasoline prices by substituting their grocery purchases towards promotional-priced items, which could be consistent with asymmetric switching patterns if lower-quality items are more likely to be promoted. Similarly, Ma et al. (2011) found that households respond to increases in gasoline prices by substituting from national brands to lower-priced brands and to unadvertised own brands supplied by retailers, or “private labels.” These substitution patterns suggest that national brands are normal goods.24

4.1.1 A non-homothetic discrete choice model

Given the widespread use of pure discrete choice models, like the logit and probit, we now discuss how to incorporate income effects into these models without losing their empirical tractability. To relax the homotheticity property in the simple re-packaging model with perfect substitutes from Section 3.2 above, Deaton and Muellbauer (1980a) and Allenby and Rossi (1991) introduce rotations into the system of linear indifference curves by defining utility implicitly:
$$U(x; \theta, \varepsilon) = \tilde{U}\left(\sum_{j}\psi_j\left(\bar{U}, \varepsilon\right)x_j,\ x_{J+1}\right). \tag{49}$$
The marginal utilities in this specification vary with the level of total attainable utility at the current prices, Ū. If we interpret ψ(Ū, ε) as "perceived quality," then we allow the marginal value of perceived quality to vary with the level of total attainable utility. To ensure the marginal utilities are positive, Allenby and Rossi (1991) and Allenby et al. (2010) use the empirical specification
$$\psi_j\left(\bar{U}, \varepsilon\right) = \exp\left(\theta_{j0} - \theta_{j1}U(x; \theta) + \varepsilon_j\right)$$
where εj is a random utility shock as before. If θj1 > 0, then utility is increasing and concave. The model nests the usual homothetic specification when θj1 = 0 for each product j. To see that the parameters θj1 also capture differences in perceived quality, consider the relative marginal utilities:
$$\frac{\psi_k\left(\bar{U}\right)}{\psi_j\left(\bar{U}\right)} = \exp\left(\theta_{k0} - \theta_{j0} + \left(\theta_{j1} - \theta_{k1}\right)U(x; \theta) + \varepsilon_k - \varepsilon_j\right).$$
The relative perceived quality of product k increases with the level of attainable utility, Ū, so long as θk1 < θj1, and so k would be perceived as superior to j. The identification of θk0 comes from the average propensity to purchase product k whereas the

24 However, Dubé et al. (2017a) find highly income-inelastic demand for private-label CPGs, identified off the large household income shocks during the Great Recession.


identification of θk1 comes from the substitution towards k in response to changes in purchasing power, either through budgetary changes to y or through changes in the overall price level. Demand estimation is analogous to the discrete/continuous case in Section 3.2, except for the additional calculation of Ū. Consider the example of pure discrete choice with perfect substitutes and Cobb-Douglas utility as in Allenby et al. (2010):
$$U(x; \theta, \varepsilon) = \ln\left(\sum_{j}\psi_j\left(\bar{U}, \varepsilon\right)x_j\right) + \psi_{J+1}\ln\left(x_{J+1}\right)$$
where xj ∈ {0, 1} for j = 1, ..., J. The consumer chooses between a single unit of one of the j = 1, ..., J products or the J + 1 option of allocating the entire budget to the outside good, with the following probabilities:
$$\Pr\left(x_k = 1; \theta\right) = \frac{\exp\left(\theta_{k0} - \theta_{k1}\bar{U}_k + \psi_{J+1}\ln\left(y - p_k\right)\right)}{1 + \sum_{\left\{j\,|\,j \leq J \text{ and } p_j \leq y\right\}}\exp\left(\theta_{j0} - \theta_{j1}\bar{U}_j + \psi_{J+1}\ln\left(y - p_j\right)\right)}$$
where Ūk is solved numerically as the solution to the implicit equation
$$\ln \bar{U}_k = \theta_{k0} - \theta_{k1}\bar{U}_k + \psi_{J+1}\ln\left(y - p_k\right). \tag{50}$$
Maximum likelihood estimation will therefore nest the fixed-point calculation of (50) at each stage of the parameter search. In their empirical case study of margarine purchases, Allenby and Rossi (1991) find that the demand for generic margarine is considerably more elastic in the price of the leading national brand than vice versa. This finding is consistent with the earlier descriptive findings regarding asymmetric substitution patterns, the key motivating fact for the non-homothetic specification. Allenby et al. (2010) project the brand intercepts and utility rotation parameters, θj0 and θj1, respectively, onto advertising to allow the firms' marketing efforts to influence the perceived superiority of their respective brands. In an application to data from a choice-based conjoint survey with a randomized advertising treatment, they find that ads change the substitution patterns in the category by causing consumers to allocate more spending to higher-quality goods.
Maximum likelihood estimation will therefore nest the fixed-point calculation to (50) at each stage of the parameter search. In their empirical case study of margarine purchases, Allenby and Rossi (1991) find that the demand for generic margarine is considerably more elastic in the price of the leading national brand than vice versa. This finding is consistent with the earlier descriptive findings regarding asymmetric substitution patterns, the key motivating fact for the non-homothetic specification. Allenby et al. (2010) project the brand intercepts and utility rotation parameters, θj 0 and θj 1 , respectively, onto advertising to allow the firms’ marketing efforts to influence the perceived superiority of their respective brands. In an application to survey data from a choice-based conjoint survey with a randomized advertising treatment, they find that ads change the substitution patterns in the category by causing consumers to allocate more spending to higherquality goods.

4.2 Complementary goods

The determination of demand complementarity figured prominently in the consumption literature (see the survey by Houthakker, 1961). But the microeconometric literature tackling demand with corner solutions has frequently used additive models that explicitly rule out complementarity and assume products are strict substitutes (Deaton and Muellbauer, 1980b, pp. 138-139). For many product categories, such as laundry detergents, ketchups, and refrigerated orange juice, the assumption of strict substitutability seems reasonable for most consumers. However, in other product categories where consumers purchase large assortments of flavors or variants, such as yogurt, carbonated soft drinks, beer, and breakfast cereals, complementarity may be an important part of choices. For a shopping basket model that accounts for the wide array of goods, complementarity seems quite plausible between broader commodity groups (e.g., pasta and pasta sauce, or cake mix and frosting).

Economists historically defined complementarity based on the supermodularity of the utility function and the increasing differences in utility associated with joint consumption. Samuelson (1974) provides a comprehensive overview of the arguments against such approaches based on the cardinality of utility. Chambers and Echenique (2009) formally prove that supermodularity is not testable with data on consumption expenditures. Accordingly, most current empirical research defines complementarity based on demand behavior, rather than as a primitive assumption about preferences.25 Perhaps the most widely-cited definition of complementarity comes from Hicks and Allen (1934), using compensated demand:

Definition 1. We say that goods j and k are complements if an increase in the price of j leads to a decrease in the compensated demand for good k; substitutes if an increase in the price of j leads to an increase in the compensated demand for good k; and independent if an increase in the price of j has no effect on the compensated demand for good k.

This definition has several advantages, including symmetry and the applicability to any number of goods. However, compensated demand is unlikely to be observed in practice. Most empirical research tests for gross complementarity, testing for the positivity of the cross-derivatives of Marshallian demands with respect to prices. The linear indifference curves used in most pure discrete choice models eliminate any income effects, making the two definitions equivalent.
A recent literature has worked on establishing the conditions under which an empirical test for complementarity is identified with standard consumer purchase data (e.g., Samuelson, 1974; Gentzkow, 2007; Chambers et al., 2010). The definition of complementarity based on the cross-price effects on demand can be problematic in the presence of multiple goods. Samuelson (1974, p. 1255) provides the following example: ... sometimes I like tea and cream... I also sometimes take cream with my coffee. Before you agree that cream is therefore a complement to both tea and coffee, I should mention that I take much less cream in my cup of coffee than I do in my cup of tea. Therefore, a reduction in the price of coffee may reduce my demand for cream, which is an odd thing to happen between so-called complements.

25 An exception is Lee et al. (2013) who use a traditional definition of complementarity based on the sign of the cross-partial derivative of utility.


To see how this could affect a microeconometric test, consider the model of bivariate utility over a commodity group defined as products in the coffee and cream categories, and an essential numeraire that aggregates expenditures on all other goods (including tea). Even with flexible substitution patterns between coffee and cream, empirical analysis could potentially produce a positive estimate of the cross-price elasticity of demand for cream with respect to the price of coffee if cream is more complementary with tea than with coffee. On the one hand, this argument highlights the importance of multi-category models, like the ones we will discuss in Section 4.2.2 below, that consider both the intra-category and inter-category patterns of substitution. For instance, one might specify a multivariate utility over all the beverage-related categories and the products within each of the categories. The multi-category model would characterize all the direct and indirect substitution patterns between goods (Ogaki, 1990). On the other hand, a multi-category model increases the technical and computational burden of demand estimation dramatically. As discussed in Gentzkow (2007, p. 720), the estimated quantities in the narrower, single-category specification should be interpreted as "conditional on the set of alternative goods available in the market." The corresponding estimates will still be correct for many marketing applications, such as the evaluation of the marginal profits of a pricing or promotional decision. The estimates will be problematic if there is a lot of variation in the composition of the numeraire that, in turn, changes the specification of utility for the commodity group of interest. Our discussion herein focuses on static theories of complementarity. While several of the models in Section 3 allow for complementarity, the literature has been surprisingly silent on the identification strategies for testing complementarity.
An exception is Gentzkow (2007), which we discuss in more detail below. A burgeoning literature has also studied the complementarities that arise over time in durable goods markets with inter-dependent demands and indirect network effects. These “platform markets” include such examples as the classic “razors & blades” and “hardware & software” cases (e.g., Ohashi, 2003; Nair et al., 2004; Hartmann and Nair, 2010; Lee, 2013; Howell and Allenby, 2017).

4.2.1 Complementarity between products within a commodity group

In most marketing models of consumer demand, products within a commodity group are assumed to be substitutes. When a single product in the commodity group is purchased on a typical trip, the perfect substitutes specification is used (see Sections 3.2 and 3.3). However, even when multiple products are purchased on a given trip, additive models that imply products are strict substitutes are still used (see for instance the translated CES model in Section 3.1.4). Even though some consumer goods are purchased jointly, they are typically assumed to be consumed separately. There are of course exceptions. The ability to offer specific bundles of varieties of beverage flavors or beer brands could have a complementary benefit when a consumer is entertaining guests. Outside the CPG domain, Gentzkow (2007) analyzed the potential complementarities of jointly consuming digital and print versions of news.




To incorporate the definition of complementary goods into our demand framework, we begin with a discrete quantity choice model of utility over j = 1, ..., J goods in a commodity group, where xj ∈ {0, 1}, and a J + 1 essential numeraire. The goal consists of testing for complementarity between the goods within a given commodity group. The discussion herein closely follows Gentzkow (2007). We index all the possible commodity-group bundles the consumer could potentially purchase as c ∈ P({1, ..., J}), using c = 0 to denote the allocation of her entire budget to the numeraire. The consumer obtains the following choice-specific utility, normalized by u0:
$$u_c = \begin{cases}\sum_{j\in c}\left(\psi_j - \alpha p_j + \varepsilon_j\right) + \frac{1}{2}\sum_{j\in c}\sum_{k\in c,\,k\neq j}\Gamma_{jk}, & \text{if } c \in \mathcal{P}\left(\{1, \dots, J\}\right) \\ 0, & \text{if } c = 0\end{cases} \tag{51}$$
where Γ is symmetric and P({1, ..., J}) is the power set of the j = 1, ..., J products. Assume that ε ∼ N(0, Σ). To simplify the discussion, suppose that the commodity group comprises only two goods, j and k. The choice probabilities are then
$$\begin{aligned}
\Pr(j) &= \int_{\left\{\varepsilon\,|\,u_j \geq 0,\ u_j \geq u_k,\ u_j \geq u_{jk}\right\}} dF(\varepsilon) \\
\Pr(k) &= \int_{\left\{\varepsilon\,|\,u_k \geq 0,\ u_k \geq u_j,\ u_k \geq u_{jk}\right\}} dF(\varepsilon) \\
\Pr(jk) &= \int_{\left\{\varepsilon\,|\,u_{jk} \geq 0,\ u_{jk} \geq u_j,\ u_{jk} \geq u_k\right\}} dF(\varepsilon).
\end{aligned}$$
Finally, the expected consumer demand can be computed as follows: xj = Pr(j) + Pr(jk) and xk = Pr(k) + Pr(jk). It is straightforward to show that an empirical test of complementarity between two goods, j and k, reduces to the sign of the corresponding jk elements of Γ. An increase in the price pk has two effects on demand xj. First, marginal consumers who would not buy the bundle but who were indifferent between buying only j or only k alone will switch to j. At the same time, however, marginal consumers who would not buy only j or only k, and who are indifferent between buying the bundle or not, will switch to non-purchase. More formally,
$$\frac{\partial x_j}{\partial p_k} = \frac{\partial \Pr(j)}{\partial p_k} + \frac{\partial \Pr(jk)}{\partial p_k} = \int_{\left\{\varepsilon\,|\,u_j = u_k,\ u_k \geq 0,\ -\Gamma_{kj} \geq u_j\right\}} dF(\varepsilon) - \int_{\left\{\varepsilon\,|\,u_j + u_k = -\Gamma_{jk},\ u_j \leq 0,\ u_k \leq 0\right\}} dF(\varepsilon).$$
We can see that our test for complementarity is determined by the sign of Γjk:
$$\begin{aligned}
\Gamma_{jk} > 0 &\Rightarrow \frac{\partial x_j}{\partial p_k} < 0 \text{ and } j \text{ and } k \text{ are complements} \\
\Gamma_{jk} = 0 &\Rightarrow \frac{\partial x_j}{\partial p_k} = 0 \text{ and } j \text{ and } k \text{ are independent} \\
\Gamma_{jk} < 0 &\Rightarrow \frac{\partial x_j}{\partial p_k} > 0 \text{ and } j \text{ and } k \text{ are substitutes.}
\end{aligned}$$
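A Monte Carlo sketch of the two-good bundle model in (51), writing the utility interaction parameter as `gamma` (all other names and numeric values are hypothetical): with a positive interaction, raising p_k lowers the expected demand for j (complements); with a negative interaction, it raises it (substitutes).

```python
import random

def expected_demand_j(p_j, p_k, gamma, psi=1.0, alpha=1.0, n=100_000, seed=7):
    """Simulate x_j = Pr(j) + Pr(jk) for the two-good bundle model with normal shocks."""
    rng = random.Random(seed)
    buys_j = 0
    for _ in range(n):
        ej, ek = rng.gauss(0, 1), rng.gauss(0, 1)
        u_j = psi - alpha * p_j + ej      # buy j alone
        u_k = psi - alpha * p_k + ek      # buy k alone
        u_jk = u_j + u_k + gamma          # buy the bundle
        top = max(0.0, u_j, u_k, u_jk)    # 0.0 = allocate entire budget to numeraire
        if top > 0.0 and (top == u_j or top == u_jk):
            buys_j += 1
    return buys_j / n

# Complements (gamma > 0): demand for j falls when p_k rises.
x_base = expected_demand_j(1.0, 1.0, gamma=1.0)
x_bumped = expected_demand_j(1.0, 1.5, gamma=1.0)
```

Using the same seed for both evaluations (common random numbers) keeps the simulated price effect from being swamped by Monte Carlo noise.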


Gentzkow (2007) provides a practical discussion of the identification challenges associated with Γ_jk, even for this stylized discrete choice demand system. At first glance, (51) looks like a standard discrete choice model where each of the possible permutations of products has been modeled as a separate choice.26 But, the correlated error structure in ε plays an important role in the identification of the complementarity. The key moment for the identification of Γ_jk is the incidence of joint purchase of products j and k, Pr(jk). But, a high Pr(jk) could arise either through a high value of Γ_jk or a high value of cov(ε_j, ε_k). A restricted covariance structure like logit, which sets cov(ε_j, ε_k) = 0, will be forced to attribute a high Pr(jk) to complementarity. An ideal instrument for testing complementarity would satisfy an exclusion restriction. Consider for instance a variable z_j that shifts the mean utility for j but does not affect Γ_jk or the mean utility of good k. In the CPG context, access to high-frequency price variation in all the observed products as well as point-of-purchase promotional variables is ideal for this purpose. The identification of Γ_jk could then reflect the extent to which changes in z_j affect demand x_k.
Panel data can also be exploited to identify Γ_jk and cov(ε_j, ε_k). Following the conventions in the literature allowing for persistent, between-consumer heterogeneity, we could let ε be persistent, consumer-specific "random effects." We could then also include i.i.d. shocks that vary across time and product to explain within-consumer switches in behavior. If joint purchase reflects cov(ε_j, ε_k), we would expect to see some consumers frequently buying both goods and other consumers frequently buying neither. But, conditional on a consumer's average propensity to purchase either good, the cross-time variation in choices should be uncorrelated. However, if joint purchase reflects Γ_jk, we would then expect more correlation over time whereby a consumer would either purchase both goods or neither, but would seldom purchase only one of the two.
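This identification problem can be illustrated with a small simulation (all parameter values below are hypothetical, not taken from Gentzkow, 2007): both a positive complementarity term Γ and positively correlated error draws inflate the joint-purchase rate relative to an independent baseline, so the joint-purchase moment alone cannot separate the two mechanisms.

```python
import numpy as np

def joint_purchase_rate(gamma, rho, delta=-0.5, n=200_000, seed=0):
    """Share of consumers buying both j and k in a simple bundle-choice model.

    Utilities of the four bundles {none, j, k, jk}:
      u_0  = 0
      u_j  = delta + eps_j
      u_k  = delta + eps_k
      u_jk = 2*delta + gamma + eps_j + eps_k
    gamma is the complementarity term; rho is corr(eps_j, eps_k).
    """
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    eps = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    u0 = np.zeros(n)
    uj = delta + eps[:, 0]
    uk = delta + eps[:, 1]
    ujk = 2 * delta + gamma + eps[:, 0] + eps[:, 1]
    choice = np.argmax(np.column_stack([u0, uj, uk, ujk]), axis=1)
    return np.mean(choice == 3)

base = joint_purchase_rate(gamma=0.0, rho=0.0)   # independent goods
comp = joint_purchase_rate(gamma=1.0, rho=0.0)   # true complementarity
corr = joint_purchase_rate(gamma=0.0, rho=0.8)   # correlated tastes only
# Both mechanisms raise the joint-purchase rate relative to the baseline,
# which is why an exclusion restriction or panel variation is needed.
```

Both `comp` and `corr` exceed `base` by a wide margin, mimicking how a restricted covariance structure would misattribute correlated tastes to complementarity.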

4.2.2 Complementarity between commodity groups (multi-category models)

In the analysis of CPG data, most of the emphasis on complementarity has been between product categories, where products within a commodity group are perceived as substitutes but different commodity groups may be perceived as complements. Typically, such cross-category models have been specified using probabilistic choice models without a microeconomic foundation that allow for correlated errors either in the random utility shocks (e.g., Manchanda et al., 1999; Chib et al., 2002) or in the persistent heteroskedastic shocks associated with random coefficients (e.g., Ainslie and Rossi, 1998; Erdem, 1998). For an overview of these models, see the discussion in Seetharaman et al. (2005). The lack of a microfoundation complicates the ability to assign substantive interpretations to model parameters. For instance, the identification discussion in the previous section clarifies the fundamental distinction

26 For instance, Manski and Sherman (1980) and Train et al. (1987) use logit and nested logit specifications that restrict the covariance patterns in ε.



CHAPTER 1 Microeconometric models of consumer demand

between correlated tastes (as in these multi-category probabilistic models) and true product complementarity. At least since Song and Chintagunta (2007), the empirical literature has used microfounded demand systems to accommodate complementarity and substitutability in the analysis of the composition of household shopping baskets spanning many categories during a shopping trip. Conceptually, it is straightforward to extend the general, non-additive frameworks in Section 3 to many commodity groups. For instance, Bhat et al. (2015) introduce potential complementarity into the translated CES specification (see Section 3.1.4) by relaxing additivity and allowing for interaction effects.27 In their study of the pro-competitive effects of multiproduct grocery stores, Thomassen et al. (2017) use a quadratic utility model that allows for gross complementarity.28 Mehta (2015) uses the indirect translog utility approximation to derive a multi-category model that allows for complementarities. The direct application of the models in Section 3 and the extensions just discussed is limited by the escalation in parameters and the dimension of numerical integration, both of which grow with the number of products studied. Typically, researchers have either focused their analysis on a small set of product alternatives within a commodity group29 or have focused their analysis on aggregate expenditure behavior across categories, collapsing each category into an aggregated composite good.30 As we discuss in the next subsection, additional restrictions on preferences have been required to accommodate product-level demand analysis across categories.

Example: Perfect substitutes within a commodity group

Suppose the consumer makes purchase decisions across m = 1, ..., M commodity groups, each containing j = 1, ..., J_m products. The consumer has a weakly separable, multivariate utility function over each of the M commodity groups and an (M+1)st essential numeraire good with price p_{M+1} = 1. Within each category, the consumer has perfect-substitutes sub-utility over the products, giving consumer utility

$$\tilde U\left(\sum_{j=1}^{J_1}\psi_{1j}x_{1j},\ \ldots,\ \sum_{j=1}^{J_M}\psi_{Mj}x_{Mj},\ \psi_{M+1}x_{M+1}\right)$$

27 This specification does not ensure that global regularity is satisfied, which could limit the ability to conduct counterfactual predictions with the model.
28 Empirically, they find that positive cross-price elasticities between grocery categories within a store are driven more by shopping costs associated with store choice than by intrinsic complementarities based on substitution patterns between categories.
29 For instance, Wales and Woodland (1983) allow for J = 3 alternatives of meat: beef, lamb, and other meat, and Kim et al. (2002) allow for J = 6 brand alternatives of yogurt.
30 For instance, Kao et al. (2001) look at expenditures across J = 7 food commodity groups, Mehta (2015) looks at trip-level expenditures across J = 4 supermarket categories, and Thomassen et al. (2017) look at trip-level expenditures across J = 8 supermarket product categories.


and budget constraint

$$\sum_{m=1}^{M}\sum_{j=1}^{J_m} p_{mj}x_{mj} + x_{M+1} \le y.$$

The utility function is a generalization of the discrete choice specification in Section 3.2 to many commodity groups. At most one product will be chosen in each of the M commodity groups. As before, ψ_mj ≥ 0 and U˜(·) is a continuously differentiable, quasi-concave function that is increasing in each of its arguments. We also assume additional regularity conditions to ensure that an interior quantity of the essential numeraire is always purchased, x*_{M+1} > 0. This approach with perfect substitutes within a category has been used in several studies (e.g., Song and Chintagunta, 2007; Mehta, 2007; Lee and Allenby, 2009; Mehta and Ma, 2012). Most of the differences across studies are based on the assumptions regarding the multivariate utility function U˜(x). Lee and Allenby (2009) use a primal approach that specifies a quadratic utility over the commodity groups

$$\tilde U\left(u\left(x_1;\psi_1\right),\ldots,u\left(x_M;\psi_M\right),\psi_{M+1}x_{M+1}\right) = \sum_{m=1}^{M+1}\beta_{m0}\,u\left(x_m;\psi_m\right) - \frac{1}{2}\sum_{m=1}^{M+1}\sum_{n=1}^{M+1}\beta_{mn}\,u\left(x_m;\psi_m\right)u\left(x_n;\psi_n\right) \qquad (53)$$

where

$$u\left(x_m;\psi_m\right) = \sum_{j=1}^{J_m}\psi_{mj}x_{mj}$$

and ψ_mj = exp(ψ̄_mj + ε_mj), where ψ̄_{M+1} = 0, we normalize β_10, and we assume symmetry such that β_mn = β_nm for m, n = 1, ..., M+1. The KKT conditions associated with the maximization of the utility function (53) are now as follows:

$$\tilde\varepsilon_{mj} = h_{mj}\left(x^*;\psi\right), \quad \text{if } x^*_{mj} > 0$$
$$\tilde\varepsilon_{mj} \le h_{mj}\left(x^*;\psi\right), \quad \text{if } x^*_{mj} = 0$$

where ε̃_mj = ε_mj − ε_{M+1} and

$$h_{mj}\left(x^*;\psi\right) = -\ln\left[\frac{\bar\psi_{mj}}{p_{mj}}\left(\beta_{m0} - \sum_{n=1}^{M+1}\beta_{mn}\,u\left(x_n^*;\psi_n\right)\right)\right]$$

Lee and Allenby (2009) do not impose additional parameter restrictions to ensure the utility function is quasi-concave, a sufficient condition for the coherency of the likelihood function. Instead, they set the likelihood deterministically to zero at any support point where either the marginal utilities are negative or the utility function fails quasi-concavity.31 While this approach may ensure coherency, it will not ensure that global regularity is satisfied, which could limit the ability to conduct counterfactual predictions with the model. While products in the same commodity group are assumed to be perfect substitutes, the utility function (53) allows for gross complementarity between a pair of commodity groups, m and n, through the sign of the parameter β_mn. In their empirical application, they find gross complementarity between laundry detergents and fabric softeners, which conforms with their intuition. All other pairs of categories studied are found to be substitutes.

Song and Chintagunta (2007), Mehta (2007), and Mehta and Ma (2012) use a dual approach that specifies a translog approximation of the indirect utility function. For simplicity of presentation, we use the homothetic translog specification from Song and Chintagunta (2007)32

$$V\left(p,y;\theta,\varepsilon\right) = \ln(y) - \sum_{m=1}^{M+1}\theta_m\ln\left(\frac{p_{mj_m}}{\psi_{mj_m}}\right) + \frac{1}{2}\sum_{m=1}^{M+1}\sum_{n=1}^{M+1}\theta_{mn}\ln\left(\frac{p_{mj_m}}{\psi_{mj_m}}\right)\ln\left(\frac{p_{nj_n}}{\psi_{nj_n}}\right) - \sum_{m=1}^{M+1}\varepsilon_{j_m}\ln\left(\frac{p_{mj_m}}{\psi_{mj_m}}\right) \qquad (54)$$

where for each commodity group m, product j_m satisfies ψ_{mj_m}/p_{mj_m} ≥ ψ_{mj}/p_{mj}, ∀j ≠ j_m. To ensure the coherency of the model, the following parameter restrictions are imposed:

$$\sum_{m=1}^{M+1}\theta_m = 1, \qquad \theta_{mn} = \theta_{nm},\ \forall m,n, \qquad \sum_{m=1}^{M+1}\theta_{mn} = 0,\ \forall n.$$

Applying Roy's identity, we derive the following conditional expenditure shares

$$s_{mj_m}\left(p,y;\theta,\varepsilon\right) = \theta_m - \sum_{n=1}^{M+1}\theta_{mn}\ln\left(\frac{p_{nj_n}}{\psi_{nj_n}}\right) + \varepsilon_{j_m}.$$


31 The presence of indicator functions in the likelihood creates discontinuities that could be problematic for maximum likelihood estimation. The authors avoid this problem by using a Bayesian estimator that does not rely on the score of the likelihood.
32 Mehta and Ma (2012) use a non-homothetic translog approximation which generates a more complicated expenditure share system, but which allows for more flexible income effects.


Since the homothetic translog approximation in (54) eliminates income effects from the expenditure shares, a test for complementarity between a pair of categories m and n amounts to testing the sign of θ_mn. In particular, conditional on the chosen products in categories m and n, j_m and j_n, respectively, complementarity is identified off the changes in s_{mj_m} due to the quality-adjusted price, p_{nj_n}/ψ_{nj_n}. Hence, several factors can potentially serve as instruments to test complementarity. Changes in the price p_{nj_n} are an obvious source. In addition, if the perceived quality ψ_{nj_n} is projected onto observable characteristics of product j_n, then independent characteristic variation in product j_n can also be used to identify the complementarity. The switching conditions will be important to account for variation in the identity of the optimal product j_n. A limitation of this specification is that any complementarity only affects the intensive quantity margin and does not affect the extensive brand choice and purchase incidence margins. Song and Chintagunta (2007) do not detect evidence of complementarities in their empirical application, which may be an artifact of the restricted way in which complementarity enters the model. Mehta and Ma (2012) use a non-homothetic translog approximation that allows for complementarity in purchase incidence as well as the expenditure shares. In their empirical application, they find strong complementarities between the pasta and pasta sauces categories. These findings suggest that the retailer should be coordinating prices across the two categories and synchronizing the timing of promotional discounts.
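The adding-up implication of the homotheticity restrictions can be verified numerically. The sketch below uses hypothetical parameter values (the `shares` helper is illustrative, not from the cited papers): with the θ_m summing to one and the columns of the symmetric θ_mn matrix summing to zero, the implied shares sum to one at any quality-adjusted price vector, and the cross-price response of s_m to p_n is governed by −θ_mn.

```python
import numpy as np

# Hypothetical parameters for M = 2 commodity groups plus the numeraire.
theta = np.array([0.5, 0.3, 0.2])            # sum_m theta_m = 1
# Symmetric interaction matrix whose columns sum to zero.
theta_mn = np.array([[ 0.20, -0.15, -0.05],
                     [-0.15,  0.25, -0.10],
                     [-0.05, -0.10,  0.15]])
assert np.allclose(theta_mn, theta_mn.T)          # symmetry restriction
assert np.allclose(theta_mn.sum(axis=0), 0.0)     # adding-up restriction

def shares(prices, psi):
    """Conditional shares s_m = theta_m - sum_n theta_mn * ln(p_n / psi_n), eps set to 0."""
    q = np.log(np.asarray(prices) / np.asarray(psi))  # quality-adjusted log prices
    return theta - theta_mn @ q

s = shares([1.2, 0.8, 1.0], [1.0, 1.1, 1.0])
# The restrictions guarantee the shares sum to one at any price vector.
```

Raising the second group's price moves the first group's share by −θ_12 times the log-price change, which is the cross-effect whose sign the complementarity test examines.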

4.3 Discrete package sizes and non-linear pricing

In many consumer goods product categories, product quantities are restricted to the available package sizes. For instance, a customer must choose between specific pre-packaged quantities of liquid laundry detergent (e.g., 32 oz, 64 oz, or 128 oz) and cannot purchase an arbitrary, continuous quantity. Early empirical work focused on brand choices, either narrowing the choice set to a specific pack size or collapsing all the pack sizes into a composite brand choice alternative. However, these models ignore the intensive quantity margin and limit the scope of their applicability to decision-making on the supply side. Firms typically offer an array of pre-packaged sizes as a form of commodity bundling, or "non-linear pricing." In practice, we expect to see quantity discounts whereby the consumer pays a lower price-per-unit when she buys the larger pack size, consistent with standard second-degree price discrimination (e.g., Varian, 1989; Dolan, 1987). However, several studies have documented cases where firms use quantity surcharges, raising the price-per-unit on larger pack sizes (e.g., Joo, 2018). The presence of non-linear pricing introduces several challenges into our neoclassical models of demand (e.g., Howell et al., 2016; Reiss and White, 2001). First, any kinks in the pricing schedule will invalidate the use of the Kuhn-Tucker conditions.33 Second, the dual approach that derives demand using Roy's identity is

33 See Lambrecht et al. (2007) and Yao et al. (2012) for the analysis of demand under three-part mobile tariffs.




invalidated by non-linear pricing because Roy's identity only holds under a constant marginal price for any given product. An exception is the case of piecewise-linear budget sets (e.g., Hausman, 1985; Howell et al., 2016). Third, the price paid per unit of a good depends on a consumer's endogenous quantity choice, creating a potential self-selection problem in addition to the usual non-negativity problem. To see this potential source of bias, note that the consumer's budget constraint is Σ_j p_j(x_j)·x_j ≤ y, so the price paid is endogenous as it will depend on unobservable (to the researcher) aspects of the quantity demanded by the consumer.

4.3.1 Expand the choice set

One simple and popular modeling approach expands the choice set to include all available combinations of brands and pack sizes (e.g., Guadagni and Little, 1983). A separate random utility shock is then added to each choice alternative. Suppose the consumer makes choices over the j = 1, ..., J products in a commodity group, where each product is available in a finite number of pre-packaged sizes, a ∈ A_j. If the consumer has additive preferences and the j = 1, ..., J products are perfect substitutes, U(x, x_{J+1}) = u_1(Σ_{j=1}^J ψ_j x_j) + u_2(x_{J+1}), her choice-specific indirect utilities are

$$v_{ja} = u_1\left(\psi_j x_{ja}\right) + u_2\left(y-p_{ja}\right) + \varepsilon_{ja}, \quad j = 1,\ldots,J,\ a\in A_j \qquad (56)$$
$$v_{J+1} = u_2(y) + \varepsilon_{J+1}$$

where ε_ja ∼ i.i.d. EV(0,1), which allows for random perceived utility over the pack size variants of a given product. The probability of choosing pack size a of product k is then

$$\Pr\left(ka;\theta\right) = \frac{\exp\left(u_1\left(\psi_k x_{ka}\right)+u_2\left(y-p_{ka}\right)\right)}{\exp\left(u_2(y)\right)+\sum_{j=1}^{J}\sum_{a'\in A_j}\exp\left(u_1\left(\psi_j x_{ja'}\right)+u_2\left(y-p_{ja'}\right)\right)}. \qquad (57)$$

To see how one might implement this model in practice, assume the consumer has Cobb-Douglas utility. In this case, u_1(ψ_k x_ka) = α_1 ln ψ̄_k + α_1 ln(x_ka), where α_1 is the satiation rate over the commodity group. Conceptually, this model could be expanded even further to allow the consumer to purchase bundles of the products to configure all possible quantities that are feasible within the budget constraint. An important limitation of this specification is that it assigns independent random utility to each pack size of the otherwise identical product. This assumption would make sense if, for instance, a large 64-oz plastic bottle of soda is fundamentally different than a small, 12-oz aluminum can of the same soda. In other settings, we might expect a high correlation in the random utility between two pack-size variants of an otherwise identical product (e.g., 6-packs versus 12-packs of aluminum cans of soda). Specifications that allow for such within-brand correlation, such as nested logit, generalized extreme value, or even multinomial probit, could work. But, in a


setting with many product alternatives, it may not be possible to estimate the full covariance structure between each of the product and size combinations. In some settings, the temporal separation between choices can be used to simplify the problem. For instance, Goettler and Clay (2011) and Narayanan et al. (2007) study consumers' discrete/continuous choices of pricing plan and usage: providers of mobile data and voice services typically offer consumers choices between pricing plans that differ in their convexity. In practice, we might not expect the consumer to derive marginal utility from the convexity of the pricing plan itself, seemingly rendering the pricing plan choice deterministic. But, if the consumer makes a discrete choice between pricing plans in expectation of future usage choices, the expectation errors can be used as econometric uncertainty in the discrete choice between plans.
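Returning to the expanded choice set in (56)-(57), the choice probabilities are straightforward to compute under the Cobb-Douglas assumption mentioned above. The sketch below uses entirely hypothetical brands, pack sizes, prices, and taste parameters:

```python
import numpy as np

alpha1, alpha2 = 0.6, 0.4          # satiation and numeraire weights (hypothetical)
psi_bar = {"A": 1.2, "B": 1.0}     # perceived-quality intercepts (hypothetical)
# Pack sizes (oz) and prices for each brand (hypothetical).
packs = {"A": [(32, 2.49), (64, 3.99)], "B": [(32, 2.29), (64, 3.79), (128, 5.99)]}
y = 20.0                           # budget

def choice_probs():
    """Multinomial logit over every (brand, size) pair plus the no-purchase option."""
    labels, v = ["no purchase"], [alpha2 * np.log(y)]  # outside option: u2(y)
    for j, sizes in packs.items():
        for x, p in sizes:
            # u1(psi_j x) + u2(y - p) with Cobb-Douglas sub-utilities
            labels.append(f"{j}-{x}oz")
            v.append(alpha1 * np.log(psi_bar[j]) + alpha1 * np.log(x)
                     + alpha2 * np.log(y - p))
    v = np.array(v)
    pr = np.exp(v - v.max())        # subtract max for numerical stability
    return dict(zip(labels, pr / pr.sum()))

probs = choice_probs()
```

Each (brand, size) pair carries its own independent shock, which is exactly the limitation discussed above.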

4.3.2 Models of pack size choice

Allenby et al. (2004) use the following Cobb-Douglas utility specification

$$U\left(x,x_{J+1}\right) = \sum_{j=1}^{J}\sum_{a\in A_j}\left[\alpha_1\ln\left(\psi_j x_{ja}\right) + \alpha_2\ln\left(y-p_{ja}\right)\right] \qquad (58)$$

where ψ_j = exp(ψ̄_j + ε_j) and ε_j ∼ i.i.d. EV(0,1). In this specification, the utilities of each of the pack sizes for a given product are perfectly correlated. The corresponding optimal pack size choice for a given product j is deterministic,

$$a_j^* = \arg\max_{a\in A_j}\left\{\alpha_1\ln\left(x_{ja}\right) + \alpha_2\ln\left(y-p_{ja}\right)\right\},$$

and does not depend on ψ_j. The consumer's product choice problem is then the usual maximization across the random utilities of each of the j = 1, ..., J products, evaluated at their respective optimal pack size choices. The probability of observing the choice of product k is then

$$\Pr\left(k;\theta,a^*\right) = \frac{\exp\left(\alpha_1\bar\psi_k + \alpha_1\ln x_{a_k^*} + \alpha_2\ln\left(y-p_{a_k^*}\right)\right)}{\sum_{j=1}^{J}\exp\left(\alpha_1\bar\psi_j + \alpha_1\ln x_{a_j^*} + \alpha_2\ln\left(y-p_{a_j^*}\right)\right)}$$

where ak is the observed pack size chosen for brand k. One limitation of the pack size demand specification (58) is that the corresponding likelihood will not have full support. In particular, variation between pack sizes of a given brand, all else equal, will reject the model. In a panel data version of the model with consumer-specific parameters, within-consumer switching between pack sizes of the same brand over time, all else equal, would reject the model. Goettler and Clay (2011) and Narayanan et al. (2007) propose a potential solution to this issue, albeit in a different setting. The inclusion of consumer uncertainty over future quantity needs allows for random variation in pack sizes.
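The two-stage structure of this model — a deterministic within-brand pack size choice that does not involve ψ_j, followed by a logit choice across brands evaluated at their optimal sizes — can be sketched as follows (all parameter values hypothetical):

```python
import math

alpha1, alpha2 = 0.6, 0.4          # hypothetical taste parameters
psi_bar = {"A": 0.3, "B": 0.0}     # mean (log) quality intercepts, hypothetical
packs = {"A": [(32, 2.49), (64, 3.99)], "B": [(32, 2.29), (64, 3.79), (128, 5.99)]}
y = 20.0

def optimal_pack(brand):
    """Within-brand pack size choice is deterministic and does not involve psi_j."""
    return max(packs[brand],
               key=lambda xp: alpha1 * math.log(xp[0]) + alpha2 * math.log(y - xp[1]))

def brand_probs():
    """Logit across brands, each evaluated at its optimal pack size."""
    v = {}
    for j in packs:
        x, p = optimal_pack(j)
        v[j] = alpha1 * psi_bar[j] + alpha1 * math.log(x) + alpha2 * math.log(y - p)
    m = max(v.values())
    e = {j: math.exp(val - m) for j, val in v.items()}
    s = sum(e.values())
    return {j: e[j] / s for j in e}

best = {j: optimal_pack(j) for j in packs}
pr = brand_probs()
```

Because the size stage is deterministic, within-brand size switching (all else equal) has zero likelihood — the full-support limitation noted above.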




5 Moving beyond the basic neoclassical framework

5.1 Stock-piling, purchase incidence, and dynamic behavior

The models discussed so far have treated the timing of purchase as a static consumer decision. According to these models, a consumer allocates her entire budget to the essential numeraire (or outside good) when all of the products' prices exceed their corresponding reservation prices, as in Eq. (5) above. Indeed, the literature on price promotions has routinely reported a large increase in sales during promotion weeks (see for instance the literature survey by Blattberg and Neslin (1989) and the empirical generalizations in Blattberg et al. (1995)). However, in the case of storable products, consumers may accumulate an inventory and time their purchases strategically based on their expectations about future price changes. An empirical literature has found that price promotions affect both the quantity sold and the timing of purchases through purchase acceleration (e.g., Blattberg et al., 1981; Neslin et al., 1985; Gupta, 1991; Bell et al., 1999). This work estimates that purchase acceleration accounts for between 14 and 50 percent of the promotion effect on quantities sold. Purchase acceleration could simply reflect an increase in consumption. However, more recent work finds that the purchase acceleration could reflect strategic timing based on price expectations. Pesendorfer (2002) finds that while the quantity of ketchup sold is generally higher during periods of low prices, the level depends on past prices. Hendel and Nevo (2006b) find that the magnitude of the total sales response to a price discount in laundry detergent is moderated by the time since the last price discount. The quantity sold increases by a factor of 4.7 if there was not a sale in the previous week, but only by a factor of 2.0 if there was a sale in the previous week. Using household panel data, Hendel and Nevo (2003) also detect a post-promotion dip in sales levels.
Looking across 24 CPG categories, Hendel and Nevo (2006b) find that households pay 12.7% less than if they paid the average posted prices. Collectively, these findings suggest that households may be timing their purchases strategically to coincide with temporary price discounts. In this case, a static model of demand may over-estimate the own-price response. The potential bias on cross-price elasticities is not as clear. In the remainder of this section, we discuss structural approaches to estimating demand with stock-piling and strategic purchase timing based, in part, on price expectations. These models can be used to measure short- and long-term price response through counterfactuals. To the best of our knowledge, Blattberg et al. (1978) were the first to propose a formal economic model of consumer stock-piling based on future price expectations. In the spirit of Becker's (1965) household production theory, they treat the household as a production unit that maintains a stock of market goods to meet its consumption needs. While the estimation of such a model exceeded the computing power available at that time, Blattberg et al. (1978) find that observable household resources, such as home ownership, and shopping costs, such as car ownership and dual-income status, are strongly associated with deal-proneness. In the macroeconomics literature, Aguiar and Hurst (2007) extend the model to account for the time allocation between


“shopping” and “household production” to explain why older consumers tend to pay lower prices (i.e., find the discount periods). In the following sub-sections, we discuss more recent research that has estimated the underlying structure of a model of stock-piling.

5.1.1 Stock-piling and exogenous consumption

Erdem et al. (2003) build on the discrete choice model formulation in Section 3.3. Let t = 1, ..., T index time periods. At the start of each time period, a consumer has inventory i_t of a commodity and observes the prices p_t. The consumer can endogenously increase her inventory by purchasing quantities x_jkt of each of the j products, where k ∈ {1, ..., K} indexes the discrete set of available pack sizes. Denote the non-purchase decision as x_0t = 0. Assume the consumer incurs a shopping cost if she chooses to purchase at least one of the products: F(x_t; τ) = τ·I{Σ_{j=1}^J x_jt > 0}. Her total post-purchase inventory in period t is î_t = i_t + Σ_{j=1}^J x_jt.

After making her purchase decision, the consumer draws an exogenous consumption need that is unobserved to the analyst, ω_t ∼ F_ω(ω), and that she consumes from her inventory, î_t.34 Her total consumption during period t is therefore c_t = min(ω_t, î_t), which is consumed at a constant rate throughout the period.35 Assume also that the consumer is indifferent between the brands in her inventory when she consumes the commodity and that she consumes each of them in constant proportion: c_jt = (c_t/î_t)·î_jt.

If the consumer runs out of inventory before the end of the period, ω_t > c_t, she incurs a stock-out cost SC(ω_t, c_t; λ) = λ_0 + λ_1(ω_t − c_t). The consumer also incurs an inventory carrying cost each period based on the total average inventory held during the period. Her average inventory in period t is

$$\bar i_t = \begin{cases} \hat i_t - \dfrac{\omega_t}{2}, & \omega_t \le \hat i_t \\[4pt] \dfrac{\hat i_t}{\omega_t}\cdot\dfrac{\hat i_t}{2}, & \omega_t > \hat i_t \end{cases}$$

Her total inventory carrying cost is given by IC(î_t, ω_t; δ) = δ_0·ī_t + δ_1·ī_t². Assume the consumer has the following perfect substitutes consumption utility function each period:

$$U\left(\hat i_t,\tilde i_t,p_t,\omega_t;\theta\right) = \sum_{j=1}^{J}\psi_j c_{jt} + \psi_{J+1}\left(y-\sum_{j,k}p_{jkt}x_{jkt}\right) - F\left(x_t;\tau\right) - SC\left(\omega_t,c_t;\lambda\right) - IC\left(\hat i_t,\omega_t;\delta\right)$$

34 The consumer therefore makes purchase decisions in anticipation of her future expected consumption needs, as in Dubé (2004).
35 Sun et al. (2003) use a data-based approach that measures the exogenous consumption need as a constant consumption rate, based on the household's observed average quantity purchased.





$$= \frac{c\left(\hat i_t,\omega_t\right)}{\hat i_t}\,\tilde i_t + \psi_{J+1}\left(y-\sum_{j,k}p_{jkt}x_{jkt} - F\left(x_t;\tau\right)\right) - SC\left(\omega_t,c\left(\hat i_t,\omega_t\right);\lambda\right) - IC\left(\hat i_t,\omega_t;\delta\right)$$

where ı̃_t ≡ Σ_{j=1}^J ψ_j î_jt is the post-purchase quality-adjusted inventory, and where the shopping cost has been subsumed into the budget constraint. The vector θ = (ψ_1, ..., ψ_{J+1}, λ_0, λ_1, δ_0, δ_1, τ) contains all the model's parameters.

The three state variables are summarized by s_t = (i_t, ı̃_t, p_t). The inventory state variables evolve as follows:

$$i_{t+1} = \left(i_t + \sum_{j=1}^{J}x_{jt}\right)\left(1-\frac{c_t}{\hat i_t}\right)$$
$$\tilde i_{t+1} = \left(\tilde i_t + \sum_{j=1}^{J}\psi_j x_{jt}\right)\left(1-\frac{c_t}{\hat i_t}\right)$$

Assume in addition that consumers' price beliefs are known to the analyst and evolve according to the Markov transition density36 p_{t+1} ∼ f_p(p_{t+1}|p_t).37 Therefore, the state vector also follows a Markov process, which we denote by the transition density f_s(s′|s, x_jk). The consumer's purchase problem is dynamic since she can control her future inventory states with her current purchase decision. Assuming the consumer discounts future utility at a rate β ∈ (0,1), the value function associated with her purchase decision problem in state s_t is

$$v\left(s_t,\varepsilon_t\right) = \max_{j,k}\left\{v_{jk}\left(s_t;\theta\right) + \varepsilon_{jkt}\right\} \qquad (60)$$

where ε_jkt ∼ i.i.d. EV(0,1) is a stochastic term known to the household at time t but not to the analyst. v_jk(s) is the choice-specific value function associated with choosing product j and pack size k in state s:

$$v_{jk}\left(s;\theta\right) = \int U\left(s_t,\omega;\theta\right) f_\omega(\omega)\,d\omega + \beta\int v\left(s',\varepsilon'\right) f_s\left(s'|s,x_{jk}\right) f_\varepsilon\left(\varepsilon'\right)\,d\left(s',\varepsilon'\right). \qquad (61)$$

When the taste parameters θ are known, the value functions in (60) and (61) can be solved numerically (see for instance Erdem et al. (2003) for technical details).

36 Typically, a rational expectations assumption is made and the price process, f_p(p_{t+1}|p_t), is estimated in a first stage using the observed price series. An interesting exception is Erdem et al. (2005) who elicit consumers' subjective price beliefs through a consumer survey.
37 In four CPG case studies, Liu and Balachander (2014) find that a proportional hazard model for the price process fits the price data better and leads to a better fit of the demand model when used to capture consumers' price expectations.
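The fixed point defined by (60)-(61) can be computed by standard value-function iteration. The sketch below solves a deliberately tiny stand-in for the stock-piling problem — one product and pack size, a five-point inventory grid, a two-state Markov price process, and EV(0,1) action shocks so that the expected value takes the familiar log-sum form. All functional forms and parameter values are hypothetical simplifications, not those of Erdem et al. (2003):

```python
import numpy as np

beta, psi, stockout, hold, alpha = 0.95, 2.0, 3.0, 0.1, 1.0  # hypothetical parameters
prices = np.array([1.0, 0.6])                   # regular vs. promoted price
P = np.array([[0.8, 0.2], [0.7, 0.3]])          # Markov price transitions
inv_max, actions = 4, [0, 1, 2]                 # inventory grid and purchase quantities

def flow(i, a, p):
    """One-period utility: consume one unit if available, pay for purchases, hold the rest."""
    post = min(i + a, inv_max)
    consumed = min(post, 1)
    u = psi * consumed - (stockout if post == 0 else 0.0) - alpha * p * a
    return u - hold * (post - consumed), post - consumed

def solve(tol=1e-8):
    """Iterate on the Bellman operator; EV shocks make E[max] a log-sum over actions."""
    V = np.zeros((inv_max + 1, len(prices)))
    while True:
        v_choice = np.zeros((inv_max + 1, len(prices), len(actions)))
        for i in range(inv_max + 1):
            for s in range(len(prices)):
                for a in actions:
                    u, i_next = flow(i, a, prices[s])
                    v_choice[i, s, a] = u + beta * P[s] @ V[i_next]
        V_new = np.log(np.exp(v_choice).sum(axis=2))   # inclusive value per state
        if np.abs(V_new - V).max() < tol:
            return v_choice
        V = V_new

v = solve()

def buy_prob(i, s):
    """Logit probability of purchasing a positive quantity in state (inventory i, price s)."""
    e = np.exp(v[i, s] - v[i, s].max())
    return 1.0 - e[0] / e.sum()
```

The solved policy exhibits the expected stock-piling patterns: purchase probabilities rise in the promoted price state and fall with inventory on hand.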


Suppose the researcher observes a consumer's sequence of choices, x̂ = (x̂_1, ..., x̂_T). Conditional on the state, s, the probability that the consumer's optimal choice is product j and pack size k has the usual multinomial logit demand form:

$$\Pr\left(x_{jk}|s;\theta\right) = \frac{\exp\left(v_{jk}(s;\theta)\right)}{\exp\left(v_0(s;\theta)\right)+\sum_{j'=1}^{J}\sum_{k'=1}^{K}\exp\left(v_{j'k'}(s;\theta)\right)}, \qquad x_{jk}\in\left\{x_{11},\ldots,x_{JK}\right\}. \qquad (62)$$

To accommodate the fact that the two inventory state variables, i and ı̃, are not observed, we partition the state as follows: s = (p, s̃), where s̃ = (i, ı̃). Since we do not observe the initial values, s̃_0, we have a classic initial conditions problem (Heckman, 1981). We resolve the initial conditions problem by assuming there is a true initial state, s̃_0, with density f_s(s̃_0; θ). We can now derive the density associated with the consumer's observed sequence of purchase decisions, x̂:

$$f\left(\hat x;\theta\right) = \int \prod_{t=1}^{T}\left(\int \prod_{j,k}\Pr\left(x_{jk}|p_t,\tilde s_t,\omega,\tilde s_0;\theta\right)^{I\left\{x_{jk}=\hat x_t\right\}} f_\omega(\omega)\,d\omega\right) f_s\left(\tilde s_0;\theta\right)\,d\tilde s_0. \qquad (63)$$

Consistent estimates of the parameters θ can then be obtained via simulated maximum likelihood.
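The integrals in (63) are typically evaluated by Monte Carlo. The sketch below illustrates the mechanics with a toy stand-in for the model-implied choice probability — everything here (the logit stand-in, the exponential densities for ω and the initial inventory) is hypothetical, chosen only to show the simulation loop:

```python
import numpy as np

rng = np.random.default_rng(1)

def choice_prob(x, p, inv, omega, theta):
    """Toy stand-in for the model-implied Pr(x | state): a logit in price and inventory."""
    v_buy = theta[0] - theta[1] * p - theta[2] * (inv + omega)
    pr_buy = 1.0 / (1.0 + np.exp(-v_buy))
    return pr_buy if x == 1 else 1.0 - pr_buy

def simulated_likelihood(x_seq, p_seq, theta, n_draws=500):
    """Average the sequence likelihood over draws of omega_t and the initial inventory."""
    total = 0.0
    for _ in range(n_draws):
        inv = rng.exponential(1.0)          # draw from the assumed initial-state density
        contrib = 1.0
        for x, p in zip(x_seq, p_seq):
            omega = rng.exponential(0.5)    # consumption-need draw from f_omega
            contrib *= choice_prob(x, p, inv, omega, theta)
            inv = max(inv + x - min(inv + x, omega), 0.0)  # inventory transition
        total += contrib
    return total / n_draws

ll = simulated_likelihood([1, 0, 0, 1], [0.8, 1.0, 1.0, 0.7], theta=(1.0, 1.5, 0.3))
```

In a real application the inner probability would come from the solved dynamic program, and the simulated likelihood would be maximized over θ.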

5.1.2 Stock-piling and endogenous consumption

The model in the previous section assumed an exogenous consumption rate, which implies that any purchase acceleration during a price discount period reflects stock-piling. Sun (2005), Hendel and Nevo (2006a), and Liu and Balachander (2014) allow for endogenous consumption. This important extension allows for two types of response to a temporary price cut: in addition to stock-piling, consumers can potentially increase their consumption of the discounted product. We focus on the formulation in Hendel and Nevo (2006a), which reduces part of the computational burden of Erdem et al. (2003) by splitting the likelihood into a static and a dynamic component. A key assumption is that consumers only value the brand at the time of purchase, so that the optimal consumption decisions are independent of specific brands and depend only on the quantities purchased. During period t, the consumer derives the following consumption utility38

38 Hendel and Nevo (2006a) also allow for point-of-purchase promotional advertising, like feature ads and displays, to shift utility in their specification.




$$U\left(c_t+\omega_t,i_t,p_t;\theta\right) = u\left(c_t+\omega_t;\gamma_c\right) + \sum_{j=1}^{J}\sum_{k=1}^{K} I_{\left\{x_{jkt}>0\right\}}\left(\psi_{J+1}\left(y-p_{jkt}\right)+\psi_{jk}\right) - C\left(i_{t+1};\lambda_c\right) \qquad (64)$$

where, as before, we index the products by j = 1, ..., J and the discrete pack sizes available by k ∈ {1, ..., K}. u(c + ω; γ_c) is the consumption utility with taste parameters γ_c, and ω ∼ F_ω(ω) is a random "consumption need" shock. The start-of-period inventory is i_t = i_{t−1} + Σ_{j,k} x_{jkt−1} − c_{t−1}. C(i_{t+1}; λ_c) is the inventory carrying cost with cost-related parameters λ_c. As before, x denotes a purchase quantity (as opposed to consumption), and ψ_{J+1} captures the marginal utility of the numeraire good.

The three state variables are summarized by s_t = (i_t, p_t, ω_t). In addition, consumers form Markovian price expectations: p_{t+1} ∼ F_p(p_{t+1}|p_t). Unlike Erdem et al. (2003), the consumption need ω ∼ F_ω(ω) is a state variable. The state variables, s_t, follow a Markov process with transition density f_s(s′|s, x_jk). The value function associated with the consumer's purchase decision problem during period t is

$$v\left(s_t,\varepsilon_t\right) = \max_{j,k}\left\{v_{jk}\left(s_t\right) + \varepsilon_{jkt}\right\} \qquad (65)$$

where s_t is the state in period t, ε_jkt ∼ i.i.d. EV(0,1) is a stochastic term known to the household at time t but not to the analyst, and v_jk(s) is the choice-specific value function associated with choosing product j and pack size k in state s:

$$v_{jk}\left(s_t\right) = \psi_{J+1}\left(y-p_{jkt}\right) + \psi_{jk} + M\left(s_t,x_{jk};\theta_c\right) \qquad (66)$$


$ where M(st , xj k ; θc ) = max{u(c + ωt ; γc ) − C(it+1 ; λc ) + β v(s  , ε)fs (s  |s, xj k , c) c

and θ_c are the consumption-related parameters.

Hendel and Nevo (2006a) propose a simplified three-step approach to estimating the model parameters. The value function (65) can be simplified by studying Eq. (66), which indicates that consumption is only affected by the quantity purchased, not the specific brand chosen. In a first step, it is straightforward to show that consistent estimates of the brand taste parameters, ψ = (ψ_1, ..., ψ_{J+1}), can be obtained from the following standard multinomial logit model of brand choice across all brands available in pack size k:

$$\Pr\left(j|s_t,k;\psi\right) = \frac{\exp\left(\psi_{J+1}\left(y-p_{jkt}\right)+\psi_{jk}\right)}{\sum_{i}\exp\left(\psi_{J+1}\left(y-p_{ikt}\right)+\psi_{ik}\right)}.$$


In a second step, define the expected value of the optimal brand choice, conditional on pack size, as follows:

$$\eta_{kt} = \ln\left\{\sum_{j}\exp\left(\psi_{J+1}\left(y-p_{jkt}\right)+\psi_{jk}\right)\right\}. \qquad (67)$$

Using an idea from Melnikov (2013), assume that η_{t−1} is a sufficient statistic for η_t, so that F(η_t|s_{t−1}) can be summarized by F(η_t|η_{t−1}). The size-specific inclusive values can then be computed with the brand taste parameters, ψ, and Eq. (67), and then used to estimate the distribution F(η_t|η_{t−1}).

In a third step, the quantity choice state can be defined as s̃_t = (i_t, η_t, ω_t), which reduces dimensionality by eliminating any brand-specific state variables. The value function associated with the consumer's quantity decision problem can now be written in terms of these size-specific "inclusive value" terms:

$$v\left(\tilde s_t,\varepsilon_t\right) = \max_{c,k}\left\{u\left(c+\omega_t;\gamma_c\right) - C\left(i_{t+1};\lambda_c\right) + \eta_{kt} + \beta\int v\left(\tilde s',\varepsilon\right)f_s\left(\tilde s'|\tilde s,x_k,c\right)f_\varepsilon(\varepsilon)\,d\left(\tilde s',\varepsilon\right)\right\}. \qquad (68)$$

Similarly, the pack-size choice-specific value functions can also be written in terms of these size-specific "inclusive value" terms:

$$v_k\left(\tilde s_t\right) = \eta_{kt} + M_k\left(\tilde s_t;\theta_c\right)$$


$      where Mk (˜st ; θc ) = max u (c + ωt ; γc ) − C (it+1 ; λc ) + β v s˜  , ε fs s˜  |˜s , xk , c c   fε (ε) d s˜  , ε . The corresponding optimal pack size choice probabilities are then: exp (ηk + Mk (˜st ; θc )) Pr (k|˜st ; θc ) = . st ; θc )) i exp (ηi + Mi (˜ The density associated with the consumer’s observed sequence of pack size decisions, xˆ , is39 :

$$f\left(\hat x;\theta_c\right) = \int \prod_{t=1}^{T}\left(\int \Pr\left(\hat x_t|\tilde s_t,\tilde s_0;\theta_c\right) f_\omega(\omega)\,d\omega\right) f_\omega\left(\omega_0\right) f_s\left(\tilde s_0;\theta_c\right)\,d\left(\omega_0,\tilde s_0\right). \qquad (70)$$

Consistent estimates of the parameters θ_c can then be obtained via simulated maximum likelihood. A limitation of this three-step approach is that it does not allow for persistent, unobserved heterogeneity in tastes.

39 The initial conditions can be resolved in a similar manner as in Erdem et al. (2003).
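The size-specific inclusive values in (67) are just log-sum terms, so the first two steps can be computed directly from prices and the first-stage brand tastes. A minimal sketch (all values hypothetical):

```python
import numpy as np

psi_J1 = 1.0                                              # marginal utility of numeraire
psi_jk = np.array([[0.5, 0.7], [0.2, 0.4], [0.0, 0.1]])   # brand x size intercepts
p_jkt = np.array([[2.0, 3.5], [1.8, 3.2], [1.5, 2.9]])    # brand x size prices
y = 20.0

# Static brand-size utilities (the non-dynamic part of Eq. (66)).
v = psi_J1 * (y - p_jkt) + psi_jk

# Inclusive value per pack size, Eq. (67): the log of the logit denominator.
eta = np.log(np.exp(v).sum(axis=0))

# Brand choice conditional on size k depends only on within-size utilities,
# so the state enters the quantity problem only through the scalars eta[k].
pr_brand = np.exp(v) / np.exp(eta)
```

This dimension reduction is what lets the third-step dynamic program carry η_t instead of the full vector of brand-specific prices.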




5.1.3 Empirical findings with stock-piling models

In a case study of household purchases of ketchup, Erdem et al. (2003) find that the dynamic stock-piling model described above fits the data well in-sample, in particular the timing between purchases. Using simulations based on the estimates, they find that a product's sales response to a temporary price cut mostly reflects purchase acceleration and category expansion, as opposed to brand switching. This finding is diametrically opposite to the conventional wisdom that "brand switchers account for a significant portion of the immediate increase in volume due to sales promotion" (Blattberg and Neslin, 1989, p. 82). The cross-price elasticities between brands are found to be quite small compared to those from static choice models, although the exact magnitude is sensitive to the specification of the price process representing consumers' expectations. They find much larger cross-price elasticities in response to permanent price changes and conclude that long-run price elasticities are likely more relevant to policy analysts who want to measure the intensity of competition between brands. In case studies of packaged tuna and yogurt, Sun (2005) finds that consumption increases with the level of inventory and decreases in the level of promotional uncertainty. While promotions do lead to brand-switching, she also finds that they increase consumption. A model that assumes exogenous consumption over-estimates the extent of brand switching. In a case study of laundry detergents, Hendel and Nevo (2006a) focus on the long-run price elasticities by measuring the effects of permanent price changes. They find that a static model generates 30% larger price elasticities than the dynamic model. They also find that the static model underestimates cross-price elasticities. Some of the cross-price elasticities in the dynamic model are more than 20 times larger than those from the static model.
Finally, the static model overestimates the degree of substitution to the outside good by 200%. Seiler (2013) builds on Hendel and Nevo’s (2006a) specification by allowing consumers with imperfect price information to search each period before making a purchase decision. In a case study of laundry detergent purchases, Seiler’s (2013) parameter estimates imply that 70% of consumers do not search each period. This finding highlights the importance of merchandizing efforts, such as in-store displays, to help consumers discover low prices. In addition, by using deeper price discounts, a firm can induce consumers to engage in more price search which can increase total category sales. This increase in search offsets traditional concerns about inter-temporal cannibalization due to strategic purchase timing.
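The purchase-acceleration mechanism behind these findings can be illustrated with a deliberately stylized simulation. This is a sketch only: a myopic pantry-filling rule stands in for the dynamic program, and the function names and parameter values are hypothetical.

```python
# Stylized stock-piling sketch: the consumer consumes one unit per period,
# buys one unit when stocked out, and fills the pantry when price is cut.
def simulate(prices, reg_price, capacity=4):
    inventory, purchases = 0, []
    for p in prices:
        buy = 0
        if p < reg_price:                  # temporary discount: stockpile
            buy = capacity - inventory
        elif inventory == 0:               # otherwise buy only at stock-out
            buy = 1
        inventory += buy
        purchases.append(buy)
        inventory -= 1 if inventory > 0 else 0   # consume one unit
    return purchases

baseline = simulate([2.0] * 8, reg_price=2.0)                 # no promotion
promo = simulate([2.0, 2.0, 1.5] + [2.0] * 5, reg_price=2.0)  # week-3 price cut
```

The promotion week shows a large spike followed by a post-promotion dip of zero purchases, while total volume over the horizon is unchanged: a large short-run "sales response" with no long-run change, which is exactly the pattern that static cross-sectional comparisons would misattribute to brand switching.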

5 Moving beyond the basic neoclassical framework

5.2 The endogeneity of marketing variables

The frameworks discussed thus far focus entirely on the demand side of the market. However, many of the most critical demand-shifting variables at the point of purchase are marketing-mix variables such as prices and promotions, including merchandizing activities like temporary discounts, in-aisle displays, and feature advertising. If these marketing variables are set strategically by firms with more consumer information than the researcher, any resulting correlation with unobserved components of demand could undermine the consistency of the likelihood-based estimates discussed thus far. In fact, one of the dominant themes in the empirical literature on aggregate demand estimation is the resolution of the potential endogeneity of supply-side variables (e.g., Berry, 1994; Berry et al., 1995). While most of the literature has focused on obtaining consistent demand estimates in the presence of endogenous prices, bias could also arise from the endogeneity of advertising, promotions, and other marketing variables.40 Surprisingly little attention has been paid to the potential endogeneity of marketing variables in the estimation of individual consumer-level demand. In the remainder of this section, we focus on the endogeneity of prices, even though many of the key themes extend readily to other endogenous demand-shifting variables.41

Suppose a sample of i = 1, ..., N consumers each makes a discrete choice among j = 1, ..., J product alternatives and a J + 1 "no purchase" alternative. Each consumer is assumed to obtain the following conditional indirect utility from choice j:

V_ij = ψ_j − αp_ij + ε_ij
V_{i,J+1} = ε_{i,J+1}

where ε_i ∼ i.i.d. F(ε) and p_ij is the price charged to consumer i for alternative j. Demand estimation is typically carried out by maximizing the corresponding likelihood function:

L(θ|y) = ∏_i ∏_j Pr(j; θ)^{y_ij}  (71)


where Pr(j; θ) ≡ Pr(V_ij ≥ V_ik, ∀k ≠ j) and y_i = (y_i1, ..., y_i,J+1) indicates which of the j = 1, ..., J + 1 alternatives was chosen by consumer i. If cov(p_ij, ε_ij) ≠ 0, then the maximum likelihood estimator θ̂_MLE based on (71) may be inconsistent, since the likelihood omits information about ε. In general, endogeneity can arise in three ways (see Wooldridge, 2002, for example):
1. Simultaneity: Firms observe and condition on ε_i when they set their prices.
2. Self-selection: Certain types of consumers systematically find the lowest prices.
3. Measurement error: The researcher observes a noisy estimate of true prices, p̃_ij = p_ij + η_ij.
Most of the emphasis has been on simultaneity bias, whereby endogeneity arises because of the strategic pricing decisions of firms. Measurement error is not typically discussed in the demand estimation literature. However, many databases contain

40 For instance, Manchanda et al. (2004) address endogenous detailing levels across physicians.
41 In the empirical consumption literature, the focus has been more on the endogeneity of household incomes than the endogeneity of prices (see for instance Blundell et al., 1993). Since the analysis is typically at the broad commodity-group level (e.g., food), the concern is that household budget shares are determined simultaneously with consumption quantities.
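The simultaneity channel can be made concrete with a small Monte Carlo sketch. Everything below is hypothetical (the data-generating process, names, and parameter values); it simply shows that a likelihood such as (71), which omits a demand shock the firm prices on, yields a badly biased price coefficient.

```python
import math, random

random.seed(0)

def make_data(endogenous, T=200, N=20, psi=1.0, alpha=2.0):
    """Binary purchase data; price responds to the demand shock xi when endogenous."""
    rows = []
    for _ in range(T):
        xi = random.gauss(0, 1)                 # common shock, seen by the firm
        driver = xi if endogenous else random.gauss(0, 1)
        p = 1.0 + 0.5 * driver + 0.3 * random.random()
        for _ in range(N):
            v = psi - alpha * p + xi            # xi is omitted by the estimator below
            y = 1 if random.random() < 1 / (1 + math.exp(-v)) else 0
            rows.append((p, y))
    return rows

def fit_logit(rows, iters=25):
    """Newton-Raphson MLE of V = psi - alpha * p, ignoring xi as in (71)."""
    psi = alpha = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for p, y in rows:
            z = max(-30.0, min(30.0, psi - alpha * p))
            pr = 1 / (1 + math.exp(-z))
            w = pr * (1 - pr)
            g0 += y - pr
            g1 += -(y - pr) * p
            h00 += w
            h01 += -w * p
            h11 += w * p * p
        det = h00 * h11 - h01 * h01
        psi += (h11 * g0 - h01 * g1) / det
        alpha += (-h01 * g0 + h00 * g1) / det
    return psi, alpha

_, alpha_exo = fit_logit(make_data(endogenous=False))
_, alpha_endo = fit_logit(make_data(endogenous=True))
```

With exogenous prices the estimate sits near the (scale-attenuated) true value; when prices load on the omitted shock, it collapses far below it, which is the simultaneity bias described above.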



time-aggregated average prices rather than the actual point-of-purchase price, which could lead to classical measurement error. To the best of our knowledge, a satisfactory solution has yet to be developed for demand estimation with this type of measurement error.

In many marketing settings, endogeneity bias could also arise from the self-selection of consumers into specific marketing conditions based on unobserved (to the researcher) aspects of their tastes. For instance, unobserved marketing promotions like coupons could introduce both measurement error and selection bias if certain types of consumers are systematically more likely to find/have a coupon and use it (Erdem et al., 1999). Similarly, Howell et al. (2016) propose an approach to resolve the price self-selection bias associated with consumers choosing between non-linear pricing contracts based on observable (to the researcher) aspects of their total consumption needs. If consumers face incomplete information about the choice set, then selection could arise from price search and the formation of consumers' consideration sets (e.g., Honka, 2014). The topics of consumer search and the formation of consideration sets are discussed in more detail in the chapters in this volume on branding and on search. Finally, the potential self-selection of consumers into discount and regular prices, based on their unobserved (to the researcher) potential stock-piling behavior during promotional periods in anticipation of future price increases, could also bias preference estimates (e.g., Erdem et al., 2003; Hendel and Nevo, 2006a).

For the remainder of this discussion, we focus on the price endogeneity associated with simultaneity bias. Suppose that j = 1, ..., J consumer goods in a product category are sold in t = 1, ..., T static, spot markets by single-product firms playing a Bertrand-Nash pricing game. Typically, a market is a store-week, since stores tend to set their prices at a weekly frequency and most categories in the store are "captured markets" in the sense that consumers are unlikely to base their store choices on each of the tens of thousands of prices charged across the products carried in a typical supermarket. On the demand side, consumers make choices in each market t to maximize their choice-specific utility

V_ijt = v_j(w_t, p_t; θ) + ξ_jt + ε_ijt  (72)
V_{i,J+1,t} = ε_{i,J+1,t}

where we distinguish between the exogenous point-of-purchase utility shifters, w_t, and the prices, p_t. In addition, we now specify a composite error term consisting of the idiosyncratic utility shock, ε_ijt ∼ i.i.d. EV(0, 1), and the common shock, ξ_jt ∼ i.i.d. F_ξ(ξ), to control for potential product-j-specific characteristics that are observed by the firms when they set prices, but not by the researcher (Berry, 1994). Consumers have corresponding choice probabilities, Pr(j; θ|w_t, p_t, ξ_t), for each of the j alternatives, including the J + 1 no-purchase alternative. Price endogeneity arises when the firms condition on ξ_t when setting their prices and cov(p_t, ξ_t) ≠ 0.

A consistent and efficient estimator can be constructed by maximizing the following likelihood:

L(θ|y, p) = ∏_t ∫ ··· ∫ ∏_j Pr(j; θ|w_t, p_t, ξ)^{y_ijt} f_p(p_t|ξ) f_ξ(ξ) dξ_1 ... dξ_J.



In practice, the form of the likelihood of prices may not be known, and ad hoc assumptions about f_p(p|ξ) could lead to additional specification-error concerns. We now discuss the trade-offs between full-information and limited-information approaches.

5.2.1 Incorporating the supply side: A structural approach

An efficient "full-information" solution to the price endogeneity bias consists of modeling the data-generating process for prices and deriving the density f_p(p|ξ) structurally. Since consumer goods are typically sold in a competitive environment, this approach requires specifying the structural form of the pricing game played by the various suppliers. The joint density of prices is then induced by the equilibrium of the game (e.g., Yang et al., 2003; Draganska and Jain, 2004; Villas-Boas and Zhao, 2005).

On the supply side of the model in Eq. (72), assume the J firms play a static, Bertrand-Nash game for which the prices each period satisfy the following necessary conditions for profit maximization:

Pr(j; θ|w_t, p_t, ξ_t) + (p_jt − c_jt) ∂Pr(j; θ|w_t, p_t, ξ_t)/∂p_jt = 0  (73)

where c_jt = b_jt′γ + η_jt is firm j's marginal cost in market t, b_jt are observable cost shifters, like factor prices, γ are the factor weights, and η_t ∼ i.i.d. F(η) is a vector of cost shocks that are unobserved by the researcher. We use the static Nash concept as an example; alternative modes of conduct (including non-optimal behavior) could easily be accommodated instead. In general, these first-order conditions (73) will create covariance between prices and demand shocks, cov(p_t, ξ_t) ≠ 0.

As long as the system of first-order conditions (73) generates a unique vector of equilibrium prices, we can then derive the density of prices f_p(p|ξ_t, c_t, w_t) = f(η_t|ξ_t)|J_{η→p}|, where J_{η→p} is the Jacobian of the transformation from η to prices. A consistent and efficient estimate of the parameters Θ = (θ′, γ′)′ can then be obtained by maximizing the likelihood function42

L(Θ|y, p) = ∏_t ∫∫ ∏_j Pr(j; θ|w_t, p_t, ξ_t)^{y_ijt} f_p(p_t|ξ_t, c_t, w_t) f(ξ_t) dη dξ.  (74)




42 Yang et al. (2003) propose an alternative Bayesian MCMC estimator.
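For concreteness, under logit demand the own-price derivative is ∂Pr(j)/∂p_j = −α Pr(j)(1 − Pr(j)), so the first-order conditions (73) rearrange to the markup rule p_j = c_j + 1/(α(1 − s_j)), which can be solved by fixed-point iteration. A minimal sketch with hypothetical parameter values:

```python
import math

def logit_shares(prices, psi, alpha):
    """Choice probabilities for J inside goods plus a no-purchase option."""
    expv = [math.exp(ps - alpha * p) for ps, p in zip(psi, prices)]
    denom = 1.0 + sum(expv)               # the "1" is the outside option
    return [e / denom for e in expv]

def solve_bertrand_nash(psi, alpha, costs, tol=1e-12, max_iter=1000):
    """Iterate p_j = c_j + 1 / (alpha * (1 - s_j)) to a fixed point of (73)."""
    prices = list(costs)
    for _ in range(max_iter):
        shares = logit_shares(prices, psi, alpha)
        new = [c + 1.0 / (alpha * (1.0 - s)) for c, s in zip(costs, shares)]
        if max(abs(a - b) for a, b in zip(new, prices)) < tol:
            return new
        prices = new
    return prices

psi, alpha, costs = [1.0, 0.5], 2.0, [0.8, 0.6]
p_star = solve_bertrand_nash(psi, alpha, costs)
s_star = logit_shares(p_star, psi, alpha)
```

At the solution, each condition s_j − (p_j − c_j)·α·s_j(1 − s_j) = 0 holds to numerical precision. With richer demand systems or multi-product firms, this uniqueness and stability is no longer guaranteed, which is the coherency concern discussed below.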



Two key concerns with this approach are as follows. First, in many settings, pricing conduct may be more sophisticated than the single-product, static Bertrand-Nash setting characterized by (73). Mis-specification of the pricing conduct would lead to a mis-specification of the density f_p(p|ξ_t), which could lead to bias in the parameter estimates. Yang et al. (2003) address this problem by testing between several different forms of static pricing conduct. An advantage of their Bayesian estimator is the ability to re-cast the conduct test as a Bayesian decision-theory problem of model selection. Villas-Boas and Zhao (2005) incorporate conduct parameters into the system of first-order necessary conditions (73), where specific values of the conduct parameter nest various well-known pricing games.

Second, even if we can assume the existence of a price equilibrium, it is difficult to prove uniqueness of the solution to the system of first-order necessary conditions (73) for most demand specifications Pr(j; θ|w_t, p_t, ξ_t), even in our simple static Bertrand-Nash pricing game. This non-uniqueness problem translates into a coherency problem for the maximum likelihood estimator based on (74). The multiplicity problem would likely be exacerbated in more sophisticated pricing games involving dynamic conduct, multi-product firms, and channel interactions. Berry et al. (1995) avoid this problem by using a less efficient GMM estimation approach that does not require computing the Jacobian term J_{η→p}. Another potential direction for future research might be to recast (71) as an incomplete model and to use partial identification for inference on the supply and demand parameters (e.g., Tamer, 2010).

A more practical concern is the availability of exogenous variables, b_jt, that shift prices but are plausibly excluded from demand. Factor prices and other cost-related factors from the supply side may be available. In the absence of any exclusion restrictions, identification of the demand parameters will then rely on the assumed structure of f_p(p|ξ)f(ξ).

The full-information approaches have thus far produced mixed evidence on the endogeneity bias in the demand parameters in a small set of empirical case studies. Draganska and Jain (2004) and Villas-Boas and Zhao (2005) find substantial bias, especially in the price coefficient α. However, in a case study of light beer purchases, Yang et al. (2003) find that the endogeneity bias may be an artifact of omitted heterogeneity in the demand specification: once they allow for unobserved demand heterogeneity, they obtain comparable demand estimates regardless of whether they incorporate supply-side information into the likelihood. Interestingly, in a study of targeted detailing to physicians, Manchanda et al. (2004) find that incorporating the supply side not only resolves asymptotic bias in the estimates of the demand parameters, but also yields a substantial improvement in efficiency.43

43 Manchanda et al. (2004) address a much more sophisticated form of endogeneity bias, whereby the detailing levels are coordinated with the firm's posterior beliefs about a physician's response coefficients, as opposed to an additive error component as in the cases discussed above.
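The limited-information logic behind the GMM alternative can be sketched in the simplest possible case: with plain logit demand and a single inside good, the mean utility inverts analytically as δ_t = ln s_1t − ln s_0t (Berry, 1994), and an excluded cost shifter b_t then instruments for price. The DGP, names, and values below are hypothetical:

```python
import math, random

random.seed(1)

# One inside good per market t; xi is a demand shock the firm prices on.
T, psi_true, alpha_true = 300, 1.0, 2.0
data = []                                  # (delta, p, b) by market
for _ in range(T):
    xi = random.gauss(0, 0.5)
    b = random.random()                    # excluded cost shifter (instrument)
    p = 1.0 + 0.8 * b + 0.6 * xi           # price responds to the demand shock
    v = psi_true - alpha_true * p + xi
    s1 = math.exp(v) / (1 + math.exp(v))   # logit share vs. the outside good
    delta = math.log(s1) - math.log(1 - s1)  # Berry inversion (exact here)
    data.append((delta, p, b))

def slope(data, use_iv):
    """Slope of delta on p: OLS, or just-identified IV using b."""
    n = len(data)
    md = sum(d for d, _, _ in data) / n
    mp = sum(p for _, p, _ in data) / n
    mb = sum(b for _, _, b in data) / n
    z = [(b - mb) if use_iv else (p - mp) for _, p, b in data]
    num = sum(zi * (d - md) for zi, (d, _, _) in zip(z, data))
    den = sum(zi * (p - mp) for zi, (_, p, _) in zip(z, data))
    return num / den

alpha_ols = -slope(data, use_iv=False)
alpha_iv = -slope(data, use_iv=True)
```

Because prices load on ξ, OLS on the inverted mean utilities understates α, while the cost-shifter IV recovers it; this is the exclusion-restriction logic referenced above.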

5.2.2 Incorporating the supply side: A reduced-form approach

As explained in the previous section, the combination of potential specification error and a potential multiplicity of equilibria are serious disadvantages of full-information approaches. Several studies have therefore proposed less efficient, limited-information approaches that are more agnostic about the exact data-generating process on the supply side.

Villas-Boas and Winer (1999) use a more agnostic approach that is reminiscent of two-stage least squares estimators in the linear-models setting. Rather than specify the structural form of the pricing game on the supply side, they instead model the reduced form of the equilibrium prices

p_jt = W_j(w_t, b_t; λ) + ζ_jt  (75)

where b_t are again exogenous price shifters that are excluded from the demand side, and ζ_t is a random price shock such that (ξ_t, ζ_t) ∼ F(ξ, ζ) and b_t is independent of (ξ_t, ζ_t). It is straightforward to derive f(p|ξ_t, b_t, w_t) = f(ζ_t|ξ_t), since the linearity obviates the need to compute a Jacobian. A consistent "limited-information" estimate of the parameters Θ = (θ′, λ′)′ can then be obtained by substituting this density into the likelihood function (74). While this approach does not require specifying pricing conduct, unlike a two-stage least squares estimator, linearity is not an innocuous assumption. Any specification error in the ad hoc "reduced form" will potentially bias the demand estimates. For instance, the first-order necessary conditions characterizing equilibrium prices in (73) would not likely reduce to a specification in which the endogenous component of prices is an additive, Gaussian shock. Conley et al. (2008) address this problem by using a semi-parametric, mixture-of-Normals approximation of the density over (ξ_t, ζ_t).

A separate stream of work has developed instrumental-variables methods to handle the endogeneity of prices. Chintagunta et al. (2005) conduct a case study of product categories in which, each store-week, a large number of purchases is observed for each product alternative. On the demand side, they can then directly estimate the weekly mean utilities as "fixed effects"

V_ijt = v_j(w_t, p_t; θ) + ξ_jt + ε_ijt
V_{i,J+1,t} = ε_{i,J+1,t}

without needing to model the supply side. Using standard maximum likelihood estimation techniques, they estimate the full set of brand-week effects {ψ_jt}_{j,t} in a first stage.44 Following Nevo's (2001) approach for aggregate data, the mean responses to marketing variables are obtained in a second-stage minimum-distance procedure that projects the brand-week effects onto the product attributes, w_t, and prices, p_t,

ψ̂_jt = v_j(w_t, p_t; θ) + ξ_jt  (76)


44 Their estimator allows for unobserved heterogeneity in consumers’ responses to marketing variables.




using instrumental variables (w_t, b_t) to correct for the potential endogeneity of prices. Unlike in Villas-Boas and Winer (1999), the linearity in (75) does not affect the consistency of the demand estimates. Even after controlling for persistent, unobserved consumer taste heterogeneity, Chintagunta et al. (2005) find strong evidence of endogeneity bias in both the levels of the response parameters and in the degree of heterogeneity. A limitation of this approach is that any small-sample bias in the brand-week effects will potentially lead to inconsistent estimates.

In related work, Goolsbee and Petrin (2004) and Chintagunta and Dubé (2005) use an alternative approach that obtains exact estimates of the mean brand-week utilities by combining the individual purchase data with store-level data on aggregate sales. Following Berry et al. (1995) (BLP), the weekly mean brand-week utilities are inverted out of the observed weekly market share data, s_t,

ψ_t = Pr^{−1}(s_t)  (77)

where Pr^{−1}(s_t) is the inverse of the system of predicted market shares corresponding to the demand model.45 These mean utilities are then substituted into the first stage of demand estimation.46 In a second stage, the mean response parameters are again obtained using the projection (76) and instrumental variables to correct for the endogeneity of prices.

When aggregate market share data are unavailable, Petrin and Train (2010) propose an alternative "control function" approach. On the supply side, prices are again specified in reduced form as in (75). On the demand side, consumers make choices in each market t to maximize their choice-specific utility

V_ijt = v_j(w_t, p_t; θ) + ε_ijt
V_{i,J+1,t} = ε_{i,J+1,t}

where the utility shocks to the j = 1, ..., J products can be decomposed as follows:

ε_ijt = ε¹_ijt + ε²_ijt

where (ε¹_ijt, ζ_t) ∼ N(0, Σ) and ε²_ijt ∼ i.i.d. F(ε). We can then re-write the choice-specific utility as:

V_ijt = v_j(w_t, p_t; θ) + λζ_jt + ση_jt + ε²_ijt,  j = 1, ..., J  (78)

where η_jt ∼ N(0, 1). Estimation is then conducted in two steps. The first stage consists of the price regression based on Eq. (75). The second stage consists of estimating the choice probabilities corresponding to (78) using the control function, λζ_jt, for alternatives j = 1, ..., J, with the parameter λ to be estimated. In an application to household choices between satellite and cable television content suppliers, Petrin and Train (2010) find that the control function in (78) generates demand estimates comparable to those obtained using the more computationally and data-intensive BLP approach based on (77).

45 See Berry (1994) and Berry et al. (2013) for the necessary and sufficient conditions required for the demand system to be invertible.
46 Chintagunta and Dubé (2005) estimate the parameters characterizing unobserved heterogeneity in this first stage.
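A minimal two-step control-function sketch in the spirit of the procedure just described (hypothetical DGP, names, and values): the first stage regresses price on an excluded shifter, and the fitted residual enters the second-stage logit as the control.

```python
import math, random

random.seed(2)

# DGP: the price residual zeta also shifts utility, so a logit that ignores
# it is biased; including the first-stage residual repairs this.
T, N = 300, 30
psi_t, alpha_t, lam_t = 1.0, 2.0, 1.5
obs = []                                   # (p, b, y)
for _ in range(T):
    b = random.random()                    # excluded price shifter
    zeta = random.gauss(0, 0.5)            # price shock, correlated with demand
    p = 1.0 + 0.8 * b + zeta
    for _ in range(N):
        v = psi_t - alpha_t * p + lam_t * zeta
        y = 1 if random.random() < 1 / (1 + math.exp(-v)) else 0
        obs.append((p, b, y))

# Stage 1: OLS of price on the instrument; keep the residuals zeta_hat.
n = len(obs)
mp = sum(o[0] for o in obs) / n
mb = sum(o[1] for o in obs) / n
beta = sum((o[1] - mb) * (o[0] - mp) for o in obs) / sum((o[1] - mb) ** 2 for o in obs)
zeta_hat = [o[0] - (mp + beta * (o[1] - mb)) for o in obs]

def fit(features, ys, iters=30):
    """Multi-parameter logit MLE by Newton-Raphson with a tiny linear solver."""
    k = len(features[0])
    th = [0.0] * k
    for _ in range(iters):
        g = [0.0] * k
        H = [[0.0] * k for _ in range(k)]
        for x, y in zip(features, ys):
            z = max(-30.0, min(30.0, sum(t * xi for t, xi in zip(th, x))))
            pr = 1 / (1 + math.exp(-z))
            w = pr * (1 - pr)
            for i in range(k):
                g[i] += (y - pr) * x[i]
                for j in range(k):
                    H[i][j] += w * x[i] * x[j]
        A = [row[:] + [g[i]] for i, row in enumerate(H)]   # solve H d = g
        for c in range(k):
            piv = max(range(c, k), key=lambda r: abs(A[r][c]))
            A[c], A[piv] = A[piv], A[c]
            for r in range(k):
                if r != c:
                    f = A[r][c] / A[c][c]
                    A[r] = [u - f * v for u, v in zip(A[r], A[c])]
        th = [t + A[i][k] / A[i][i] for i, t in enumerate(th)]
    return th

ys = [o[2] for o in obs]
naive = fit([[1.0, -o[0]] for o in obs], ys)                    # no control
cf = fit([[1.0, -o[0], z] for o, z in zip(obs, zeta_hat)], ys)  # with control
```

The naive logit understates the price coefficient, while adding the control ζ̂ recovers it up to the usual scale normalization; a full application would also carry the ση term and account for first-stage sampling error.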

5.3 Behavioral economics

The literature on behavioral economics has created an emerging area for microeconometric models of demand. This research typically starts with surprising or puzzling moments in the data that would be difficult to fit using standard neoclassical models. In this section, we look at two specific topics: the fungibility of income and social preferences. For a broader discussion of structural models of behavioral economics, see DellaVigna (2017). The pursuit of ways to incorporate more findings from the behavioral economics literature into quantitative models of demand seems like a fertile area for future research.47

5.3.1 The fungibility of income

Building on the discussion of income effects from Section 4.1, the mental accounting literature offers a more nuanced theory of income effects whereby individuals bracket different sources of income into mental accounts out of which they have different marginal propensities to consume (Thaler, 1985, 1999). Recent field studies have also found evidence of bracketing. In an in-store coupon field experiment involving an unanticipated coupon for a planned purchase, Heilman et al. (2002) find that coupons cause more unplanned purchases of products that are related to the couponed item.48 Milkman and Beshears (2009) find that the incremental online consumer grocery purchases due to coupons are for non-typical items. Similarly, Hastings and Shapiro (2013) observe a much smaller cross-sectional correlation between household income and gasoline quality choice than the inter-temporal correlation between the gasoline price level and gasoline quality choice. In related work, Hastings and Shapiro (2018) find that the income-elasticity of SNAP49-eligible food demand is much higher with respect to SNAP benefits than with respect to cash. Each of these examples is consistent with consumers perceiving money budgeted for a product category differently from "cash."

Hastings and Shapiro (2018) test the non-fungibility of income more formally using a demand model with income effects. Consider the bivariate utility over a commodity group, with J perfect-substitute products, and a J + 1 essential numeraire, with quadratic utility50

U(x) = ∑_{j=1}^{J} ψ_j x_j + ψ_{J+1,1} x_{J+1} − (1/2) ψ_{J+1,2} x²_{J+1}

where ψ_j = exp(ψ̄_j + ε_j). In the application, the goods consist of different quality grades of gasoline. In this model, income effects only arise through the allocation of the budget between the gasoline commodity group and the essential numeraire. WLOG, if product k is the preferred good and, hence, p_k/ψ_k = min{p_j/ψ_j}_{j=1}^{J}, then the KKT conditions are

ψ_k − ψ_{J+1,1} p_k + ψ_{J+1,2}(y − x_k p_k) p_k ≤ 0.

47 The empirical consumption literature has a long tradition of testing the extent to which consumer demand conforms with rationality by testing the integrability constraints associated with utility maximization (e.g., Lewbel, 2001; Hoderlein, 2011).
48 Lab evidence has also confirmed that consumers are much more likely to spend store gift card money on products associated with the brand of the card than unbranded gift card money (e.g., American Express), suggesting that store gift card money is not fungible with cash (Reinholtz et al., 2015).
49 SNAP refers to the Supplemental Nutrition Assistance Program, or "food stamps."


Estimation of this model follows from Section 3.2. A simple test of fungibility consists of re-writing the KKT conditions with a different marginal utility on budget income and on commodity expenditure,

ψ_k/p_k − ψ_{J+1,1} + ψ_{J+1,y} y − ψ_{J+1,x} x_k p_k ≤ 0,

and testing the hypothesis H0: ψ_{J+1,y} = ψ_{J+1,x}. The identification of this test relies on variation in both observed consumer income, y, and in prices, p.

What is missing in this line of research is a set of primitive assumptions in the microeconomic model that leads to this categorization of different sources of purchasing power. An interesting direction for future research will consist of digging deeper into the underlying sources of the mental accounting and how/whether it changes our basic microeconomic model. For instance, perhaps categorization creates a multiplicity of budget constraints in the basic model, both financial and perceptual.
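A quick numerical illustration of the model above (parameter values hypothetical): the preferred grade k minimizes the quality-adjusted price p_j/ψ_j, and at an interior optimum the KKT condition holds with equality, which solves to x_k = (ψ_k/p_k − ψ_{J+1,1} + ψ_{J+1,2}·y)/(ψ_{J+1,2}·p_k).

```python
# Quadratic-utility gasoline sketch: pick the preferred grade, then solve the
# KKT condition psi_k - psi1*p_k + psi2*(y - x_k*p_k)*p_k = 0 for x_k.
def gas_demand(psis, prices, y, psi1, psi2):
    k = min(range(len(psis)), key=lambda j: prices[j] / psis[j])
    pk, psk = prices[k], psis[k]
    xk = (psk / pk - psi1 + psi2 * y) / (psi2 * pk)   # interior solution
    return k, max(0.0, xk)                            # corner at zero otherwise

psis, prices = [1.0, 1.3, 1.5], [2.0, 2.5, 3.2]       # three quality grades
k, x_lo = gas_demand(psis, prices, y=40.0, psi1=1.2, psi2=0.02)
_, x_hi = gas_demand(psis, prices, y=60.0, psi1=1.2, psi2=0.02)
```

Under this specification, the numeraire allocation is pinned down by the preferred grade's quality-adjusted price (here y − p_k·x_k = 34 at both income levels), so marginal income flows entirely into the commodity group; the fungibility test then asks whether measured income y and commodity expenditure x_k·p_k enter this margin with a common coefficient, as H0 requires.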

5.3.2 Social preferences

As discussed in the survey by DellaVigna (2017), there is a large literature that has estimated social preferences in lab experiments. We focus herein specifically on the role of consumers' social preferences and their responses to cause marketing campaigns involving charitable giving. A dominant theme of this literature has been testing whether consumer response to charitable-giving campaigns reflects genuine altruistic preferences versus alternative impure-altruism and/or self-interest motives.

In a pioneering study, DellaVigna et al. (2012) conducted a door-to-door fundraising campaign to test the extent to which charitable giving is driven by a genuine preference to give (altruism or warm glow) versus a disutility from declining to give due, for instance, to social pressure. The field data are then used to estimate a structural model of individuals' utility from giving that separates altruism and social pressure. Formally, total charitable giving, x, consists of the sum of dollars donated to the charitable campaign either directly to the door-to-door solicitor, x_1, or, if the donor is not home at the time of the visit, through a private donation, x_2, made by mail at an additional cost (1 − θ)x_2 ≥ 0 for postage, envelope, etc. All remaining wealth is spent on an essential numeraire, x_3, which captures all other private consumption. Prospective donors have a quasi-linear, bivariate utility over other consumption and charitable giving

U(x_1, x_2) = y − x_1 − x_2 + Ũ(x_1 + θx_2) − s(x_1).  (81)

50 Hastings and Shapiro (2013) treat quantities as exogenous and instead focus on the multinomial discrete choice problem between different goods, which are qualities of gasoline.


To ensure that the sub-utility over giving, Ũ(x) (or "altruism" utility), has the usual monotonicity and concavity properties, we assume Ũ(x) = ψ log(γ + x), where ψ is an altruism parameter and γ > 0 influences the degree of concavity.51 By allowing ψ to vary freely, the model captures the possibility of a donor who dislikes the charity. The third term in (81), s(x), represents the social cost of declining to donate or of giving only a small donation to the solicitor. We assume s(x) = max(0, s·(g − x)) to capture the notion that the donor only incurs social pressure from donation amounts to the solicitor of less than g.

To identify the social preferences, DellaVigna et al. (2012) randomize subjects into several groups. In the first group, the solicitor shows up unannounced at the prospective donor's door. In this case, if the donor is home (with exogenous probability h_0 ∈ (0, 1)), she always prefers to give directly to the solicitor to avoid the additional cost (1 − θ) of donating by mail. The total amount given depends on the relative magnitudes of ψ and the social cost s. If the donor is not home, the only reason for her to donate via mail is altruism. In the second group, the prospective donor is notified in advance of the solicitor's visit by a flyer left on the door. In this case, the donor can opt out by adjusting her probability of being home according to a cost c(h − h_0) = (h − h_0)²/(2η). The opt-out decision reflects the donor's trade-off between the utility of donating to the solicitor, subject to social pressure costs, and donating by mail, subject to mailing costs and the cost of leaving home. In a third group, subjects are given a costless option to opt out by checking a "do not disturb" box on the flyer, effectively setting c(0) = 0.
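The role of the social-pressure term can be seen in a small numerical sketch of the give-at-the-door problem implied by (81), dropping constants and the mail channel; all parameter values are hypothetical.

```python
import math

# u(x) = -x + psi * log(gamma + x) - max(0, s * (g - x)), maximized on a grid.
def best_donation(psi, gamma, s, g, grid=2001, xmax=20.0):
    xs = [xmax * i / (grid - 1) for i in range(grid)]
    u = lambda x: -x + psi * math.log(gamma + x) - max(0.0, s * (g - x))
    return max(xs, key=u)

no_pressure = best_donation(psi=1.0, gamma=2.0, s=0.0, g=5.0)
pressure = best_donation(psi=1.0, gamma=2.0, s=0.6, g=5.0)
```

With ψ/(γ + x) < 1 everywhere, this donor gives nothing when unpressured; adding the social cost induces a small donation strictly below the threshold g (the interior condition ψ/(γ + x) = 1 − s gives x = 0.5 here), and her utility at that point is lower than if she had never been solicited, which is the welfare-loss point emphasized below.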
The authors estimate the model with a minimum-distance estimator based on specific empirical moments from the various experimental cells, although a maximum likelihood procedure could also have been used by including an additional random utility term in the model. While DellaVigna et al.'s (2012) estimates indicate that donations are driven by both social costs and altruism, the social cost estimates are surprisingly large. Almost half of the sample is found to prefer not to receive a solicitation, either because they prefer not to donate or to donate only a small amount. The

51 Note that the marginal utility of giving is dŨ/dx = ψ/(γ + x), so that a high γ implies slow satiation on giving.




results suggest a substantial welfare loss to donors from door-to-door solicitations. Moreover, the results indicate that the observed levels of charitable giving may not reflect altruism per se.

Kang et al. (2016) build on DellaVigna et al. (2012) by modeling the use of advertising creative to moderate the potential crowding-out effects of donations by others. They estimate a modified version of (81) for a prospective donor,

U(x; G, θ) = θ_1 ln(x + 1) + θ_2 ln(G) + θ_3 ln(y − x + 1)  (82)


where G = x + x_{−i} measures total giving to the cause and x_{−i} represents the total stock of past donations from other donors. The authors can then test pure altruism, θ_1 = 0 and θ_2 > 0, versus a combination of altruism and warm glow, θ_1 > 0 and θ_2 > 0 (Andreoni, 1989). The authors allow the relative role of warm glow to altruism, θ_1/θ_2, to vary with several marketing-message variables. Since the preferences in (82) follow the Stone-Geary functional form, demand estimation follows the approach in Section 3.1.3 above. The authors conduct a charitable-giving experiment in which subjects were randomly assigned to cells that varied the emotional appeal of the advertising message and also varied the reported amount of money donated by others. As in earlier work, the authors find that higher donations by others crowd out a prospective donor's contribution (Andreoni, 1989). The authors also find that recipient-focused advertising messages with higher arousal trigger the impure-altruism appeal, which increases the level of donations.

Dubé et al. (2017b) test an alternative self-signaling theory of crowding-out effects in charitable giving, based on consumers' self-perceptions of their altruistic preferences (Bodner and Prelec, 2002; Benabou and Tirole, 2006). Consumers make a binary purchase decision x ∈ {0, 1} for a product with a price, p, and a pro-social characteristic, a ≥ 0, that measures the portion of the price that will be donated to a specific charity. Consumers obtain consumption utility from buying the product, (θ_0 + θ_1 a + θ_2 p), where θ_1 is the consumer's social preference, or marginal utility from the donation. Consumers make the purchase in a private setting (e.g., online or on a mobile phone) with no peer influence (e.g., a salesperson or solicitor).
In addition to the usual consumption utility, the consumer is uncertain about her own altruism and derives additional ego utility from the inference she makes about herself based on her purchase decision, θ_3 E(θ_1|a, p, x), where θ_3 measures the consumer's marginal ego utility.52 The consumer chooses to buy if the combination of her consumption utility and ego utility exceeds the ego utility derived from not purchasing:

(θ_0 + θ_1 a + θ_2 p + ε_1) + θ_3 E(θ_1|a, p, 1) > ε_0 + θ_3 E(θ_1|a, p, 0)  (83)

where the ε are choice-specific random utility shocks and θ_3 is the marginal ego utility associated with the consumer's self-belief about her own altruism, θ_1. In this self-signaling model, the consumer's decision is driven not only by the maximization of consumption utility, but also by the equilibrium signal the consumer derives from her own action. Purchase and non-purchase have differential influences on the consumer's derived inference and, hence, her ego utility, E(θ_1|a, p, x).53

If ε̃ = ε_1 − ε_0 ∼ N(0, σ²), then consumer choice follows the standard random coefficients probit model of demand, with purchase probability conditional on receiving the offer (a, p)

Pr(x = 1|a, p) = ∫ Φ([θ_0 + θ_1 a + θ_2 p + θ_3 (E(θ_1|a, p, 1) − E(θ_1|a, p, 0))]/σ) dF(θ)  (84)

52 In a social setting, this ego utility could instead reflect the value a consumer derives from conveying a "favorable impression" (i.e., signal) to her peers based on her observed action.


where F(θ) represents the consumer's beliefs about her own preferences prior to receiving the ticket offer. Note that low prices can dampen the consumer's self-perception of being altruistic, E(θ_1|a, p, 0), and thereby reduce ego utility. If ego utility overwhelms consumption utility, consumer demand could exhibit backward-bending regions that would be inconsistent with the standard neoclassical framework.

Dubé et al. (2017b) test the self-signaling theory through a cause marketing field experiment in partnership with a large telecom company and a movie theater. Subjects received text messages with randomly-assigned actual discount offers for movie tickets. In addition, some subjects were informed that a randomized portion of the ticket price would be donated to a charity. In the absence of a donation, demand is decreasing in the net price. In the absence of a discount, demand is increasing in the donation amount. However, when the firm uses both a discount and a donation, the observed demand exhibits regions of non-monotonicity, where the purchase rate declines at larger discount levels. These non-standard moments are used to fit the self-signaling model above in Eq. (84). The authors find that consumer response to the cause marketing campaign is driven more by ego utility, θ_3, than by standard consumption utility.
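The self-signaling equilibrium can be sketched numerically in a simplified form: a cutoff equilibrium with the random utility shock suppressed in the signaling layer, truncated-normal posterior means under a N(μ, τ²) self-belief, and the ego-utility weight θ_3 deliberately kept below a so that the indifference condition is monotone and the equilibrium unique. All names and values are hypothetical:

```python
import math

norm_pdf = lambda z: math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
norm_cdf = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))

def cutoff(theta0, theta2, theta3, a, p, mu, tau):
    """Cutoff c: the type theta1 = c is indifferent between buying and not."""
    def excess(c):
        z = (c - mu) / tau
        e_buy = mu + tau * norm_pdf(z) / max(1 - norm_cdf(z), 1e-12)  # E(theta1 | buy)
        e_not = mu - tau * norm_pdf(z) / max(norm_cdf(z), 1e-12)      # E(theta1 | no buy)
        return theta0 + a * c + theta2 * p + theta3 * (e_buy - e_not)
    lo, hi = mu - 8 * tau, mu + 8 * tau
    for _ in range(100):                   # bisection; excess is increasing in c
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if excess(mid) > 0 else (mid, hi)
    return (lo + hi) / 2

def buy_prob(p, a=0.5, theta0=-0.2, theta2=-1.0, theta3=0.4, mu=1.0, tau=1.0):
    """Purchase probability Pr(theta1 > c) under the consumer's prior beliefs."""
    c = cutoff(theta0, theta2, theta3, a, p, mu, tau)
    return 1 - norm_cdf((c - mu) / tau)
```

Here the truncated means play the role of E(θ_1|a, p, 1) and E(θ_1|a, p, 0) in (84). In this tame parameterization demand remains downward-sloping; the backward-bending case arises when θ_3 is large enough for the signaling term to dominate, which is the phenomenon documented in the experiment.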

6 Conclusions

Historically, the computational complexity of microeconometric models has limited their application to consumer-level transaction data. Most of the literature has focused on models of discrete brand choice, ignoring the more complicated aspects of demand for variety and purchase quantity decisions. Recent advances in computing power have mostly eliminated these computational challenges.

While much of the foundational work on microeconometric models of demand was based on the dual approach, the recent literature has seen a lot of innovation

53 As in Benabou and Tirole (2006), Dubé et al. (2017b) also include E(θ2|a, p, x) in the ego utility to moderate the posterior belief by the consumer's self-perception of being sensitive to money.



CHAPTER 1 Microeconometric models of consumer demand

on direct models of utility. The dual approach is limiting for marketing applications because it abstracts from the actual form of “preferences” and requires strong assumptions, like differentiability, to apply Roy's Identity. That said, many of the model specifications discussed herein require strong restrictions on preferences for analytic tractability, especially in the handling of corner solutions. These restrictions often rule out interesting and important aspects of consumer behavior such as income effects, product complementarity, and indivisibility. We view the development of models to accommodate these richer behaviors as important directions for future research.

We also believe that the incorporation of ideas from behavioral economics and psychology into consumer models of demand will be a fruitful area for future research. Several recent papers have incorporated social preferences into traditional models of demand (e.g., DellaVigna et al., 2012; Kang et al., 2016; Dubé et al., 2017b). For a broader discussion of structural models of behavioral economics, see DellaVigna (2017).

Finally, the digital era has expanded the scope of consumer-level data available. These new databases introduce a new layer of complexity as the set of observable consumer features grows, sometimes into the thousands. Machine learning (ML) and regularization techniques offer potential opportunities for accommodating large quantities of potential variables into microeconometric models of demand. For instance, these methods may provide practical tools for analyzing heterogeneity in consumer tastes and behavior, and detecting segments. Future work may benefit from developing approaches to incorporate ML into the already-computationally-intensive empirical demand models with corners.
Devising approaches to conduct inference on structural models that utilize machine learning techniques will also likely offer an interesting opportunity for new research (e.g., Shiller, 2015 and Dubé and Misra, 2019). This growing complexity due to indivisibilities, non-standard consumer behavior from the behavioral economics literature, and the size and scope of so-called “Big Data” raise some concerns about the continued practicality of the neoclassical framework for future research.
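To make the regularization idea concrete, here is a minimal sketch of a ridge-penalized logit demand model in which only a few of many observed consumer features actually shift purchase utility. Everything here, including the simulated data-generating process and penalty levels, is hypothetical; it is not a method from the chapter.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical purchase panel: 500 consumers, 50 observed features, of which
# only the first 3 actually shift purchase utility (an invented DGP).
n, k = 500, 50
X = rng.normal(size=(n, k))
beta_true = np.zeros(k)
beta_true[:3] = [1.0, -0.8, 0.5]
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

def penalized_nll(beta, lam):
    """Logistic negative log-likelihood plus an L2 (ridge) penalty."""
    u = X @ beta
    return np.sum(np.logaddexp(0.0, u) - y * u) + lam * np.sum(beta ** 2)

def fit(lam):
    return minimize(penalized_nll, np.zeros(k), args=(lam,), method="BFGS").x

beta_light = fit(0.1)   # mild shrinkage
beta_heavy = fit(50.0)  # heavy shrinkage pulls irrelevant coefficients toward zero
print(np.linalg.norm(beta_light), np.linalg.norm(beta_heavy))
```

An L1 (lasso) penalty would set irrelevant coefficients exactly to zero and is the more common segmentation device; ridge is used here only because it keeps the objective smooth for a generic optimizer.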

References

Aguiar, M., Hurst, E., 2007. Life-cycle prices and production. American Economic Review 97 (5), 1533–1559.
Ainslie, A., Rossi, P.E., 1998. Similarities in choice behavior across product categories. Marketing Science 17 (2), 91–106.
Allcott, H., Diamond, R., Dube, J., Handbury, J., Rahkovsky, I., Schnell, M., 2018. Food Deserts and the Causes of Nutritional Inequality. Working Paper.
Allenby, G., Garratt, M.J., Rossi, P.E., 2010. A model for trade-up and change in considered brands. Marketing Science 29 (1), 40–56.
Allenby, G.M., Rossi, P.E., 1991. Quality perceptions and asymmetric switching between brands. Marketing Science 10 (3), 185–204.
Allenby, G.M., Shively, T.S., Yang, S., Garratt, M.J., 2004. A choice model for packaged goods: dealing with discrete quantities and quantity discounts. Marketing Science 23 (1), 95–108.
Anderson, S.P., de Palma, A., 1992. The logit as a model of product differentiation. Oxford Economic Papers 44, 51–67.
Anderson, S.P., de Palma, A., Thisse, J.-F., 1992. Discrete Choice Theory of Product Differentiation. The MIT Press.
Andreoni, J., 1989. Giving with impure altruism: applications to charity and Ricardian equivalence. Journal of Political Economy 97, 1447–1458.
Arora, N.A., Allenby, G., Ginter, J.L., 1998. A hierarchical Bayes model of primary and secondary demand. Marketing Science 17, 29–44.
Becker, G.S., 1965. A theory of the allocation of time. The Economic Journal 75 (299), 493–517.
Bell, D.R., Chiang, J., Padmanabhan, V., 1999. The decomposition of promotional response: an empirical generalization. Marketing Science 18 (4), 504–526.
Benabou, R., Tirole, J., 2006. Incentives and prosocial behavior. American Economic Review 96, 1652–1678.
Berry, S., Gandhi, A., Haile, P., 2013. Connected substitutes and invertibility of demand. Econometrica 81 (5), 2087–2111.
Berry, S., Levinsohn, J., Pakes, A., 1995. Automobile prices in market equilibrium. Econometrica 63 (4), 841–890.
Berry, S.T., 1994. Estimating discrete-choice models of product differentiation. Rand Journal of Economics 25 (2), 242–262.
Besanko, D., Perry, M.K., Spady, R.H., 1990. The logit model of monopolistic competition: brand diversity. Journal of Industrial Economics 38 (4), 397–415.
Bhat, C.R., 2005. A multiple discrete-continuous extreme value model: formulation and application to discretionary time-use decisions. Transportation Research, Part B 39, 679–707.
Bhat, C.R., 2008. The multiple discrete-continuous extreme value (MDCEV) model: role of utility function parameters, identification considerations, and model extensions. Transportation Research, Part B 42, 274–303.
Bhat, C.R., Castro, M., Pinjari, A.R., 2015. Allowing for complementarity and rich substitution patterns in multiple discrete-continuous models. Transportation Research, Part B 81, 59–77.
Blattberg, R., Buesing, T., Peacock, P., Sen, S., 1978. Identifying the deal prone segment. Journal of Marketing Research 15 (3), 369–377.
Blattberg, R.C., Briesch, R., Fox, E.J., 1995. How promotions work. Marketing Science 14 (3), G122–G132.
Blattberg, R.C., Eppen, G.D., Lieberman, J., 1981. A theoretical and empirical evaluation of price deals for consumer nondurables. Journal of Marketing 45, 116–129.
Blattberg, R.C., Neslin, S.A., 1989. Sales promotion: the long and short of it. Marketing Letters 1 (1), 81–97.
Blattberg, R.C., Wisniewski, K.J., 1989. Price-induced patterns of competition. Marketing Science 8, 291–309.
Blundell, R., Pashardes, P., Weber, G., 1993. What do we learn about consumer demand patterns from micro data? American Economic Review 83 (3), 570–597.
Bodner, R., Prelec, D., 2002. Self-signaling and diagnostic utility in everyday decision making. In: Collected Essays in Psychology and Economics. Oxford University Press.
Bronnenberg, B.J., Dhar, S.K., Dubé, J.-P., 2005. Market structure and the geographic distribution of brand shares in consumer package goods industries. Manuscript.
Bronnenberg, B.J., Dubé, J.-P., 2017. The formation of consumer brand preferences. Annual Review of Economics 9, 353–382.
Cardell, N.S., 1997. Variance components structures for the extreme-value and logistic distributions with applications to models of heterogeneity. Econometric Theory 13 (2), 185–213.
Chambers, C.P., Echenique, F., 2009. Supermodularity and preferences. Journal of Economic Theory 144, 1004–1014.
Chambers, C.P., Echenique, F., Shmaya, E., 2010. On behavioral complementarity and its implications. Journal of Economic Theory 145 (6), 2332–2355.
Chiang, J., 1991. A simultaneous approach to the whether, what and how much to buy questions. Marketing Science 10, 297–315.
Chiang, J., Lee, L.-F., 1992. Discrete/continuous models of consumer demand with binding nonnegativity constraints. Journal of Econometrics 54, 79–93.
Chib, S., Seetharaman, P., Strijnev, A., 2002. Analysis of multi-category purchase incidence decisions using IRI market basket data. In: Advances in Econometrics: Econometric Models in Marketing. JAI Press, pp. 57–92.
Chintagunta, P., 1993. Investigating purchase incidence, brand choice and purchase quantity decisions of households. Marketing Science 12, 184–208.
Chintagunta, P., Dubé, J.-P., 2005. Estimating a stockkeeping-unit-level brand choice model that combines household panel data and store data. Journal of Marketing Research XLII (August), 368–379.
Chintagunta, P., Dubé, J.-P., Goh, K.Y., 2005. Beyond the endogeneity bias: the effect of unmeasured brand characteristics on household-level brand choice models. Management Science 51, 832–849.
Conley, T.G., Hansen, C.B., McCulloch, R.E., Rossi, P.E., 2008. A semi-parametric Bayesian approach to the instrumental variable problem. Journal of Econometrics 144 (1), 276–305.
Deaton, A., Muellbauer, J., 1980a. An almost ideal demand system. American Economic Review 70 (3), 312–326.
Deaton, A., Muellbauer, J., 1980b. Economics and Consumer Behavior. Cambridge University Press.
DellaVigna, S., 2017. Structural behavioral economics. In: Handbook of Behavioral Economics. North-Holland. Forthcoming.
DellaVigna, S., List, J., Malmendier, U., 2012. Testing for altruism and social pressure in charitable giving. Quarterly Journal of Economics 127 (1), 1–56.
Dolan, R.J., 1987. Quantity discounts: managerial issues and research opportunities. Marketing Science 6 (1), 1–22.
Dotson, J.P., Howell, J.R., Brazell, J.D., Otter, T., Lenk, P.J., MacEachern, S., Allenby, G., 2018. A probit model with structured covariance for similarity effects and source of volume calculations. Journal of Marketing Research 55, 35–47.
Draganska, M., Jain, D.C., 2004. A likelihood approach to estimating market equilibrium models. Management Science 50 (5), 605–616.
Du, R.Y., Kamakura, W.A., 2008. Where did all that money go? Understanding how consumers allocate their consumption budget. Journal of Marketing 72 (November), 109–131.
Dubé, J.-P., 2004. Multiple discreteness and product differentiation: demand for carbonated soft drinks. Marketing Science 23 (1), 66–81.
Dubé, J.-P., Hitsch, G., Rossi, P., 2017a. Income and wealth effects on private label demand: evidence from the great recession. Marketing Science. Forthcoming.
Dubé, J.-P., Luo, X., Fang, Z., 2017b. Self-signaling and pro-social behavior: a cause marketing mobile field experiment. Marketing Science 36 (2), 161–186.
Dubé, J.-P., Misra, S., 2019. Personalized Pricing and Customer Welfare. Chicago Booth School of Business Working Paper.
Dubois, P., Griffith, R., Nevo, A., 2014. Do prices and attributes explain international differences in food purchases? American Economic Review 104 (3), 832–867.
Einav, L., Leibtag, E., Nevo, A., 2010. Recording discrepancies in Nielsen Homescan data: are they present and do they matter? Quantitative Marketing and Economics 8 (2), 207–239.
Engel, E., 1857. Die Productions- und Consumtionsverhältnisse des Königreichs Sachsen. Zeitschrift des Statistischen Bureaus des Königlich Sächsischen Ministeriums des Innern 8, 1–54.
Erdem, T., 1998. An empirical analysis of umbrella branding. Journal of Marketing Research 35 (3), 339–351.
Erdem, T., Imai, S., Keane, M.P., 2003. Brand and quantity choice dynamics under price uncertainty. Quantitative Marketing and Economics 1, 5–64.
Erdem, T., Keane, M.P., Öncü, T.S., Strebel, J., 2005. Learning about computers: an analysis of information search and technology choice. Quantitative Marketing and Economics 3, 207–246.
Erdem, T., Keane, M.P., Sun, B.-H., 1999. Missing price and coupon availability data in scanner panels: correcting for the self-selection bias in choice model parameters. Journal of Econometrics 89, 177–196.
Gentzkow, M., 2007. Valuing new goods in a model with complementarity: online newspapers. American Economic Review 97 (3), 713–744.
Gicheva, D., Hastings, J., Villas-Boas, S.B., 2010. Investigating income effects in scanner data: do gasoline prices affect grocery purchases? American Economic Review: Papers and Proceedings 100, 480–484.
Goettler, R.L., Clay, K., 2011. Tariff choice with consumer learning and switching costs. Journal of Marketing Research XLVIII (August), 633–652.
Goolsbee, A., Petrin, A., 2004. The consumer gains from direct broadcast satellites and the competition with cable TV. Econometrica 72 (2), 351–381.
Guadagni, P.M., Little, J.D., 1983. A logit model of brand choice calibrated on scanner data. Marketing Science 2, 203–238.
Gupta, S., 1991. Stochastic models of interpurchase time with time-dependent covariates. Journal of Marketing Research 28, 1–15.
Gupta, S., Chintagunta, P., Kaul, A., Wittink, D.R., 1996. Do household scanner data provide representative inferences from brand choices: a comparison with store data. Journal of Marketing Research 33 (4), 383–398.
Hanemann, W.M., 1984. Discrete/continuous models of consumer demand. Econometrica 52 (3), 541–561.
Hartmann, W.R., Nair, H.S., 2010. Retail competition and the dynamics of demand for tied goods. Marketing Science 29 (2), 366–386.
Hastings, J., Shapiro, J.M., 2013. Fungibility and consumer choice: evidence from commodity price shocks. Quarterly Journal of Economics 128 (4), 1449–1498.
Hastings, J., Shapiro, J.M., 2018. How are SNAP benefits spent? Evidence from a retail panel. American Economic Review 108 (12), 3493–3540.
Hausman, J.A., 1985. The econometrics of nonlinear budget sets. Econometrica 53 (6), 1255–1282.
Heckman, J.J., 1978. Dummy endogenous variables in a simultaneous equation system. Econometrica 46 (4), 931–959.
Heckman, J.J., 1981. The incidental parameters problem and the problem of initial conditions in estimating a discrete time-discrete data stochastic process and some Monte Carlo evidence. In: Manski, C., McFadden, D. (Eds.), Structural Analysis of Discrete Data with Econometric Applications. MIT Press, pp. 179–195 (Chap. 4).
Heilman, C.M., Nakamoto, K., Rao, A.G., 2002. Pleasant surprises: consumer response to unexpected in-store coupons. Journal of Marketing Research 39 (2), 242–252.
Hendel, I., 1999. Estimating multiple-discrete choice models: an application to computerization returns. Review of Economic Studies 66, 423–446.
Hendel, I., Nevo, A., 2003. The post-promotion dip puzzle: what do the data have to say? Quantitative Marketing and Economics 1, 409–424.
Hendel, I., Nevo, A., 2006a. Measuring the implications of sales and consumer inventory behavior. Econometrica 74 (6), 1637–1673.
Hendel, I., Nevo, A., 2006b. Sales and consumer inventory. Rand Journal of Economics 37 (3), 543–561.
Hicks, J., Allen, R., 1934. A reconsideration of the theory of value. Part I. Economica 1 (1), 52–76.
Hoderlein, S., 2011. How many consumers are rational? Journal of Econometrics 164, 294–309.
Honka, E., 2014. Quantifying search and switching costs in the U.S. auto insurance industry. Rand Journal of Economics 45, 847–884.
Houthakker, H., 1953. La forme des courbes. Cahiers du Seminaire d'Econometrie 2, 59–66.
Houthakker, H., 1961. The present state of consumption theory. Econometrica 29 (4), 704–740.
Howell, J., Allenby, G., 2017. Choice Models with Fixed Costs. Working Paper.
Howell, J., Lee, S., Allenby, G., 2016. Price promotions in choice models. Marketing Science 35 (2), 319–334.
Joo, J., 2018. Quantity Surcharged Larger Package Sales as Rationally Inattentive Consumers' Choice. University of Texas at Dallas Working Paper.
Kang, M.Y., Park, B., Lee, S., Kim, J., Allenby, G., 2016. Economic analysis of charitable donations. Journal of Marketing and Consumer Behaviour in Emerging Markets 2 (4), 40–57.
Kao, C., Lee, L.-F., Pitt, M.M., 2001. Simulated maximum likelihood estimation of the linear expenditure system with binding non-negativity constraints. Annals of Economics and Finance 2, 215–235.
Kim, J., Allenby, G.M., Rossi, P.E., 2002. Modeling consumer demand for variety. Marketing Science 21 (3), 229–250.
Kim, J., Allenby, G.M., Rossi, P.E., 2007. Product attributes and models of multiple discreteness. Journal of Econometrics 138, 208–230.
Lambrecht, A., Seim, K., Skiera, B., 2007. Does uncertainty matter? Consumer behavior under three-part tariffs. Marketing Science 26 (5), 698–710.
Lee, J., Allenby, G., 2009. A Direct Utility Model for Market Basket Data. OSU Working Paper.
Lee, L.-F., Pitt, M.M., 1986. Microeconometric demand systems with binding nonnegativity constraints: the dual approach. Econometrica 54 (5), 1237–1242.
Lee, R.S., 2013. Vertical integration and exclusivity in platform and two-sided markets. American Economic Review 103 (7), 2960–3000.
Lee, S., Allenby, G., 2014. Modeling indivisible demand. Marketing Science 33 (3), 364–381.
Lee, S., Kim, J., Allenby, G.M., 2013. A direct utility model for asymmetric complements. Marketing Science 32 (3), 454–470.
Lewbel, A.A., 2001. Demand systems with and without errors. American Economic Review 91 (3), 611–618.
Liu, Y., Balachander, S., 2014. How long has it been since the last deal? Consumer promotion timing expectations and promotional response. Quantitative Marketing and Economics 12, 85–126.
Luce, R.D., 1977. The choice axiom after twenty years. Journal of Mathematical Psychology 15, 215–233.
Ma, Y., Ailawadi, K.L., Gauri, D.K., Grewal, D., 2011. An empirical investigation of the impact of gasoline prices on grocery shopping behavior. Journal of Marketing 75 (2), 18–35.
Manchanda, P., Ansari, A., Gupta, S., 1999. The “shopping basket”: a model for multicategory purchase incidence decisions. Marketing Science 18, 95–114.
Manchanda, P., Rossi, P.E., Chintagunta, P., 2004. Response modeling with nonrandom marketing-mix variables. Journal of Marketing Research 41 (4), 467–478.
Manski, C.F., Sherman, L., 1980. An empirical analysis of household choice among motor vehicles. Transportation Research 14 (A), 349–366.
Mas-Colell, A., Whinston, M.D., Green, J.R., 1995. Microeconomic Theory. Oxford University Press.
Matejka, F., McKay, A., 2015. Rational inattention to discrete choices: a new foundation for the multinomial logit model. American Economic Review 105 (1), 272–298.
McCulloch, R., Rossi, P.E., 1994. An exact likelihood analysis of the multinomial probit model. Journal of Econometrics 64, 207–240.
McFadden, D.L., 1981. Econometric models of probabilistic choice. In: Structural Analysis of Discrete Data with Econometric Applications. The MIT Press, pp. 198–272 (Chap. 5).
Mehta, N., 2007. Investigating consumers' purchase incidence and brand choice decisions across multiple product categories: a theoretical and empirical analysis. Marketing Science 26 (2), 196–217.
Mehta, N., 2015. A flexible yet globally regular multigood demand system. Marketing Science 34 (6), 843–863.
Mehta, N., Chen, X.J., Narasimhan, O., 2010. Examining demand elasticities in Hanemann's framework: a theoretical and empirical analysis. Marketing Science 29, 422–437.
Mehta, N., Ma, Y., 2012. A multicategory model of consumers' purchase incidence, quantity, and brand choice decisions: methodological issues and implications on promotional decisions. Journal of Marketing Research XLIX (August), 435–451.
Melnikov, O., 2013. Demand for differentiated durable products: the case of the U.S. computer printer market. Economic Inquiry 51 (2), 1277–1298.
Milkman, K.L., Beshears, J., 2009. Mental accounting and small windfalls: evidence from an online grocer. Journal of Economic Behavior & Organization 71, 384–394.
Millimet, D.L., Tchernis, R., 2008. Estimating high-dimensional demand systems in the presence of many binding non-negativity constraints. Journal of Econometrics 147, 384–395.
Misra, S., 2005. Generalized reverse discrete choice models. Quantitative Marketing and Economics 3, 175–200.
Muellbauer, J., 1974. Household composition, Engel curves and welfare comparisons between households. European Economic Review 10, 103–122.
Nair, H.S., Chintagunta, P., 2011. Discrete-choice models of consumer demand in marketing. Marketing Science 30 (6), 977–996.
Nair, H.S., Chintagunta, P., Dubé, J.-P., 2004. Empirical analysis of indirect network effects in the market for personal digital assistants. Quantitative Marketing and Economics 2, 23–58.
Narayanan, S., Chintagunta, P., Miravete, E.J., 2007. The role of self selection, usage uncertainty and learning in the demand for local telephone service. Quantitative Marketing and Economics 5, 1–34.
Neary, J., Roberts, K., 1980. The theory of household behavior under rationing. European Economic Review 13, 25–42.
Neslin, S.A., Henderson, C., Quelch, J., 1985. Consumer promotions and the acceleration of product purchases. Marketing Science 4 (2), 147–165.
Nevo, A., 2001. Measuring market power in the ready-to-eat cereal industry. Econometrica 69 (2), 307–342.
Nevo, A., 2011. Empirical models of consumer behavior. Annual Review of Economics 3, 51–75.
Ogaki, M., 1990. The indirect and direct substitution effects. The American Economic Review 80 (5), 1271–1275.
Ohashi, H., 2003. The role of network effects in the US VCR market, 1978–1986. Journal of Economics and Management Strategy 12 (4), 447–494.
Pakes, A., 2014. Behavioral and descriptive forms of choice models. International Economic Review 55 (3), 603–624.
Pauwels, K., Srinivasan, S., Franses, P.H., 2007. When do price thresholds matter in retail categories? Marketing Science 26 (1), 83–100.
Pesendorfer, M., 2002. Retail sales: a study of pricing behavior in supermarkets. Journal of Business 75 (1), 33–66.
Petrin, A., Train, K.E., 2010. A control function approach to endogeneity in consumer choice models. Journal of Marketing Research 47 (1), 3–13.
Phaneuf, D., Smith, V., 2005. Recreation demand models. In: Handbook of Environmental Economics. North-Holland, pp. 671–762.
Pollak, R.A., Wales, T.J., 1992. Demand System Specification and Estimation. Oxford University Press.
Ransom, M.R., 1987. A comment on consumer demand systems with binding non-negativity constraints. Journal of Econometrics 34, 355–359.
Reinholtz, N., Bartels, D., Parker, J.R., 2015. On the mental accounting of restricted-use funds: how gift cards change what people purchase. Journal of Consumer Research 42, 596–614.
Reiss, P.C., White, M.W., 2001. Household Electricity Demand Revisited. NBER Working Paper 8687.
Samuelson, P.A., 1974. Complementarity: an essay on the 40th anniversary of the Hicks-Allen revolution in demand theory. Journal of Economic Literature 12 (4), 1255–1289.
Seetharaman, P.B., Chib, S., Ainslie, A., Boatwright, P., Chan, T.Y., Gupta, S., Mehta, N., Rao, V.R., Strijnev, A., 2005. Models of multi-category choice behavior. Marketing Letters 16 (3), 239–254.
Seiler, S., 2013. The impact of search costs on consumer behavior: a dynamic approach. Quantitative Marketing and Economics.
Shiller, B.R., 2015. First-Degree Price Discrimination Using Big Data. Working Paper.
Song, I., Chintagunta, P., 2007. A discrete-continuous model for multicategory purchase behavior of households. Journal of Marketing Research 44 (November), 595–612.
Stone, R., 1954. Linear expenditure systems and demand analysis: an application to the pattern of British demand. The Economic Journal 64 (255), 511–527.
Sun, B., 2005. Promotion effect on endogenous consumption. Marketing Science 24 (3), 430–443.
Sun, B., Neslin, S.A., Srinivasan, K., 2003. Measuring the impact of promotions on brand switching when consumers are forward looking. Journal of Marketing Research 40 (4), 389–405.
Tamer, E., 2003. Incomplete simultaneous discrete response model with multiple equilibria. Review of Economic Studies 70, 147–165.
Tamer, E., 2010. Partial identification in econometrics. Annual Review of Economics 2 (1), 167–195.
Thaler, R., 1985. Mental accounting and consumer choice. Marketing Science 4 (3), 199–214.
Thaler, R.H., 1999. Anomalies: saving, fungibility, and mental accounts. Journal of Economic Perspectives 4 (1), 193–205.
Thomassen, O., Seiler, S., Smith, H., Schiraldi, P., 2017. Multi-category competition and market power: a model of supermarket pricing. American Economic Review 107 (8), 2308–2351.
Train, K.E., McFadden, D.L., Ben-Akiva, M., 1987. The demand for local telephone service: a fully discrete model of residential calling patterns and service choices. Rand Journal of Economics 18 (1), 109–123.
van Soest, A., Kapteyn, A., Kooreman, P., 1993. Coherency and regularity of demand systems with equality and inequality constraints. Journal of Econometrics 57, 161–188.
van Soest, A., Kooreman, P., 1990. Coherency of the indirect translog demand system with binding nonnegativity constraints. Journal of Econometrics 44, 391–400.
Varian, H.R., 1989. Price discrimination. In: Handbook of Industrial Organization, vol. 1. Elsevier Science Publishers, pp. 597–654 (Chap. 10).
Villas-Boas, J.M., Winer, R.S., 1999. Endogeneity in brand choice models. Management Science 45 (10), 1324–1338.
Villas-Boas, J.M., Zhao, Y., 2005. Retailer, manufacturers, and individual consumers: modeling the supply side in the ketchup marketplace. Journal of Marketing Research XLII (February), 83–95.
Wales, T., Woodland, A., 1983. Estimation of consumer demand systems with binding non-negativity constraints. Journal of Econometrics 21, 263–285.
Walsh, J.W., 1995. Flexibility in consumer purchasing for uncertain future tastes. Marketing Science 14 (2), 148–165.
Wooldridge, J.M., 2002. Econometric Analysis of Cross Section and Panel Data. MIT Press.
Yang, S., Chen, Y., Allenby, G.M., 2003. Bayesian analysis of simultaneous demand and supply. Quantitative Marketing and Economics 1, 251–275.
Yao, S., Mela, C.F., Chiang, J., Chen, Y., 2012. Determining consumers' discount rates with field studies. Journal of Marketing Research 49, 822–841.


Inference for marketing decisions


Greg M. Allenby a, Peter E. Rossi b,∗

a Fisher School of Business, Ohio State University, Columbus, OH, United States
b Anderson School of Management, University of California Los Angeles, Los Angeles, CA, United States
∗ Corresponding author: e-mail address: [email protected]

Contents

1 Introduction
2 Frameworks for inference
2.1 A brief review of statistical properties of estimators
2.2 Distributional assumptions
2.3 Likelihood and the MLE
2.4 Bayesian approaches
2.4.1 The prior
2.4.2 Bayesian computation
2.5 Inference based on stochastic search vs. gradient-based optimization
2.6 Decision theory
2.6.1 Firms profits as a loss function
2.6.2 Valuation of information sets
2.7 Non-likelihood-based approaches
2.7.1 Method of moments approaches
2.7.2 Ad hoc approaches
2.8 Evaluating models
3 Heterogeneity
3.1 Fixed and random effects
3.2 Bayesian approach and hierarchical models
3.2.1 A generic hierarchical approach
3.2.2 Adaptive shrinkage
3.2.3 MCMC schemes
3.2.4 Fixed vs. random effects
3.2.5 First stage priors
3.2.6 Dirichlet process priors
3.2.7 Discrete first stage priors
3.2.8 Conclusions
3.3 Big data and hierarchical models
3.4 ML and hierarchical models
4 Causal inference and experimentation
4.1 The problem of observational data
4.2 The fundamental problem of causal inference

Handbook of the Economics of Marketing, Volume 1, ISSN 2452-2619, https://doi.org/10.1016/bs.hem.2019.04.007
Copyright © 2019 Elsevier B.V. All rights reserved.




CHAPTER 2 Inference for marketing decisions

4.3 Randomized experimentation
4.4 Further limitations of randomized experiments
4.4.1 Compliance in marketing applications of RCTs
4.4.2 The Behrens-Fisher problem
4.5 Other control methods
4.5.1 Propensity scores
4.5.2 Panel data and selection on unobservables
4.5.3 Geographically based controls
4.6 Regression discontinuity designs
4.7 Randomized experimentation vs. control strategies
4.8 Moving beyond average effects
5 Instruments and endogeneity
5.1 The omitted variables interpretation of “endogeneity” bias
5.2 Endogeneity and omitted variable bias
5.3 IV methods
5.3.1 The linear case
5.3.2 Method of moments and 2SLS
5.4 Control functions as a general approach
5.5 Sampling distributions
5.6 Instrument validity
5.7 The weak instruments problem
5.7.1 Linear models
5.7.2 Choice models
5.8 Conclusions regarding the statistical properties of IV estimators
5.9 Endogeneity in models of consumer demand
5.9.1 Price endogeneity
5.9.2 Conclusions regarding price endogeneity
5.10 Advertising, promotion, and other non-price variables
5.11 Model evaluation
6 Conclusions
References


1 Introduction

Much has been written on the virtues of various inference frameworks or paradigms. The ultimate judgment regarding the usefulness of a given inference framework depends upon the nature of the inference challenges presented by a field of application. In this chapter, we discuss important challenges for inference presented by both the nature of the problems dealt with in marketing and the nature of the data available. Given that much of the current work in quantitative marketing is influenced by economics, we will also contrast the prevalent view in economics regarding inference with what we have found useful in marketing applications.

One important goal of quantitative marketing is to devise marketing policies which will help firms to optimize their choice of marketing actions. For example, a firm might seek to improve profitability by better measurement of the demand function for its products. Ultimately, we would like to maximize profitability over the space of


policy functions which determine the levels and combinations of marketing actions. This goal imposes a high bar for inference and modeling, requiring a measurement of the entire surface which relates marketing actions to sales consequences, not just a derivative at a point or an average derivative. In addition to these problems in response surface estimation, some marketing decisions take on a discrete nature such as which person or sub-group to target for advertising exposure or which ad creative is best. For these actions, the problem is how to evaluate a large number of discrete combinations of marketing actions. To help solve the demanding problem of optimizing firm actions, researchers in marketing have access to an unprecedented amount of highly detailed data. Increasingly, it is possible to observe highly disaggregate data on a growing number of consumer attributes, purchase history, search behavior, and interaction with firms. Such data can be aggregated over consumers, time, and products. At its most granular level, marketing data involves observing individual consumers through the process of consideration and purchase of specific products. In any given time period, most consumers purchase only a tiny fraction of the products available to them. Thus, the most common observation in purchase data is a "0." In addition, products are only available in discrete quantities, with the most commonly observed quantity being "1." This puts a premium on models of demand which generate corner solutions as well as econometric models of discrete or limited dependent variables (see Chapter 1 for discussion of demand models which admit corner solutions). Consumer panel data features not only a very large number of variables which characterize consumer history but also a very large number of consumers observed over a relatively short period of time. In the past, marketing researchers used only highly aggregate data, where the aggregation is over consumers and products.
Typically, information about the consideration of products or product search was not available. Today, in the digital era at least, we observe search behavior from browsing history. This allows for the possibility that we can make direct inferences about consumer preferences before the point of purchase. In the past, only demographic or geo-demographic1 consumer characteristics were observed. Now we observe self-generated and other social media content which can help the researcher infer preferences. We also observe the social network of many if not most potential customers, opening up new possibilities for targeted marketing activities. This explosion of data holds out great promise for improved "data-based" marketing decisions, while at the same time posing substantial challenges to traditional estimation methods. For example, in pure predictive tasks we are faced with a huge number (more than 1 billion in some cases) of potential explanatory variables. Firms have been quick to use new sources of consumer data as the basis for marketing actions. The principal way this new data has been used is to target messages

1 Demographics inferred from the location of the residence of the consumer. Here the assumption is that consumers in a given geographic area are similar in demographic characteristics.



CHAPTER 2 Inference for marketing decisions

and advertisements in a much more customized and, hopefully, effective way. If, for example, which ad is displayed to a customer is a function of the customer's preferences, this creates a new set of challenges for statistical methods which are based on the assumption that explanatory variables are chosen exogenously or as though they result from a process independent of the outcome variable. Highly customized and targeted marketing activities make these assumptions untenable and put a premium on finding and exploiting sources of exogenous (random-like) variation. Some would argue that the only true solution to the "endogeneity" problem created by targeted actions is true random variation of the sort that randomized experimentation is thought to deliver. Many in economics have come to the view that randomized experiments are one of the few ways to obtain a valid estimate of the effect of an economic policy. However, our goal in marketing is not just to estimate the effect of a specific marketing action (such as exposure to a given ad) but to find a policy which can help firms optimize marketing actions. It is not at all clear that optimization purely via randomized experimentation is feasible in the marketing context. Conventional randomized experiments can only be used to evaluate (without a model) discrete actions. With more than one marketing variable and many possibilities for each variable, the number of possible experiments required in a purely experimental approach becomes prohibitively large.2 In Section 2, we consider various frameworks for inference and their suitability given the desiderata of marketing applications. Given the historic emphasis on predictive validation in marketing and the renewed emphasis spurred by the adoption of Machine Learning methods, it is important to review methods for evaluating models and inference procedures.
Pervasive heterogeneity in marketing applications has spurred a number of important methodological developments, which we review in Section 3. In Section 4, we discuss the role of causal inference in marketing applications and the advantages and disadvantages of experimental and non-experimental methods. Finally, we consider the endogeneity problem and various IV approaches in Section 5.

2 Frameworks for inference

Researchers in marketing have been remarkably open to many different points of view in statistical inference, taking a decidedly practical view that new statistical methods and procedures might be useful in marketing applications. Currently, marketing researchers are busy investigating the potential for Machine Learning techniques to be useful in marketing problems. Prior to Machine Learning, Bayesian methods made considerable inroads into both industry and academic practice. Again, Bayesian methods were welcomed into marketing as long as they proved to be worthwhile in

2 Notwithstanding developments that use approximate multi-arm bandit solutions to reduce the cost of experimentation with many alternatives (Scott, 2014).


the sense of compensating for the costs of using the methods with improved inferences and predictions. In recent years, a number of researchers in marketing as well as allied areas of economics have brought perspectives from their training in economics to bear on marketing problems. In this section, we will review the various paradigms for inference and provide our perspective on the relative advantages and disadvantages of each approach. To provide a concrete example for discussion, consider a simple marketing response model which links sales to marketing inputs:

y = f(x|θ)

Here y is sales and x is a vector of inputs such as prices or promotional/advertising variables. The goal of this modeling exercise is to estimate the response surface for the purpose of making predictions regarding the level of sales expected over the input space. These predictions can then be used to maximize profits and provide guidance to improve firm policies regarding the setting of these inputs. In the case of price inputs, the demand theory discussed in Chapter 1 of this volume can be used to select the functional form for f(). However, most researchers would want to explicitly represent the fact that sales data is not a deterministic function of the input variables – and represent deviations of y from the predictions of f() as draws from an error distribution. It is common to introduce an additive error term into this model:

y = f(x|θ) + ε    (1)


There are several ways to view ε. One way is to view the error term as arising from functional form approximation error. In this view, the model parameters can be estimated via pure projection methods such as non-linear least squares. Since the estimator is a projection, the error term is, by construction, orthogonal to the gradient of f. Another interpretation of the error term is as arising from omitted variables (such as variables which describe the environment in which the product is sold but are not observed or included in x). These could also include unobservable demand shocks. In random utility models of demand, the error terms are introduced out of convenience to "rationalize" or allow for the fact that when markets or individuals are faced with the same value of x, they don't always demand the same quantity. For these situations, some further assumptions are required in order to perform inference. If we assume that the error terms are independent of x,3 then we can interpret the projection estimator as arising from the moment condition

E_{x,ε}[∇f(x|θ) ε] = 0.

Here ∇f is the gradient of the response surface with respect to θ. This moment condition can be used to rationalize the non-linear least squares projection in the

3 In linear models, this assumption is usually given as ε is mean independent of x, E[ε|x] = 0. Since we are allowing for a general functional form in f, we must make a stronger assumption of full independence.




sense that we are choosing parameter values, θ, so that the sample moment is held as close as possible to the population moment; the sample moment condition is the first order condition for non-linear least squares. The interpretation of least squares as a method of moments estimator based on assumptions about the independence of the error term and the gradient of the response surface provides a "distribution-free" basis for estimation, in the sense that we do not have to specify a particular distribution for the error term. The independence assumption requires that the x variables are chosen independently of the error term or shocks to aggregate demand. In settings where the firm chooses x with some partial knowledge of the demand shocks, the independence assumption is violated and we must resort to other methods of estimation. In Section 5.3 below, we consider these methods.
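The projection view and its moment-condition interpretation can be illustrated with a small simulation. The constant-elasticity response f(x|θ) = θ₀x^{θ₁}, the parameter values, and the data-generating process below are all invented for illustration; they are not from this chapter. The sketch fits the model by non-linear least squares and checks that the sample analogue of the moment condition E[∇f(x|θ)ε] = 0 holds at the estimate:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Hypothetical constant-elasticity sales response: f(x|theta) = a * x**b,
# with x a price-like input. These names and values are illustrative only.
def f(x, a, b):
    return a * x ** b

x = rng.uniform(1.0, 5.0, size=2000)      # inputs chosen independently of the error
eps = rng.normal(0.0, 0.5, size=2000)     # additive demand shock, independent of x
y = f(x, 10.0, -1.5) + eps                # true theta = (10, -1.5)

# Non-linear least squares projection of y on f(x|theta)
theta_hat, _ = curve_fit(f, x, y, p0=[1.0, -1.0])

# The first order condition of NLS is the sample analogue of the moment
# condition E[grad_theta f(x|theta) * eps] = 0; the gradient components
# of f with respect to (a, b) are x**b and a*log(x)*x**b.
resid = y - f(x, *theta_hat)
g1 = np.mean(x ** theta_hat[1] * resid)
g2 = np.mean(theta_hat[0] * np.log(x) * x ** theta_hat[1] * resid)
```

At the NLS solution the sample moments g1 and g2 are numerically zero, which is exactly the sense in which the projection estimator "uses" the independence assumption.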

2.1 A brief review of statistical properties of estimators

This chapter is not designed to be a reference on inference methods in general, but, instead, to discuss how features of marketing applications make particular demands of inference and what methods have shown promise in the marketing literature. However, to fix notation and to facilitate discussion, we provide a brief review of the statistical properties of estimation procedures. Prior to obtaining or analyzing a dataset, it is entirely reasonable to choose an estimation procedure for model parameters on the basis of the general properties of the procedure. Statistical properties of an estimation procedure are deduced by regarding the procedure as a function of the data and studying the sampling properties of the estimator, i.e., the distribution of the estimator over repeated samples from a specific data generation mechanism. That is, we have an estimator, θ̂ = g(D), where D represents the data. The estimator is specified by the function g(). Viewed in this fashion, the estimator is a random variable whose distribution comes from the distribution of the data via the summary function, g. The performance of the estimator must be gauged by specifying a loss function and examining the distribution of loss. A common choice is the squared error loss function, ℓ(θ̂, θ) = (θ̂ − θ)ᵗA(θ̂ − θ), where A is a positive-definite weighting matrix. We can evaluate alternative estimation procedures by comparing their distributions of loss. Clearly, we would prefer estimators with loss massed near zero. For convenience, many use MSE ≡ E_D[ℓ(θ̂(D), θ)] as a summary of the distribution of squared error and look for estimators that offer the lowest possible value of MSE. Unfortunately, there is no general solution to the problem of finding the minimal MSE procedure for all possible values of θ, even conditional on a specific family of data generating mechanisms/distributions.
This problem is well illustrated by shrinkage estimation methods. Shrinkage methods modify an existing estimation procedure by "shrinking" the point estimates toward some point in the parameter space (typically 0). Thus, shrinkage methods will have lower MSE than the base estimation procedure for values of θ near the shrinkage point but higher MSE far from that point.
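This near/far behavior is easy to verify by Monte Carlo. The sketch below shrinks the sample mean of normal data toward zero; the shrinkage factor of 0.8, the sample size, and the normal data-generating process are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

def mse(theta, shrink, n=20, reps=20000):
    # Monte Carlo MSE of the (possibly shrunk) sample mean for N(theta, 1) data.
    draws = rng.normal(theta, 1.0, size=(reps, n))
    est = shrink * draws.mean(axis=1)   # shrink toward 0; shrink = 1 is the base estimator
    return np.mean((est - theta) ** 2)

# Near the shrinkage point (theta = 0) the shrunk estimator wins;
# far from it (theta = 5) the squared bias overwhelms the variance saving.
near_base, near_shrunk = mse(0.0, 1.0), mse(0.0, 0.8)
far_base, far_shrunk = mse(5.0, 1.0), mse(5.0, 0.8)
```

Neither estimator dominates the other over the whole parameter space, which is precisely why MSE comparisons alone cannot single out a best procedure.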


As a practical matter, we do not expect arbitrarily large values of θ, and this gives shrinkage methods an advantage in most applications. But the point is that shrinkage methods do not dominate the base method, because they improve upon it only in certain parts of the parameter space at the expense of inferior performance elsewhere. The best we can say is that we would like to use estimators in the admissible class of estimators – estimators that can't be improved upon everywhere in the parameter space. Another way of understanding the problem is to observe that MSE can be re-expressed in terms of bias and sampling variance. For a scalar estimator,

MSE = (E[θ̂] − θ)² + V(θ̂) = Bias² + Variance

Clearly, if one can find an estimation procedure that reduces both bias and sampling variance, this procedure will improve MSE. However, at some point in the pursuit of efficient estimators, there may well be a trade-off between these two terms. That is, we can reduce overall MSE by a favorable trade-off of somewhat larger bias for an even larger reduction in variance. Many of the modern shrinkage and variable selection procedures exploit this trade-off by finding a favorable point on the bias-variance frontier. A further complication in the evaluation of estimation procedures is that the sampling distribution (and MSE) can be extremely difficult to derive and there may be no closed-form expression for MSE. This has led many statisticians and econometricians to resort to large sample or asymptotic approximations.4 A large sample approximation is the result of an imaginary sampling experiment in which the sample size is allowed to grow indefinitely. While we may not be able to derive the sampling distribution of an estimator for a fixed N, we may be able to approximate its distribution for arbitrarily large N or in a limiting sense.
This approach consists of two parts: (1) a demonstration that the distribution of an estimator is massed arbitrarily close to the true value for large enough N and (2) the use of some variant of the Central Limit Theorem to provide a normal large sample or asymptotic approximation to the sampling distribution. In a large sample framework, we consider infinite increases in the sample size. Clearly, any reasonable procedure (with enough independence in the data) should "learn" the true parameter value with access to an unlimited amount of data. A procedure that does not learn at all from larger and larger datasets is clearly a fundamentally broken method. Thus, a very minimal requirement of any estimation procedure is that as N grows to infinity, we should see the sampling distribution massed closer and closer to the true value of θ with smaller and smaller MSE. This property is called consistency and is usually defined by what is called a probability limit or plim. The idea of a plim is very simple – if the mass of the sampling distribution becomes concentrated closer and closer to the true value, then we should be able

4 We note that, contrary to popular belief, the bootstrap does not provide finite sample inference and can only be justified by appeal to large sample arguments.




to find a sample size sufficiently large that, for any sample of that size or larger, the probability that the estimator lies near the true value is as close to one as we like. It should be emphasized that the consistency property is a very minimal property. Among the set of consistent estimation procedures, there can be some procedures with lower MSE than others. As a simple example, consider estimation of the mean. The Law of Large Numbers tells us that, under very minimal assumptions, the sample mean converges to the true mean. This will be true whether we use all observations in our sample or only every 10th observation. However, if we estimate the mean using only 1/10 of the data, this procedure will be consistent but inefficient relative to the standard procedure. If an estimation procedure has a finite sample bias, this bias must shrink to zero in the asymptotic experiment or else the estimator would be inconsistent. For this reason, consistent estimators are sometimes referred to as "asymptotically unbiased." In other words, in large samples MSE converges to just the sampling variance. Thus, asymptotic evaluation of estimators is entirely about comparison of sampling variance: statistical efficiency in large samples is measured by sampling variance.
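The every-10th-observation example can be simulated directly. The sketch below (sample sizes and the normal data-generating process are illustrative) shows that both estimators are consistent – their sampling distributions tighten as N grows – but the thinned estimator pays a permanent efficiency price:

```python
import numpy as np

rng = np.random.default_rng(2)

def sampling_sd(n, thin, reps=2000):
    # Monte Carlo sampling std. dev. of the mean computed on every
    # `thin`-th observation of an N(0, 1) sample of size n.
    draws = rng.normal(0.0, 1.0, size=(reps, n))
    est = draws[:, ::thin].mean(axis=1)
    return est.std()

# Both estimators are consistent: sampling sd shrinks as n grows...
sd_full_small, sd_full_big = sampling_sd(100, 1), sampling_sd(2500, 1)
sd_thin_small, sd_thin_big = sampling_sd(100, 10), sampling_sd(2500, 10)

# ...but the thinned estimator is inefficient: its sd is about
# sqrt(10) ~ 3.16 times larger at any sample size.
ratio = sd_thin_big / sd_full_big
```

Consistency alone therefore cannot discriminate between the two procedures; the efficiency comparison does.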

2.2 Distributional assumptions

To complete the econometric specification of the sales response model in (1), some would argue that we have to make an assumption regarding the joint distribution of the vector of error terms. Typically, we assume that the error terms are independent, which is not really a restrictive assumption in a cross-section of markets. There is a prominent school of thought in econometrics that one should not make any more distributional assumptions than are minimally necessary to identify the model parameters. This view is associated with the Generalized Method of Moments (see Hansen, 1982) and applications in macro/finance. In those applications, the moment conditions are suggested by economic theory and, therefore, are well-motivated. In our setting, the moment restriction is motivated by assumptions regarding the independence of the error term from the explanatory variables, which is not motivated by appeal to any particular economic theory of firm behavior. For these reasons, we believe that the argument for not making additional distributional assumptions is less forceful in marketing applications. Another related argument from the econometrics literature is what some call the "consistency-efficiency" trade-off. Namely, we might be willing to forgo the improved statistical efficiency afforded by methods which employ more modeling assumptions in exchange for a more "robust" estimator which gives up some efficiency in exchange for providing consistent estimates over a wider range of possible model specifications. In the case of a simple marketing mix response model with additive error terms and a continuous sales response, changes in the distributional model for the errors will typically not result in inconsistent parameter estimators. However, various estimators of the sampling variance of the estimators may not be consistent in the presence of heteroskedastic error terms.
For this reason, various authors (see, for example, Angrist and Pischke, 2009) endorse the use of Eicker-White style heteroskedasticity-consistent estimators of the variance. If a linear approximation to an


underlying non-linear response function is used, then we might expect to see heteroskedastic errors, as the error term would include functional form approximation error. In marketing applications, the use of a linear regression function can be much less appealing than in economics. We seek to optimize firm behavior, and linear approximations to sales response models will not permit computation of global optima. For these reasons, we believe it is far less controversial, in marketing as opposed to economics, to complete the econometric specification with a specific distributional assumption for basic sales response and demand models. Moreover, in demand models which permit discrete outcomes, as discussed in Chapter 1, the specification of the error term is really part of the model, and few are interested in model inferences that are free of a specific distributional choice for the model error terms. For example, the pervasive use of logit and other closely related models is based on the use of extreme value error terms which can be interpreted as marginal utility errors. While one might want to change the assumed distribution of error terms to create a different model specification, we do not believe that "distribution-free" inference has much value in models with discrete outcomes.
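The Eicker-White sandwich estimator mentioned above can be sketched for a linear model. The data-generating process here, with an error standard deviation that grows in x, is invented for illustration; it is not an example from the chapter:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
x = rng.uniform(0.0, 2.0, size=n)
X = np.column_stack([np.ones(n), x])
eps = (0.1 + x ** 2) * rng.normal(0.0, 1.0, size=n)  # heteroskedastic: sd grows with x
y = X @ np.array([1.0, 2.0]) + eps                   # true intercept 1, slope 2

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical OLS variance assumes a common error variance; the Eicker-White
# "sandwich" uses squared residuals observation by observation and remains
# consistent under heteroskedasticity.
s2 = resid @ resid / (n - 2)
var_classical = s2 * XtX_inv
meat = X.T @ (X * resid[:, None] ** 2)
var_white = XtX_inv @ meat @ XtX_inv

se_classical = np.sqrt(np.diag(var_classical))
se_white = np.sqrt(np.diag(var_white))
```

Because the error variance is largest where x is most informative about the slope, the classical standard error understates the slope's sampling variability while the sandwich estimator does not.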

2.3 Likelihood and the MLE

Invariably, we will want to make inferences either for generic response surface models as in (1) or in demand models which are derived from an assumed direct utility function along with a distribution of marginal utility errors. The generic demand model can be written as

y = g(p, E, ε|θ)    (2)


where y is a vector of quantities demanded, p is a vector of prices, E is expenditure allocated to the group of products in the demand group, and ε is a vector of marginal utility errors. In either the case of (1) or (2), if we postulate a distribution of the error terms, this will induce a distribution on the response variable in (1) or on the vector of demanded quantities in (2). As is well known, this allows us to compute the likelihood of the observed data. In the case of the generic sales response model, the likelihood is the joint distribution of the observed data:

p(y, x|θ, ψ) = p(y|x, θ) p(x|ψ)    (3)


Here p(y|x, θ), the conditional distribution of y given x, is derived from (1) along with an assumed distribution of the error term. With additive error terms, the Jacobian from ε to y is 1. In the demand model application, x consists of (p, E) and the Jacobian may be more complicated (see Chapter 1 for many examples). In the case where the error terms are independent of the right hand side variables, the marginal distribution of the right hand side variables, e.g. p(x|ψ) above, will not depend on the parameters that govern the dependence of the left hand side variable on the right hand side variables. Under these assumptions, the likelihood for θ is proportional to the conditional




distribution:

ℓ(θ) ∝ p(y|x, θ)    (4)


A very powerful principle is the Likelihood Principle, which states that all sample-based information is reflected in the likelihood function. Or, put another way, the likelihood is sufficient for the data. Two datasets with the same likelihood function are informationally equivalent with respect to inferences about θ even though the datasets need not have identical observations.5 Another consequence of the likelihood principle is that approaches which are not based on the likelihood function use potentially inferior information and, therefore, may not be efficient. Some statisticians believe that one must base inference procedures on more than just the likelihood and do not endorse the likelihood principle. Typically, this is argued via special and somewhat pathological examples where strict adherence to the likelihood principle produces nonsensical estimators. We are not aware of any practical example in which it is necessary to take into account more than the likelihood to make sensible inferences. Exactly how the likelihood is used to create estimators does not follow directly from the likelihood principle. The method of maximum likelihood creates an estimator from the maximum of the likelihood function:

θ̂_MLE = arg max_θ ℓ(θ)    (5)


The MLE appears to obey the likelihood principle in the sense that it depends only on the likelihood function. However, the MLE uses only one feature of the likelihood function (the max), and the analysis of the MLE depends only on the local behavior of the likelihood function in the vicinity of the MLE. The properties of the MLE can only be deduced via large sample analysis. Under moderately weak conditions, the MLE is consistent and asymptotically normal. The most attractive aspect of the MLE is that it attains the Cramer-Rao lower bound for sampling variance in large samples. In other words, there is no other consistent estimator which can have a smaller sampling variance in large samples. However, the finite sample properties of the MLE are not especially compelling. There are examples where the MLE is inadmissible. The analysis of the MLE strongly suggests that if an estimator is to be considered useful, it should be asymptotically equivalent to the MLE. However, in finite samples, an estimator may differ appreciably from the MLE and have superior sampling properties. From a practical point of view, to use the MLE the likelihood must be evaluated at low cost and maximization must be feasible. In order for asymptotic inference to be conducted, the likelihood must be differentiable at least in a neighborhood of the

5 In cases where the sampling mechanism does not depend on θ, the likelihood principle states that

inference should ignore the sampling mechanism. The classic example is the coin toss experiment. It does not matter whether the observed data of m heads in n coin tosses was acquired by tossing a coin n times or tossing a coin until m heads appear. This binomial versus negative binomial example has spurred a great deal of debate.


maximum. Typically, researchers use non-linear programming techniques to maximize the likelihood, and most of these methods are gradient-based. There are other methods, such as simulated annealing and simplex methods, which do not require the gradient vector; however, these methods are slow and often impractical with a large number of parameters. In addition, a maximum of the likelihood is not useful without a method for conducting inference, which requires computation of gradients and/or second derivatives. In many settings, we assume that economic agents act with a larger information set than is available to the researcher. In these situations, the likelihood function for the observed data must be evaluated by integrating over the distribution of the unobservable variables. In many cases, this integration must be accomplished by numerical means. Integration by simulation-based methods creates a potentially non-smooth likelihood and can create problems for those who use gradient-based non-linear programming algorithms to maximize the likelihood. Inference for the MLE is accomplished by reference to the standard asymptotic result:

√N (θ̂_MLE − θ) ∼ N(0, I⁻¹)  asymptotically,

where I is the information matrix, I = −E[∂² ln ℓ / ∂θ∂θᵗ]. For any but the most trivial models, the expected information matrix must be approximated, typically by using an estimate of the Hessian evaluated at the MLE. Most optimizers provide such an estimate on convergence. The quality of Hessian estimates, particularly where numerical gradients are used, can be very low, and it is advisable to expend additional computational power at the optimum to obtain the highest quality Hessian estimate possible.
Typically, numerical Hessians should be symmetrized.6 In some cases, numerical Hessians can be close to singular (ill-conditioned) and cannot be used to estimate the asymptotic variance-covariance matrix of the MLE without further "regularization."7 The average outer product of the gradient of the log-likelihood function can also be used to approximate the information matrix.
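The mechanics just described – numerical maximization, a symmetrized numerical Hessian at the optimum, and asymptotic standard errors from its inverse – can be sketched for a deliberately simple normal model. All specifics (the model, sample size, step size) are illustrative choices, not prescriptions from the text:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
y = rng.normal(2.0, 1.5, size=1000)   # data from N(mu=2, sigma=1.5)

def negloglik(theta):
    # Negative log-likelihood of N(mu, sigma^2), dropping constants;
    # sigma is parameterized via its log to keep it positive.
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum(((y - mu) / sigma) ** 2) + y.size * log_sigma

fit = minimize(negloglik, x0=np.array([0.0, 0.0]), method="BFGS")
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])

def num_hessian(f, x, h=1e-3):
    # Central-difference numerical Hessian, symmetrized as A* = 0.5*(A + A').
    k = x.size
    H = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            e_i, e_j = np.eye(k)[i] * h, np.eye(k)[j] * h
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * h * h)
    return 0.5 * (H + H.T)

# Inverse of the observed information (Hessian of the negative log-likelihood
# at the optimum) estimates the asymptotic variance-covariance matrix of the MLE.
H = num_hessian(negloglik, fit.x)
se = np.sqrt(np.diag(np.linalg.inv(H)))   # se[0] should be near sigma/sqrt(n)
```

For this model the standard error of μ̂ from the numerical Hessian should be close to the textbook value σ/√n, which provides a quick sanity check on the Hessian computation.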

2.4 Bayesian approaches

As mentioned above, there is a view held by some econometricians that any likelihood-based approach requires additional assumptions that may not be supported by economic theory and that can be viewed as somewhat arbitrary. For this reason, non-likelihood approaches which make minimal assumptions are viewed favorably. In our view, the fact that one has to specify a complete model is a benefit rather than a cost of a likelihood-based approach to inference. That is to say, it is easy to see that some data have zero likelihood even though non-likelihood based methods might be used to

6 A∗ = 0.5(A + Aᵗ).
7 Typically achieved by adding a small amount to each of the diagonal elements of the Hessian estimate.




make inferences. For example, if we postulate a discrete choice model, we must eliminate any data that show consumers purchasing more than one product in the demand group at a single time; such data would violate the mutual exclusivity assumption of the choice model. However, we could use moment conditions to estimate a choice model which has zero likelihood for a given dataset. To guard against mis-specification, our view is that a more fruitful approach is to develop models specifically designed to accommodate the important features of marketing datasets and to be able to easily change the specifications to perform a sensitivity analysis. Given that the MLE has only desirable large sample properties, there is a need for an inference framework which adheres to the likelihood principle but offers better finite sample properties. It is well known that Bayesian procedures are asymptotically similar to the MLE but have desirable finite sample properties. Bayes estimators can be shown to be admissible, and it can also be shown that all admissible estimators are Bayes. Bayesian procedures are particularly useful in very high dimensional settings and in highly structured multi-level models. Bayesian approaches have many favorable properties, including shrinkage and adaptive shrinkage (see Section 3.2 for discussion of adaptive shrinkage), in which the shrinkage adapts to the information in the data. In the machine learning literature, these properties are called "regularization," which simply means reducing sampling error by avoiding outlandish or absurd estimates and by various forms of shrinkage. Many popular regularization methods such as the Lasso can be given a Bayesian interpretation (Park and Casella, 2008). There are many treatments of the Bayesian approach (see, for example, Gelman et al., 2004 and Rossi et al., 2005). We briefly review the Bayesian approach here.
Bayesians take the point of view that any unknown quantity (including model parameters, but also predictions and the identity of the model governing the data) should be described by a probability distribution which represents our current state of information. Given the limited information available to any researcher, no quantity that cannot be directly measured is known with certainty. Therefore, it is natural to use the machinery of probability theory to characterize the degree of uncertainty regarding any unknown quantity. Information arises from two sources: (1) prior information and (2) the data, via the likelihood. Prior information can come from other datasets, from economic theory (such as monotonicity of demand), or from "structure," such as the belief that a given dataset is drawn from a super-population. Bayes theorem provides the way in which prior and sample information are brought together to make "after the data" or a posteriori inferences. In the case of our simple response model example, Bayes theorem states

p(θ|y, X) = p(y, θ|X) / p(y|X) = p(y|X, θ) p(θ) / p(y|X)    (6)


Thus, given a prior distribution on the response parameters, Bayes theorem tells us how to combine this prior with the likelihood to obtain an expression for the posterior distribution. The practical value of Eq. (6) is that the posterior distribution of the model parameters is proportional to the likelihood times the prior density:

p(θ|y, X) ∝ ℓ(θ|y, X) p(θ)    (7)



All features of the posterior distribution are inherited from the likelihood and the prior. Of course, this equation does not define an estimator without some further assumptions. Under squared error loss, the Bayes estimator is the posterior mean, θ̂_Bayes = ∫ θ p(θ|y, X) dθ. However, Bayesians do not think of the Bayesian apparatus as simply a way to obtain an estimator but rather as a complete, and different, method of inference. The posterior distribution provides information about what we have learned from the data and the prior and expresses this information as a probability distribution. The degree of uncertainty can be expressed by the posterior probability of various subsets of the parameter space (for example, the posterior probability that a price elasticity parameter is less than −1.0). The marginal distributions of individual parameters in the θ vector are often used to characterize uncertainty through the computation of Bayesian analogues of the familiar standard errors and confidence intervals. The posterior standard deviation of an element of the θ vector is analogous to the standard error, and the quantiles of the posterior distribution can be used to construct a Bayesian "credibility" interval (the analogue of the confidence interval). Of course, the joint posterior offers much more information than the marginals – unfortunately, this information is rarely explored or provided. It is also important to note that Bayes estimators use information from both the prior and the likelihood. While the MLE is based only on the likelihood, an informative prior serves to modify the location of the posterior and thus influences the Bayes estimator. All Bayes estimators with informative priors can be interpreted as a form of shrinkage estimator. The likelihood is centered at the MLE which (within the confines of the particular parametric model) is a "greedy" estimator which tries to fit the data according to the likelihood criterion.
The prior serves to "shrink" or pull the Bayes estimator toward a compromise between the prior and the likelihood. In simple conjugate cases, the posterior mean can be written as a weighted average of the prior mean and the MLE, where the weights depend on the informativeness of the prior relative to the likelihood. Often the prior mean is set to zero, so the Bayes estimator shrinks the MLE toward zero, which improves sampling properties by exploiting the bias-variance tradeoff. However, as the sample size increases, the weight accorded the likelihood increases relative to the prior, reducing shrinkage and allowing the Bayes estimator to achieve consistency.

Most Bayesian statisticians are quick to point out that there are fundamental differences between the Bayesian measure of uncertainty and the sampling theoretic ones. Bayesian inference procedures condition on the observed data, which differs dramatically from sampling theoretic approaches that consider imaginary experiments in which new datasets are generated from the same model. Bayesians argue very convincingly that sampling properties can be useful for selecting a method of inference but that the applied researcher is interested in the information content of a given dataset. Moreover, Bayesian procedures do not depend on asymptotic experiments, which are of even more questionable relevance for a researcher who wishes to summarize the information in one finite sample. The Bayesian approach is appealing because superior sampling properties are coupled with inference statements that condition on a specific dataset. The



CHAPTER 2 Inference for marketing decisions

problem is that the Bayesian approach appears to exact higher costs than a standard MLE or method of moments approach. The cost is twofold: (1) a prior distribution must be provided and (2) some practical method must be provided to compute the many integrals that are used to summarize the posterior distribution.

2.4.1 The prior
Recognizing that the prior is an additional "cost" which many busy researchers might not be interested in providing, the Bayesian statistics literature pursued the development of various "reference" priors.8 The idea is that "reference" priors might be agreed upon by researchers as providing modest or minimal prior information and can be assessed at low cost. It is our view that the literature on reference priors is largely a failure in the sense that there cannot really be one prior or form of prior that is satisfactory in all situations. Our view is that informative priors are useful and that even a modest amount of prior information can be exceptionally valuable in eliminating absurd or highly improbable parameter estimates. Priors and prior structure become progressively more important as the parameterization of models becomes increasingly complex and high dimensional.

Bayesian methods have become almost universally adopted in the analysis of conjoint survey data9 and in the fitting of marketing mix models. These successes in the adoption of Bayesian methods come from the regularization and shrinkage properties afforded by Bayesian estimators, particularly in what is called the hierarchical setting. In Section 3, we explore Bayesian approaches to the analysis of panel data which provide an element of what is termed adaptive shrinkage, namely, a multi-level prior structure which can be inferred partly on the basis of data from other units in the panel. This notion, called "borrowing strength," is a key attribute of Bayesian procedures with highly structured and informative priors. In summary, the assessment of an informative prior is a requirement of the Bayesian approach beyond the likelihood.
In low dimensional settings, such as a linear regression with a small number of potential regressors and only one cross-section or time series, any of a range of moderately informative priors will produce similar results and, therefore, the prior is neither important nor particularly burdensome. However, in high dimensional settings, such as flexible non-parametric models or panel data with many units and relatively few observations per unit, the Bayesian approach provides a practical procedure in which the prior is important and confers real sampling benefits. Even the simple regression example becomes a convincing argument for the Bayesian approach when the number of potential regressors becomes extremely large. Many of the variable selection machine learning techniques can be interpreted as Bayesian procedures. In the background, there is an informative prior which is assessed indirectly, typically through cross-validation.

8 See, for example, Bernardo and Smith (1994), Sections 5.4 and 5.62.
9 For example, Sawtooth Software implements a Bayesian hierarchical model for choice-based conjoint.

Sawtooth is the market share leader. SAS has several procedures for fitting systems of regression equations using Bayesian approaches, and these are widely applied in the analysis of aggregate market data.


Thus, Bayesian methods have become useful even in the simple linear regression problem. For example, Lasso and ridge regression can be interpreted as Bayes estimators with particular prior forms (see, for example, Park and Casella, 2008).
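As a small illustration of this point (with toy data of our own), the ridge estimator coincides with the posterior mean of a Bayesian linear model whose prior on the coefficients is normal and centered at zero:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch: ridge regression recovered as the posterior mean of a Bayesian
# linear model with prior beta ~ N(0, (sigma^2 / lam) I) and sigma^2 = 1.
n, k, lam = 50, 5, 2.0
X = rng.normal(size=(n, k))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=n)

# Ridge estimator: argmin ||y - X b||^2 + lam * ||b||^2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Bayesian posterior mean: (X'X + lam I)^{-1} X'y
bayes_mean = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

assert np.allclose(ridge, bayes_mean)
```

The penalty parameter lam plays the role of the prior precision, which is exactly the "informative prior assessed indirectly through cross-validation" mentioned above.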

2.4.2 Bayesian computation
Historically, Bayesian methods were not used much in applications due to the difficulty of summarizing the posterior via computation of various integrals. For example, the minimal requirement of many investigators is to provide a point estimate and a measure of uncertainty. For the Bayesian, this involves computing the posterior mean and posterior standard deviation, both of which are integrals involving the marginal posterior distribution of a given parameter. That is, inference regarding θi requires computation of the marginal posterior distribution,

pi(θi|y, X) = ∫ p(θ|y, X) dθ−i

Here θ−i denotes all elements of the θ vector except the ith element. In addition, we must compute the normalizing constant for the posterior since the likelihood times the prior is only proportional to the posterior. Clearly, these integrals are available as closed-form solutions only in very special cases. Numerical integration methods such as quadrature are effective only for very low dimensional integrals.

The current popularity of Bayesian methods has been made possible by various simulation-based approaches to posterior computation. Obviously, if we could simulate from the posterior at low computational cost, requiring knowledge of the posterior only up to a normalizing constant, this would provide a practical solution. While iid samplers from arbitrary multivariate distributions are not practical, various Markov Chain methods have been devised that can effectively simulate from the posterior at low cost. These MCMC (Markov Chain Monte Carlo) methods (see Robert and Casella (2004) for a complete treatment and Rossi et al. (2005) for many applications to models of interest in marketing) create a continuous state space Markov Chain whose invariant distribution is the posterior. The accuracy of this method is determined by the ability to simulate a large number of draws from the Markov Chain at low cost as well as our ability to construct Markov Chains with limited dependence.

MCMC-based Bayes procedures use simulations or draws from the Markov Chain to compute the posterior expectation of any function of the model parameters. For example, if we have R draws from the chain, then a simulation-based estimate of the posterior expectation of any function can be obtained by simply averaging the function evaluated at each of these R draws. Typically, we use extremely large numbers of draws (often greater than 10,000) to ensure that the error in the simulation approximation is small.
Note that the number of draws used is under the control of the investigator (in contrast to the fixed sample size of the data).

Ê_θ|y,X [g(θ)] = (1/R) ∑_{r=1}^{R} g(θr)





Thus, our simulation-based estimates are averages of draws from a Markov Chain constructed to have an invariant distribution equal to the posterior. While there is a very general theory that assures ergodicity (convergence of these ensemble averages to posterior expectations), in practice the draws from the Markov Chain can be highly correlated, requiring a very large number of draws.

For the past 25 years, Bayesian statisticians and econometricians have enlarged the class of models which can be treated successfully by MCMC methods. Advances in computation have also allowed applications to data sets and models that were previously thought to be impossible to analyze. We routinely analyze highly nonlinear choice models with thousands of cross-sectional units and tens of thousands of observations. Multivariate mixtures of normals in high dimensions and with a large number of components can be fit on laptop computing equipment using an MCMC approach.

Given the availability of MCMC methods (particularly the Gibbs Sampler), statisticians have realized that many models whose likelihoods involve difficult integrals can be analyzed with Bayesian methods using the principle of data augmentation. For example, models with latent random variables that must be integrated out to form the likelihood can be analyzed in a Bayesian approach by augmenting the parameter space with these latent variables and defining a Markov Chain on this "augmented" state space. Marginalizing (integrating) out the latent variables can be achieved trivially by simply discarding the draws of the latent variables. These ideas have been applied very successfully to random coefficient models as well as models like the multinomial and multivariate Probit.

In summary, the Bayesian approach was previously thought to be impractical given the lack of a computational strategy for computing various integrals of the non-normalized posterior.
MCMC methods have not only eliminated this drawback but, with the advent of data augmentation, have made Bayesian procedures the only practical methods for models whose likelihood involves integration. The current challenge for Bayesian methods is to apply them to truly enormous data sets generated by millions of consumers and a vast number of potential explanatory variables. As currently implemented, MCMC methods are fundamentally sequential in nature (the rth draw of the Markov Chain depends on the value of the (r − 1)st draw). The vast computing power currently available comes not from the speed of any one processor but from the ability to break a computing task into pieces and farm these out to a large array of processors. Sequential computations do not naturally lend themselves to anything other than very tightly coupled computer architectures. This is a current area of research which awaits innovation in our methods as well as possible changes in the computing environment. In Section 3, we review some of the approaches to "scaling" MCMC methods to truly huge panel data sets.
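The mechanics described above can be sketched with a minimal random-walk Metropolis sampler; the target here is a toy posterior of our own choosing, known only up to its normalizing constant, and the posterior mean is computed as an ensemble average of the draws:

```python
import numpy as np

rng = np.random.default_rng(2)

def log_target(theta):
    # Unnormalized log posterior; N(3, 1) chosen purely for illustration
    return -0.5 * (theta - 3.0) ** 2

R, step = 20000, 1.0
draws = np.empty(R)
theta = 0.0
for r in range(R):
    prop = theta + step * rng.normal()
    # The accept/reject step uses only a ratio of target values, so the
    # normalizing constant of the posterior is never needed
    if np.log(rng.uniform()) < log_target(prop) - log_target(theta):
        theta = prop
    draws[r] = theta  # on rejection, the chain repeats the current state

post_mean = draws[R // 2:].mean()  # ensemble average after burn-in
assert abs(post_mean - 3.0) < 0.2
```

Because consecutive draws are correlated, the effective number of draws is smaller than R, which is why very long chains are used in practice.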

2.5 Inference based on stochastic search vs. gradient-based optimization
Up to this point, we have viewed MCMC methods as a way of indirectly sampling from the posterior distribution. The theory of MCMC methods says that, if the Markov


Chain is allowed to run long enough, the procedure will visit any "appreciable" set with frequency proportional to the posterior probability of that set. Clearly, then, an MCMC sampler will visit high probability areas of the posterior much more often than low probability areas. In some cases, the set of possible parameter values is so large that, as a practical matter, the MCMC will only visit part of the parameter space. For example, consider the application of Bayesian methods to the problem of selecting regression models from the space of all possible regression models (if there are k possible regressors there are 2^k possible models). For k > 30, any of the MCMC methods (see, for example, George and McCulloch, 1997) will visit only a subset of the possible models, and one would not necessarily want to use the frequency of model visits as an estimate of the posterior probability of a model. Thus, one way of looking at the MCMC method is as a method of stochastic search.

All MCMC methods are designed to draw points from the parameter space with some degree of randomness. Standard methods such as the Gibbs Sampler or random walk MCMC methods do not rely on any gradient information regarding the posterior (note: there are variational and stochastic gradient methods that do use gradient information). This means that the parameter space does not have to be continuous (as in the case of variable selection), nor does the likelihood have to be smooth. Some problems give rise to a likelihood function with discrete jumps (see, for example, Gilbride and Allenby, 2004). Gradient-based MLE methods simply cannot be applied in such situations, but an MCMC method does not require even continuity of the likelihood function. In other situations, the likelihood function requires an integral. The method of simulated maximum likelihood simply replaces that integral with a simulation-based estimate of the integral.
For example, the integral might be taken over a normal distribution of consumer heterogeneity. Given the simplicity and low cost of normal draws, simulated MLE seems a natural choice for evaluating the likelihood function numerically. However, given that only a finite number of draws are used to approximate the integral, the simulated likelihood function is non-smooth and gradient-based maximization methods can easily fail. A Bayesian has the choice of using a random walk or similar MCMC method directly on the likelihood function evaluated by simulation-based integration, or of augmenting the problem with latent variables. In either case, the Bayesian using stochastic search methods does not depend on any smoothness of the likelihood function.

2.6 Decision theory
Most of the recent Bayesian literature in marketing emphasizes the value of the Bayesian approach to inference, particularly in situations with limited information. Bayesian inference is only a special case of the more general Bayesian decision-theoretic approach. Bayesian decision theory has two critical and separate components: (1) a loss function and (2) the posterior distribution. The loss function associates a loss with a state of nature and an action, L(a, θ), where a is the action and θ is the state of nature. The optimal decision maker chooses the action so as




to minimize expected loss, where the expectation is taken with respect to the posterior distribution:

min_a L̄(a) = ∫ L(a, θ) p(θ|Data) dθ

Inference about θ can be viewed as a special case of decision theory where the "action" is to choose an estimate based on the data. Model choice can also be thought of as a special case of decision theory. If the loss function associated with model choice takes on the value of 0 if the model is correct and 1 if not, then the solution which minimizes expected loss is to select the model (from a set of models) with highest posterior probability (for examples and further details see Chapter 6 of Rossi et al., 2005).

2.6.1 Firm profits as a loss function
In the Bayesian statistical literature, decision theory has languished because there are few compelling loss functions beyond those chosen for mathematical convenience. The loss function must come from the subject area of application and is independent of the model. That is to say, a principal message of decision theory is that we use the posterior distribution to summarize the information in the data (via the likelihood) and the prior, and that decisions are made as a function of that information and a loss function. In marketing, we have a natural loss function, namely, the profit function of the firm. Strictly speaking, the profit function is not a loss function which we seek to minimize: we maximize profits, or equivalently minimize the negative of profits. For the simple sales response model presented here, the profit function is

π(x|θ) = E[y|x, θ](p − c(x)) = f(x|θ)(p − c(x))   (9)

where y = f(x|θ) + ε and c(x) is the cost of providing the vector of marketing inputs.10 Optimal decision theory prescribes that we should make decisions so as to maximize the posterior expectation of the profit function in (9):

x* = argmax_x π̄(x),   π̄(x) = ∫ π(x|θ) p(θ|Data) dθ

The important message is that we act to optimize profits based on the posterior expectation of profits rather than inserting our "best guess" of the response parameters (the plug-in approach) and proceeding as though this estimate is the truth. The "plug-in" approach can be thought of as expressing overconfidence in the parameter estimates. If the profit function is non-linear in θ, then the plug-in and full decision theoretic

10 It is a simple matter to include covariates outside the control of the firm in the sales response surface.

Optimal decisions could either be made conditional on this vector of covariates, or the covariates could be integrated out according to some predictive distribution.


approaches will yield different solutions, and the plug-in approach will typically overstate potential profits. In commercial applications of marketing research, many firms offer what are termed marketing mix models. These models are built to help advise firms how to allocate their budgets over many possible marketing activities including pricing, trade promotions, and advertising of many kinds including TV, print, and various forms of digital advertising. The options in digital advertising have exploded and now include sponsored search, web-site banner ads, product placement ads, and social advertising. The marketing mix model is designed to attack the daunting task of estimating the return on each of these activities and making predictions regarding the consequences of possible reallocations of resources on firm profits. As indicated above, the preferred choice for estimation in marketing mix applications is Bayesian methods applied to sets of regression models. However, in making recommendations to clients, the marketing mix modeler simply "plugs in" the Bayes estimates and is guilty of overconfidence. The problem with this approach is that, if taken literally, the conclusion is often to put all advertising resources in only one "bucket" or type of advertising. The full decision theoretic approach avoids these problems created by overconfidence in parameter estimates.
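The divergence between the plug-in and full decision-theoretic rules can be illustrated with a stylized demand function and numbers of our own choosing (not the chapter's model): exponential demand with an uncertain slope, where the profit function is non-linear in the uncertain parameter.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stylized sketch: demand f(p|b) = exp(a - b*p), unit cost c, profit
# (p - c) * f(p|b), with posterior uncertainty about the slope b.
a, c = 5.0, 1.0
b_draws = rng.normal(loc=1.0, scale=0.3, size=5000)  # "posterior" draws of b

prices = np.linspace(1.01, 6.0, 500)

def profit(p, b):
    return (p - c) * np.exp(a - b * p)

# Plug-in rule: optimize profit at the posterior mean of b
plug_in_price = prices[np.argmax(profit(prices, b_draws.mean()))]

# Full decision theory: optimize the posterior *expectation* of profit
exp_profit = np.array([profit(p, b_draws).mean() for p in prices])
full_price = prices[np.argmax(exp_profit)]

# Because profit is non-linear in b, the two rules choose different prices
assert full_price > plug_in_price
```

For this demand curve the plug-in optimum is c + 1/mean(b), while averaging over parameter uncertainty shifts the optimal price upward; the point is simply that the two rules disagree whenever profit is non-linear in θ.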

2.6.2 Valuation of information sets
An important problem in modern marketing is the valuation of information. Firms have an increasingly extensive array of possible information sets on which to base decisions. Moreover, acquisition of information is considered a major part of strategic decisions. For example, Amazon's recent acquisition of Whole Foods, as well as its opening of brick and mortar stores, has been thought to be motivated by the rich set of offline information which can be accumulated by observing Amazon customers in these store environments. In China, retailing giants Alibaba and JD.com have built what some term "data ecosystems" that can link customers across many different activities including web browsing, social media, and on-line retail. On-line ad networks and programmatic ad platforms offer unprecedented targeting opportunities based on information regarding consumer preferences and behavior. The assumption behind all of these developments is that new sources of information are extremely valuable.

The valuation of information is clearly an important part of marketing. One way of valuing information is by the ability of new information to provide improved estimates of consumer preferences and, therefore, more precise predictions of consumer response to various marketing activities. However, statistically motivated estimation and prediction criteria such as mean squared error do not place a direct monetary valuation on information. This can only be obtained in a specific decision context with a valid loss function such as firm profits. To make this clear, consider two information sets, A and B, regarding sales response parameters. We can value these information sets by solving the decision theoretic problem and comparing the attainable expected




profits for the two information sets. That is, we can compute

π̄_k = max_x ∫ π(x|θ) p_k(θ) dθ,   k = A, B

where p_k(θ) is the posterior based on information set k. In situations where decisions can be made at the consumer level rather than at the aggregate level, information set valuation can be achieved within a hierarchical model via different predictive distributions of consumer preferences based on alternative information sets (in Section 3, we will provide a full development of this idea).
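A toy calculation (with a quadratic "profit" function of our own choosing, picked because its optimum is analytic) shows how this comparison works: under π(x|θ) = −(x − θ)², the attainable expected profit under posterior p_k is −Var_k(θ), so the tighter information set is monetarily more valuable.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two "posteriors" for the same parameter: A is based on a richer
# information set (tighter), B on a poorer one (more diffuse).
theta_A = rng.normal(2.0, 0.2, size=10000)  # posterior under info set A
theta_B = rng.normal(2.0, 1.0, size=10000)  # posterior under info set B

def attainable_profit(draws):
    # max_x E[-(x - theta)^2] over a grid of actions x
    xs = np.linspace(-2.0, 6.0, 801)
    exp_profit = [-(x - draws) @ (x - draws) / len(draws) for x in xs]
    return max(exp_profit)

pi_A = attainable_profit(theta_A)
pi_B = attainable_profit(theta_B)

# Information set A supports a strictly higher attainable expected profit
assert pi_A > pi_B
# ... and the attainable profit is (approximately) minus the posterior variance
assert abs(pi_A + theta_A.var()) < 1e-3
```

The difference pi_A − pi_B is a direct monetary valuation of information set A over B, which is exactly what mean squared error alone cannot deliver.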

2.7 Non-likelihood-based approaches
2.7.1 Method of moments approaches
As indicated above, the original appeal of Generalized Method of Moments (GMM) methods is that they use only a minimal set of assumptions consistent with the predictions of economic theory. However, when applied in marketing and demand applications, the method of moments is often used simply as a convenient way of estimating the parameters of a demand model. Given a parametric demand model, there is a set of moment conditions which can identify the parameters of the model. These moment conditions can be used to define a method of moments estimator even for a fully parametric model. The method of moments approach is often chosen to avoid deriving the likelihood of the data and the associated Jacobians. While this is a genuine convenience, the method of moments approach does not specify which set of moments should be used, and there is often an infinite number of possible sets of moment conditions, any one of which is sufficient to identify and estimate the model parameters. The problem, of course, is that the method of moments provides little guidance as to which set of moments to use. The most efficient choice is to use the score function (the gradient of the expected log-likelihood) to define the moment conditions. This means, of course, that one can only approach the asymptotic efficiency of the maximum likelihood estimator.

In other situations, the method of moments approach is used to estimate a model which is only partially specified. That is, most parts of the model have a specific parametric form and specific distributional assumptions, but investigators purport to be reluctant to fully specify other parts of the model. We do not understand why it is defensible to make full parametric assumptions about part of a model but not others when there is no economic theory underpinning any of the parametric assumptions made.
GMM advocates would argue that fewer assumptions are always better than more. The problem, then, is which parts of the model are designated for specific parametric assumptions and which parts are not. The utility of the GMM approach must be judged relative to the arguments made by the investigator in defense of a particular choice of which part of the model is left unspecified. Frequently, we see no arguments of this sort and, therefore, we conclude that the method of moments procedure was chosen primarily for reasons of convenience.


The aggregate share model of Berry et al. (1995) provides a good example of this approach. The starting point for the BLP approach is to devise a model for aggregate share data that is consistent with valid demand models postulated at the individual level. For example, it is possible to take the standard multinomial logit model as the model governing consumer choice. In a market with a very large number of consumers, the market shares are the expected probabilities of purchase, which would be derived by integrating the individual model over the distribution of heterogeneity. The problem is that, with a continuum of consumers, all of the choice model randomness would be averaged out and the market shares would be a deterministic function of the included choice model covariates. To overcome this problem, Berry et al. (1995) introduced an additional error term into consumer level utility which reflects a market-wide unobservable. For their model, the utility of brand j for consumer i in time period t is given by

U_ijt = X_jt θ^i + η_jt + ε_ijt

where X_jt is a vector of brand attributes, θ^i is a k × 1 vector of consumer-specific coefficients, η_jt is an unobservable common to all consumers, and ε_ijt is the standard idiosyncratic shock (i.i.d. extreme value type I). If we normalize the utility of the outside good to zero, then market shares (denoted by s_jt) are obtained by integrating the multinomial logit model over a distribution of consumer parameters, f(θ^i|δ), where δ is the vector of hyper-parameters which governs the distribution of heterogeneity:

s_jt = ∫ [exp(X_jt θ^i + η_jt) / (1 + ∑_{k=1}^{J} exp(X_kt θ^i + η_kt))] f(θ^i|δ) dθ^i
    = ∫ s_ijt(θ^i|X_t, η_t) f(θ^i|δ) dθ^i

While it is not necessary to assume that consumer parameters are normally distributed, most applications assume a normal distribution. In some cases, difficulties in estimating the parameters of the mixing distribution force investigators to further restrict the covariance matrix of the normal distribution to a diagonal matrix (see Jiang et al., 2009). Assume that θ^i ∼ N(θ̄, Σ); then the aggregate shares can be expressed as a function of the aggregate shocks and the preference distribution parameters:

s_jt = ∫ [exp(X_jt θ^i + η_jt) / (1 + ∑_{k=1}^{J} exp(X_kt θ^i + η_kt))] φ(θ^i|θ̄, Σ) dθ^i = h(η_t|X_t, θ̄, Σ)   (11)

where η_t is the J × 1 vector of common shocks. If we make an additional distributional assumption regarding the aggregate shock, η_t, we can derive the likelihood. Given that we have already made specific assumptions regarding the form of the utility function, the distribution of the idiosyncratic choice errors, and the distribution of heterogeneity, this does not seem particularly restrictive. However, the recent literature on GMM methods for aggregate share models




does emphasize the lack of distributional assumptions regarding the aggregate shock. In theory, the GMM estimator should be robust to autocorrelated and heteroskedastic errors of unknown form. We will assume that the aggregate shock is i.i.d. across both products and time periods and follows a normal distribution, η_jt ∼ N(0, τ²). The normal distribution assumption is not critical to the derivation of the likelihood; however, as Bayesians we must make some specific parametric assumptions. Jiang et al. (2009) propose a Bayes estimator based on a normal likelihood and document that this estimator has excellent sampling properties even in the presence of misspecification and, in all cases considered, has better sampling properties than a GMM approach (see Chen and Yang, 2007 and Musalem et al., 2009 for other Bayesian approaches). The joint density of the shares at "time" t (in some applications of aggregate share models, shares are observed over time for one market; in others, shares are observed for a cross-section of markets, in which case the "t" index would index markets) can be obtained by using standard change of variable arguments:

π(s_1t, . . . , s_Jt|X, θ̄, Σ, τ²) = φ(h⁻¹(s_1t, . . . , s_Jt|X, θ̄, Σ)|0, τ² I_J) |J(η→s)|⁻¹
  = φ(h⁻¹(s_1t, . . . , s_Jt|X, θ̄, Σ)|0, τ² I_J) |J(s→η)|   (12)

φ(·) is the multivariate normal density. The Jacobian is given by

|J(s→η)| = |∂s_j/∂η_k|   (13)

∂s_j/∂η_k = ∫ −s_ijt(θ^i) s_ikt(θ^i) φ(θ^i|θ̄, Σ) dθ^i,  k ≠ j
∂s_j/∂η_j = ∫ s_ijt(θ^i)(1 − s_ijt(θ^i)) φ(θ^i|θ̄, Σ) dθ^i   (14)
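The "expected share" integral in (11) is typically evaluated by direct simulation. A minimal sketch for a toy market (all numbers ours) averages individual logit probabilities over draws from the normal heterogeneity distribution:

```python
import numpy as np

rng = np.random.default_rng(5)

# Monte Carlo evaluation of expected shares for a toy market: J = 3 inside
# goods, one attribute, heterogeneity theta_i ~ N(theta_bar, Sigma).
J, R = 3, 20000
X = np.array([[1.0], [0.5], [-0.5]])   # J x 1 attribute matrix
eta = np.array([0.2, -0.1, 0.0])       # market-wide shocks eta_jt
theta_bar, Sigma = np.array([1.0]), np.array([[0.25]])

theta_draws = rng.multivariate_normal(theta_bar, Sigma, size=R)  # R x 1

# Individual logit probabilities, then average over heterogeneity draws
v = theta_draws @ X.T + eta                           # R x J utilities
ev = np.exp(v)
probs = ev / (1.0 + ev.sum(axis=1, keepdims=True))    # outside good = 0
shares = probs.mean(axis=0)                           # simulated s_jt

assert shares.shape == (J,)
assert np.all(shares > 0) and shares.sum() < 1.0      # outside share remains
```

The number of draws R controls the simulation error in the share integrals, which is precisely the error source the text discusses next.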

It should be noted that, given the observed shares, the Jacobian is a function of the preference distribution parameters only (see Jiang et al., 2009 for details). To evaluate the likelihood function based on (12), we must compute the h⁻¹ function and evaluate the Jacobian. The share inversion function can be evaluated using the iterative method of BLP (see Berry et al., 1995). Both the Jacobian and the share inversion require a method for approximating the integrals required to compute the "expected share" as in (11). Typically, this is done by direct simulation; that is, by averaging over draws from the normal distribution of consumer level parameters. It has been noted that GMM methods can be sensitive to simulation error in the evaluation of the integral as well as to errors in computing the share inversion. Since the number of integral estimates and share inversions is of the order of magnitude of the number of likelihood or GMM criterion evaluations, it would be desirable, from a strictly numerical point of view, that the inference procedure exhibit little sensitivity to the number of iterations of the share inversion contraction or the number of simulation draws used in the integral estimates. Our experience is that Bayesian methods that use stochastic search as opposed to optimization are far less sensitive to


these numerical errors. For example, Jiang et al. (2009) show that the sampling properties of Bayes estimates are virtually identical when 50 or 200 simulation draws are used in the approximation of the share integrals; this is not true of GMM estimates.11 In summary, the method of moments approach has inferior sampling properties to a likelihood-based approach, and the literature has not fully explored the efficiency losses of using method of moments procedures. The fundamental problem is that the set of moment conditions is arbitrary and there is little guidance as to how to choose the most efficient set of moment conditions.12
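The iterative BLP share inversion referred to above repeats δ ← δ + log(s_obs) − log(s(δ)) until the simulated shares match the observed shares. A minimal sketch, with toy numbers of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(6)

# BLP contraction for the share inversion h^{-1} in a toy market with
# J = 3 inside goods and simulated normal heterogeneity.
J, R = 3, 5000
X = np.array([[1.0], [0.5], [-0.5]])
theta_draws = rng.normal(1.0, 0.5, size=(R, 1))  # heterogeneity draws
mu = theta_draws @ X.T                           # R x J individual parts

def model_shares(delta):
    ev = np.exp(delta + mu)
    return (ev / (1.0 + ev.sum(axis=1, keepdims=True))).mean(axis=0)

s_obs = np.array([0.3, 0.2, 0.1])                # observed shares

delta = np.zeros(J)
for _ in range(1000):
    new = delta + np.log(s_obs) - np.log(model_shares(delta))
    if np.max(np.abs(new - delta)) < 1e-10:
        break
    delta = new

# At the fixed point, the model reproduces the observed shares
assert np.allclose(model_shares(delta), s_obs, atol=1e-8)
```

Because the same fixed set of heterogeneity draws is reused at every iteration, the contraction converges cleanly; the sensitivity issues in the text arise when the number of draws or contraction iterations is cut too aggressively inside an outer optimization loop.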

2.7.2 Ad hoc approaches
Any function of the data can be proposed as an estimator. However, unless that function is suggested or derived from a general approach to inference with established properties, there is no guarantee that the proposed estimator will have favorable properties. For example, any Bayes estimator with non-dogmatic priors is consistent, as are MLE and method of moments estimators under relatively weak conditions. Estimators outside these classes (Bayes, MLE, and MM) we term "ad hoc" estimators, as there are no general results establishing the validity of the estimation procedure. Therefore, the burden is on the investigator proposing an estimator not based on established principles to demonstrate, at a minimum, that the estimator is consistent. Unfortunately, pure simulation studies cannot establish consistency. It is well known that various biased estimators (for example, Bayes) can have very favorable finite sample properties by exploiting the bias-variance trade-off. However, as the amount of sample information becomes greater in large samples, all statistical procedures should reduce bias. The fact that a procedure can be constructed to do well by some criterion for a few parameter settings does not insure this. This is the value of theoretical analysis. Unfortunately, there are examples in the marketing literature (see Chapter 3 on conjoint methods) of procedures that are not derived from an inference paradigm that insures consistency. That is not to say that these procedures are inconsistent, but rather that their consistency has not been established. Our view is that consistency is a necessary condition for admitting an estimation procedure to further evaluation. It may well be that a procedure offers superior (or inferior) performance to existing methods, but this judgment must wait until this minimal property is established.
Some of the suggestions in the marketing literature are based purely on an optimization method, without establishing that the criterion being optimized is derived from a valid probability model. Thus, establishing consistency for these procedures is apt to be difficult. In proposing new procedures, investigators should be well aware of the complete class theorem, which states that all admissible estimators are Bayes estimators. In

11 It is possible to lessen the sensitivity of the method of moments approach to numerical inversion error (Dubé et al., 2012).
12 The GMM literature does have results about the asymptotically optimal moment conditions but, for many models, it is impractical to derive the optimal set of moment conditions.



CHAPTER 2 Inference for marketing decisions

other words, it is not possible to dominate a Bayes estimator in finite samples for a specific and well-defined model (likelihood). Thus, procedures which are designed to be competitive with Bayes estimators must be based on robustness considerations (that is procedures that make fewer distributional assumptions in hopes of remaining consistent across a broader class of models).

2.8 Evaluating models
Given the wide variety of models, as well as methods for estimating them, a methodology for model comparison is essential. One approach is to view model choice as a special case of Bayesian decision theory. This leads the investigator to compute the posterior probability of a model,

p(M_i | y) = p(y | M_i) p(M_i) / p(y)    (15)

where M_i denotes model i. Thus, Bayesian model selection is based on the "marginal" likelihood, p(y | M_i), and the prior model probability. We can simply select the model with the highest value of the numerator of (15), or we can average predictions across models using these posterior probabilities. The practical difficulty with the Bayesian approach to model selection is that the marginal likelihood of the data must be computed:

p(y | M_i) = ∫ p(y | θ, M_i) p(θ | M_i) dθ    (16)

This integral is difficult to evaluate using only the MCMC draws that are used to perform posterior inferences regarding the model parameters (see, for example, the discussion in Chapter 6 of Rossi et al., 2005). Not only are these integrals difficult to evaluate numerically, but the results are highly sensitive to the choice of prior. Note also that proper priors are required for unbounded likelihoods in order to obtain convergence of the integral which defines the marginal likelihood.

We can view the marginal likelihood as the expected value of the likelihood taken over the prior parameter distribution. If the prior is very diffuse (or dispersed), then the expected likelihood can be small. Thus, the relative diffusion of the priors for different models must be considered when computing posterior model probabilities. While this can be done, it often involves a considerable amount of effort beyond that required to perform inferences conditional on a model specification. However, when properly done, the Bayesian approach to model selection affords a natural penalty for over-parameterized models (the asymptotic approximation known as the Schwarz approximation shows the explicit dependence of the posterior probability on model size; however, the Schwarz approximation is notoriously inaccurate and, therefore, not of great practical value).
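Since the marginal likelihood is the expected likelihood under the prior, its sensitivity to prior diffusion can be seen in a toy calculation. The sketch below is a hypothetical conjugate normal example with made-up values; `log_marginal_mc` is our own helper, estimating log p(y) by naive Monte Carlo averaging of the likelihood over prior draws (a crude estimator, used here only for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y_i ~ N(theta, 1) with a N(0, tau^2) prior on theta.
y = rng.normal(1.0, 1.0, size=50)

def log_marginal_mc(y, tau, R=50_000):
    """Naive Monte Carlo estimate of log p(y): average the likelihood
    over R draws from the prior (the 'expected likelihood' view)."""
    theta = rng.normal(0.0, tau, size=R)                        # prior draws
    ll = np.sum(-0.5 * np.log(2 * np.pi)
                - 0.5 * (y[None, :] - theta[:, None]) ** 2, axis=1)
    m = ll.max()
    return m + np.log(np.mean(np.exp(ll - m)))                  # log-sum-exp

# A more diffuse prior spreads mass over implausible parameter values and
# lowers the expected likelihood -- the automatic Occam penalty in (16).
print(log_marginal_mc(y, tau=1.0))    # larger
print(log_marginal_mc(y, tau=100.0))  # smaller
```

This also illustrates why posterior model probabilities are meaningless under arbitrarily diffuse priors: the diffusion itself drives the comparison.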
Given the difficulties in implementing a true decision theoretic approach to model selection, investigators have searched for other methods which are more easily implemented while offering some of the benefits of the more formal approach. Investigators are aware that, given the "greedy" algorithms that dominate estimation, in-sample measures of model fit will understate true model error in prediction or classification. For this reason, various predictive validation exercises have become very popular in both the marketing and Machine Learning literatures. The standard predictive validation exercise involves dividing the sample into two data sets (typically by random splits): (1) an estimation dataset and (2) a validation dataset. Measures of model performance such as MSE are computed by fitting the model to the estimation dataset and predicting out-of-sample on the validation data. This procedure removes the over-fitting biases of in-sample measures of MSE. Estimation procedures which have favorable bias-variance trade-offs will perform well by the criteria of predictive validation.

The need to make arbitrary divisions of the data into estimation and validation datasets can be removed by using a k-fold cross-validation procedure. In k-fold cross-validation, the data is divided randomly into k "folds" or subsets. Each fold is reserved in turn for validation, and the model is fit on the other k − 1 folds. The result is averaged over many draws of the fold classification and can be shown to produce an unbiased estimate of the model prediction error criterion.

While these validation procedures are useful in discriminating between various estimation procedures and models, some caution should be exercised in their application to marketing problems. In marketing, our goal is to optimize policies for the selection of marketing variables, and we must consider models that are policy invariant. If our models are not policy invariant, then we may find that models which perform very well in pure predictive validation exercises make poor predictions for optimal policy determination. In Section 4, we will consider the problem of true causal inference. The causal function linking marketing variables to outcomes such as sales can be policy invariant. The need for causal inference may also motivate us to consider other estimation procedures, and these are explored in Section 5.
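A minimal sketch of the k-fold procedure, using OLS on simulated data (all values hypothetical; `kfold_mse` is our own helper, not from any package):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: linear outcome with unit-variance noise.
n = 200
X = rng.normal(size=(n, 3))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(scale=1.0, size=n)

def kfold_mse(X, y, k=5):
    """k-fold cross-validation estimate of out-of-sample MSE for OLS.
    Each fold is held out once; the model is fit on the other k-1 folds."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)
        b, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs.append(np.mean((y[f] - X[f] @ b) ** 2))
    return np.mean(errs)

print(kfold_mse(X, y))  # close to the true error variance of 1.0
```

Averaging over repeated random fold assignments (calling `kfold_mse` several times) reduces the arbitrariness of any single split.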

3 Heterogeneity
A fundamental premise of marketing is that customers differ both in their preferences for product features and in their sensitivities to marketing variables. Observable characteristics such as psycho-demographics can only be expected to explain a limited portion of the variation in tastes and responsiveness. Disaggregate data are required in order to measure customer heterogeneity.13 Typically, disaggregate data are obtained for a relatively large number of cross-sectional units but with a relatively short history of activity. In the consumer packaged goods industry, store level panel data are common, especially for retailers. There is also increased availability of customer level purchase data from specialized panels of consumers or from detailed purchase histories assembled from firm records. As the level of aggregation decreases, the discrete features of sales data become magnified. The short time span of panel data, coupled with the comparatively sparse information in discrete data, means that we are unlikely to have a great deal of sample information about any one cross-sectional unit. If inference about unit-level parameters is important, then Bayesian approaches will be important. Moreover, the prior will matter, and there must be reasonable procedures for assessing informative priors.

Increasingly, firms want to make decentralized marketing decisions that exploit more detailed disaggregate information. Examples include store or zone level pricing, targeted electronic couponing, and sales force activities in the pharmaceutical industry. All of these examples involve the allocation of marketing resources across consumers or local markets and the creation of possibly customized marketing treatments for each unit. In digital advertising, the ability to target an advertising message at a very specific group of consumers, defined by both observable and behavioral measures, makes the modeling of heterogeneity even more important. The demands of marketing applications contrast markedly with applications in micro-economics, where the average response to a variable is often deemed more important. However, even the evaluation of policies which are uniform across some set of consumers will require information about the distribution of preferences in order to evaluate the effect on social welfare. In this section, we will review approaches to modeling heterogeneity, particularly the Bayesian hierarchical modeling approach. We will also discuss some of the challenges that truly huge panel datasets pose for Bayesian approaches.

13 Some argue that, with specific functional forms, the heterogeneity distribution can be determined from aggregate data. Fundamentally, the functional forms of response models and the distribution of heterogeneity are confounded in aggregate data.

3.1 Fixed and random effects
A generic formulation of the panel data inference problem is that we observe a cross-section of H units over "time." The panel does not have to be balanced: each unit can have a different number of observations, and for some units the number of observations may be very small. In many marketing contexts, a new "unit" may be "born" with 0 observations, but we still have to make predictions for this unit. For example, a pharmaceutical company has a very expensive direct sales force that calls on many key "accounts," which in this industry are defined as prescribing physicians. There may be some physicians with a long history of interaction with the company and others who are new "accounts" with no history of interaction. Any firm that acquires new customers over time faces the same problem. Many standard econometric methods simply have no answer to this problem.

Let p(y_h | θ_h) be the model we postulate at the unit level. If the units are consumers or households, this likelihood could be a basic demand model, for example. Our goal is to make inferences regarding the collection {θ_h}. The "brute force" solution would be to conduct separate likelihood-based analyses for each cross-sectional unit (either Bayes or non-Bayes). The problem is that many units may have so little information (think of a singular X matrix, or choice histories in which a unit did not purchase all of the choice alternatives) that the unit-level likelihood does not have a maximum. For this reason, separate analyses are not practical. Instead, the set of coefficients allowed to be unit-specific is limited. The classic example is what is often called a "Fixed Effects" estimator, which has its origins in a linear regression model with unit-specific intercepts and common slope coefficients:

y_ht = α_h + β′x_ht + ε_ht    (17)

Here the assumption is that there is a common "effect," or set of coefficients, on the x variables but individual-specific intercepts. A common interpretation of this set-up is that there is some sort of unobservable variable(s) which influences the outcome, y, and which varies across the units, h. With panel data, we can simply label the effect of these unobservables as time-invariant intercepts and estimate the intercepts. Sometimes econometricians will characterize this as solving the "selection on unobservables" problem via the introduction of fixed effects. Advocates of this approach will explain that no assumptions are made regarding the distribution of these unobservables across units, nor are the unobservables required to be independent of the included x variables. For a linear model, estimation of (17) is a simple matter of concentrating the {α_h} out of the likelihood function for the panel data set. This can be done either by subtracting the unit-level means of all variables or by differencing over time:

y_ht − ȳ_h· = β′(x_ht − x̄_h·) + (ε_ht − ε̄_h·)    (18)
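The demeaning estimator just described can be sketched on simulated data (hypothetical values throughout); the example also shows why the within transformation matters when the unit intercepts are correlated with x:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical balanced panel: H units, T periods, unit-specific intercepts
# that are correlated with x (the case fixed effects is designed to handle).
H, T, beta = 100, 8, 1.5
alpha = rng.normal(size=H)                    # unit intercepts
x = alpha[:, None] + rng.normal(size=(H, T))  # x correlated with alpha
y = alpha[:, None] + beta * x + rng.normal(size=(H, T))

# Within transformation: demean y and x by unit, concentrating out alpha_h.
x_d = x - x.mean(axis=1, keepdims=True)
y_d = y - y.mean(axis=1, keepdims=True)
beta_fe = (x_d * y_d).sum() / (x_d ** 2).sum()

# Pooled OLS (no unit intercepts) is biased upward here because it
# attributes the alpha-x correlation to the slope.
beta_pooled = (x * y).sum() / (x ** 2).sum()
print(beta_fe, beta_pooled)  # beta_fe near 1.5; beta_pooled larger
```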


In most cases, there is no direct estimation of the intercepts; instead, the demeaning operation removes part of the variation in both the independent and dependent variables from the estimation of the β terms. It is very common for applied econometricians to use hundreds if not thousands of fixed effect terms in the estimation of linear panel data models. The goal is to isolate or control for unobservables that might compromise clean estimation of the common β coefficients. Typically, a type of sensitivity analysis is done in which groups of fixed effect terms are entered into or removed from the specification and changes in the β estimates are noted.

Of course, this approach to heterogeneity does not help if the goal is to make predictions for a new unit with no data, or for a unit with so little data that even the fixed effect intercept estimator is not defined. The reason is that there is no common structure assumed for the intercept terms. Each panel unit is unique, and there is no source of commonality or similarity across units. This problem does not normally trouble econometricians, who are concerned more with "effect estimation," or estimating β, than with prediction for new units or units with insufficient information. In marketing applications, we have no choice – we must make predictions for all units in the analysis.

The problems with the fixed effects approach to heterogeneity do not stop with prediction for units with insufficient data. The basic idea that unobservables have additive effects and simply change the intercept, together with the assumption of a linear mean function, allows the econometrician to finesse the problem of estimating the fixed effects by concentrating them out of the likelihood function. That is, there is a transformation of the data such that the likelihood function can be factored into two terms – one term does not involve β, and the other term involves the data only through the transformation. Thus, we lose no information by "demeaning" or differencing the data. This idea does not extend to non-linear models such as discrete choice or non-linear demand models. This problem is so acute that many applied econometricians fit "linear probability" models to discrete or binary data in order to use the convenience of the additive intercept fixed effects approach, even though they know that a linear probability model is very unlikely to fit their data well and must only be regarded as an approximation to the true conditional mean function.

This problem with the fixed effects approach does not apply to the random coefficients model. In the random coefficient model, we typically do not distinguish between the intercept and slope parameters, and simply consider all model coefficients to be random (iid) draws from some "super-population" or distribution. That is, we assume the following two-part model:

y_h ∼ p(y_h | θ_h)
θ_h ∼ p(θ_h | τ)    (19)

Here the second equation is the random coefficient model. Almost without exception, the random coefficient model is taken to be multivariate normal, θ_h ∼ N(θ̄, Σ). It is also possible to parameterize the mean of the random coefficient model by observable characteristics of each cross-sectional unit, i.e. θ̄_h = Δ′z_h, where z_h is a vector of unit characteristics. However, there is still the possibility that there are unobservable unit characteristics that influence the x explanatory variables in each unit response model. The random coefficient model assumes independence between the random effects and the levels of the unit x variables, conditional on z. Many regard this assumption as a drawback of the random coefficient model and point out that a fixed effects specification does not require it. Manchanda et al. (2004) explicitly model the joint distribution of random effects and unit-level x variables as one possible approach to relaxing the conditional independence assumption used in standard random coefficient models.

If we start with a linear model, then the random coefficient model can be expressed as a linear regression model with a special structured covariance matrix, as in

y_ht = θ̄′x_ht + (ε_ht + v_h′x_ht)    (20)


Here θ_h = θ̄ + v_h, with v_h ∼ N(0, Σ). Thus, we can regard the regression model as estimating the mean of the random coefficient distribution, with the covariance matrix Σ inferred from the likelihood of the correlated and heteroskedastic error terms. This error term structure motivates the use of "cluster" covariance matrix estimators. Of course, what we are doing by substituting the random coefficient model into the unit-level regression is integrating out the {θ_h} parameters. In the more general non-linear setting, those who insist on maximum likelihood would regard the random coefficient model as part of the model and integrate or "marginalize" out the unit-level parameters:

ℓ(τ) = ∏_{h=1}^H ∫ p(y_h | θ_h) p(θ_h | τ) dθ_h    (21)

Some call random coefficient models "mixture" models, since the likelihood is a mixture of the unit-level distribution over an assumed random coefficient distribution. Maximum likelihood estimation of random coefficient models requires an approximation to the integrals in the likelihood function for τ.14 Investigators find that they must restrict the dimension of this integral (by assuming that only part of the coefficient vector is random) in order to obtain reasonable results. As we will see below, Bayesian procedures finesse this problem via data augmentation. The set of random coefficients is "augmented" into the parameter space, exploiting the fact that, given the random coefficients, inference regarding τ is often easily accomplished.
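The simulated-likelihood idea can be sketched for a binary logit with normal random coefficients: each unit's integral in (21) is replaced by an average over R draws from the mixing distribution. The panel below is entirely hypothetical, and `sim_loglik` is our own helper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical panel of binary logit choices with normal random coefficients.
H, T, k = 50, 10, 2
theta_bar = np.array([1.0, -1.0])
L = 0.5 * np.eye(k)                                  # Cholesky root of Sigma
theta_h = theta_bar + rng.normal(size=(H, k)) @ L.T  # true unit coefficients
X = rng.normal(size=(H, T, k))
u = np.einsum("htk,hk->ht", X, theta_h)
y = (rng.uniform(size=(H, T)) < 1 / (1 + np.exp(-u))).astype(int)

def sim_loglik(theta_bar, L, R=500):
    """Approximate the mixture likelihood (21) by Monte Carlo: for each unit,
    average the conditional likelihood over R draws from N(theta_bar, LL')."""
    draws = theta_bar + rng.normal(size=(R, k)) @ L.T
    ll = 0.0
    for h in range(H):
        p = 1 / (1 + np.exp(-(X[h] @ draws.T)))      # T x R choice probs
        lik = np.prod(np.where(y[h][:, None] == 1, p, 1 - p), axis=0)
        ll += np.log(lik.mean())                     # MC average over mixing dist.
    return ll

print(sim_loglik(theta_bar, L) > sim_loglik(theta_bar + 3.0, L))  # True
```

Note the simulated likelihood is biased for a finite number of draws R; the text's point is that the Bayesian data augmentation approach avoids this integral entirely.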

Mixed logit models
Econometricians are keenly aware of the limitations of the multinomial logit model as a representation of the demand for a set of products. The multinomial logit model has only one price coefficient, and thus the entire matrix of cross-price elasticities must be generated by that one parameter. This problem is often known as the IIA property of logit models. As is shown in Chapter 1, one way of interpreting the logit model with linear prices is as demand derived from linear utility with extreme value random utility errors. Linear utility assumes that all products are perfect substitutes. The addition of the random utility error means that choice alternatives are no longer exact perfect substitutes, but the usual iid extreme value assumption means that all products have substitutability differences that can be expressed as a function of market share (or choice probability) alone.

However, applied choice modelers are quick to point out that if aggregate demand is formed as the integral of logits over a normal distribution of preferences, then this aggregate demand function no longer has the IIA property. Our own experience is that, while this is certainly true as a mathematical statement, aggregate preferences often exhibit elasticity structures which are very close to those implied by IIA; high correlations in the mixing distribution are required to obtain large deviations from IIA in the aggregate demand system. Many of the claims regarding the ability of mixed logit to approximate arbitrary aggregate demand systems stem from a misreading of McFadden and Train (2000). A superficial reading of this article might suggest that mixed logits can approximate any demand structure, but this is only true if explanatory variables such as price are allowed to enter the choice probabilities in arbitrary non-linear ways. In some sense, it must always be true that any demand model can be approximated by arbitrary functions of price. One should not conclude that mixtures of logits with linear price terms can be used to approximate arbitrary demand structures.

14 This situation seems to be a clear case where simulated MLE might be used. The integral in (21) is approximated by a set of R draws from the normal distribution.
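A small simulation (hypothetical prices and mixing-distribution values; `agg_shares` is our own helper) illustrates the point: integrating logit shares over a normal distribution of price coefficients breaks exact IIA, though the deviation can be modest.

```python
import numpy as np

rng = np.random.default_rng(4)

def agg_shares(prices, beta_mean=-2.0, beta_sd=1.5, R=200_000):
    """Aggregate shares for a 3-alternative logit with utility b * price,
    integrated by Monte Carlo over a normal distribution of b."""
    b = rng.normal(beta_mean, beta_sd, size=R)
    v = np.outer(b, prices)                           # R x 3 utilities
    p = np.exp(v) / np.exp(v).sum(axis=1, keepdims=True)
    return p.mean(axis=0)

s1 = agg_shares(np.array([1.0, 1.2, 1.4]))
s2 = agg_shares(np.array([1.0, 1.2, 0.9]))  # cut the price of alternative 3

# Under IIA, the share ratio of alternatives 1 and 2 would be unchanged by
# alternative 3's price. With mixing, the cheaper alternative 3 draws
# disproportionately from price-sensitive consumers, so the ratio moves.
print(s1[0] / s1[1], s2[0] / s2[1])
```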

3.2 Bayesian approach and hierarchical models
3.2.1 A generic hierarchical approach
Consider a cross-section of H units, each with likelihood p(y_h | θ_h), h = 1, …, H, where θ_h is a k × 1 vector. y_h generically represents the data on the hth unit, and θ_h is a vector of unit-level parameters. While there is no restriction on the model for each unit, common examples include a multinomial logit or a standard regression model at the unit level. The parameter space can be very large and consists of the collection of unit-level parameters, {θ_h, h = 1, …, H}. Our goal will be to conduct a posterior analysis of this joint set of parameters. It is common to assume that units are independent conditional on θ_h. More generally, if the units are exchangeable (see Bernardo and Smith, 1994), then we require a prior distribution which is the same no matter what the ordering of the units is. In this case, we can write down the posterior for the panel data as

p(θ_1, …, θ_H | y_1, …, y_H) ∝ [∏_{h=1}^H p(y_h | θ_h)] p(θ_1, …, θ_H | τ)    (22)

where τ is a vector of prior parameters. The prior assessment problem posed by this model is daunting, as it requires specifying a potentially very high-dimensional joint distribution. One simplification would be to assume that the unit-level parameters are independent and identically distributed, a priori. In this case, the posterior factors and inference can be conducted independently for each of the H units:

p(θ_1, …, θ_H | y_1, …, y_H) ∝ ∏_{h=1}^H p(y_h | θ_h) p(θ_h | τ)    (23)



Given τ, the posterior in (23) is the Bayesian analogue of the classical fixed effects estimation approach. However, there are still advantages to the Bayesian approach in that an informative prior can be used. The informative prior will impart important shrinkage properties to Bayes estimators. In situations in which the unit-level likelihood may not be identified, a proper prior will regularize the problem and produce sensible inferences. The real problem is a practical one in that some guidance must be provided for assessing the prior parameters, τ.

The specification of the conditionally independent prior can be very important due to the scarcity of data for many of the cross-sectional units. Both the form of the prior and the values of the hyper-parameters are important and can have pronounced effects on the unit-level inferences. For example, consider a normal prior, θ_h ∼ N(θ̄, V_θ). Just the use of a normal prior distribution is highly informative, regardless of the values of the hyper-parameters. The thin tails of the prior distribution will reduce the influence of the likelihood when the likelihood is centered far away from the prior. For this reason, the choice of the normal prior is far from innocuous. For many applications, the shrinkage of outliers is a desirable feature of the normal prior. The prior results in very stable estimates, but at the same time it might mask or attenuate differences between consumers. It will, therefore, be important to consider more flexible priors.

In other situations, the normal prior may be inappropriate. Consider the problem of specifying a random coefficient distribution for the price coefficients in a demand model. Here we expect that the population distribution puts mass only on negative values and that the distribution is highly skewed, possibly with a fat left tail. The normal random coefficient distribution would not be appropriate. It is a simple matter to reparameterize the price coefficient as β_p = −exp(β_p*), where we assume β_p* is normal. But the general point that the normal distribution is restrictive is important to note. That is why these models have been enlarged to consider a finite or even infinite mixture of normals, which can flexibly approximate any continuous distribution.

If we accept the normal form of the prior as reasonable, a method for assessing the prior hyper-parameters is required (Allenby and Rossi, 1999). It may be desirable to adapt the shrinkage induced by the use of an informative prior to the characteristics of the data for any particular cross-sectional unit as well as to the differences between units. Both the location and spread of the prior should be influenced by the data and our prior beliefs. For example, consider a cross-sectional unit with little information available. For this unit, the posterior should shrink toward some kind of "average" or representative unit. The amount of shrinkage should be influenced both by the amount of information available for this unit and by the amount of variation across units. A hierarchical model achieves this result by putting a prior on the common parameter, τ.
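The reparameterization β_p = −exp(β_p*) is easy to sketch (hypothetical hyper-parameter values): the implied distribution of the price coefficient is strictly negative and left-skewed.

```python
import numpy as np

rng = np.random.default_rng(5)

# Reparameterize the price coefficient as beta_p = -exp(beta_p*), with
# beta_p* normal, so the implied heterogeneity distribution is strictly
# negative and left-skewed (hypothetical hyper-parameter values).
beta_star = rng.normal(loc=0.5, scale=0.7, size=100_000)
beta_p = -np.exp(beta_star)

print(beta_p.max() < 0)                     # True: support is all negative
print(np.mean(beta_p), np.median(beta_p))   # mean below median: left skew
```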
The hierarchical approach is a model specified by a sequence of conditional distributions, starting with the likelihood and proceeding to a two-stage prior:

p(y_h | θ_h)
p(θ_h | τ)
p(τ | a)    (24)

The prior distribution on θ_h | τ is sometimes called the first-stage prior. In non-Bayesian applications, this is often called a random effect or random coefficient model and is regarded as part of the likelihood. The prior on τ completes the specification of a joint prior distribution on all model parameters and is often called the "second-stage" prior. Here a is a vector of prior hyper-parameters which must be assessed or chosen by the investigator.

p(θ_1, …, θ_H, τ | a) = p(θ_1, …, θ_H | τ) p(τ | a) = ∏_{h=1}^H p(θ_h | τ) p(τ | a)    (25)





One way of regarding the hierarchical model is simply as a device to induce a joint prior on the unit-level parameters; that is, we can integrate out τ to inspect the implied prior:

p(θ_1, …, θ_H | a) = ∫ ∏_{h=1}^H p(θ_h | τ) p(τ | a) dτ    (26)

It should be noted that, while the {θ_h} are independent conditional on τ, the implied joint prior can be highly dependent, particularly if the prior on τ is diffuse (note: it is sufficient that the prior on τ be proper in order for the hierarchical model to specify a valid joint distribution). To illustrate this, consider a linear model, θ_h = τ + v_h. Here τ acts as a common variance component, and the correlation between any two θ's is

Corr(θ_h, θ_k) = σ_τ² / (σ_τ² + σ_v²)

As the diffusion of the distribution of τ relative to v increases, this correlation tends toward one.

3.2.2 Adaptive shrinkage
The popularity of the hierarchical model stems from the improved parameter estimates that are made possible by the two-stage prior distribution. To understand why this is the case, consider a simplified situation in which each unit-level likelihood is approximately normal with mean θ̂_h (the unit-level MLE) and covariance matrix I_h⁻¹ (here we abstract from issues involving the existence of the MLE). If we have a normal prior, θ_h ∼ N(θ̄, V_θ), then conditional on the normal prior parameters the approximate posterior mean is given by

θ̃_h = (I_h + V_θ⁻¹)⁻¹ (I_h θ̂_h + V_θ⁻¹ θ̄)    (27)

This equation demonstrates the principle of shrinkage. The Bayes estimator is a compromise between the MLE (where the unit h likelihood is centered) and the prior mean. The weights in the average depend on the information content of the unit-level likelihood and the variance of the prior. As we have discussed above, shrinkage is what gives Bayes estimators such excellent sampling properties. The problem becomes: where should we shrink toward, and by how much? In other words, how do we assess the normal prior mean and variance-covariance matrix? The two-part prior in the hierarchical model means that the data will be used (in part) to assess these parameter values. The "mean" of the prior will be something like the mean of the θ_h parameters over units, and the variance will summarize the dispersion or extent of heterogeneity. This means that if all units are very similar (V_θ is small), then we will shrink a lot. In this sense, the hierarchical Bayes procedure has "adaptive shrinkage." Note also that, for any fixed amount of heterogeneity, units with a great deal of information regarding θ_h will not be shrunk much.
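The precision-weighted average above can be computed directly; the sketch below uses hypothetical values in which the unit has abundant information about the first coordinate of θ_h and very little about the second:

```python
import numpy as np

# Approximate posterior mean for one unit under a normal prior: a
# precision-weighted average of the unit MLE and the prior mean
# (hypothetical values for a 2-dimensional theta_h).
I_h = np.diag([20.0, 0.5])        # unit-level information: much data on
theta_hat = np.array([2.0, 2.0])  #   coordinate 1, little on coordinate 2
V_theta = np.eye(2)               # prior covariance (heterogeneity)
theta_bar = np.zeros(2)           # prior mean (the "average" unit)

V_inv = np.linalg.inv(V_theta)
theta_tilde = np.linalg.inv(I_h + V_inv) @ (I_h @ theta_hat + V_inv @ theta_bar)
print(theta_tilde)  # coordinate 1 stays near 2.0; coordinate 2 shrinks to 2/3
```

The well-measured coordinate is barely moved while the poorly-measured one is pulled strongly toward the prior mean, which is exactly the adaptive behavior described above.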


3.2.3 MCMC schemes
Given the independence of the units conditional on θ_h, all MCMC algorithms for hierarchical models will contain two basic groups of conditional distributions:

p(θ_h | y_h, τ),  h = 1, …, H
p(τ | {θ_h}, a)    (28)

As is well known, the second part of this scheme exploits the conditional independence of y_h and τ given {θ_h}. The first part of (28) depends on the form of the unit-level likelihood, while the second part depends on the form of the first-stage prior. Typically, the priors in the first and second stages are chosen to exploit some sort of conjugacy, and the {θ_h} are treated as "data" with respect to the second stage.
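A minimal two-block Gibbs sampler of this form, for a toy normal hierarchical model with known unit-level and first-stage variances and a flat second-stage prior (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy hierarchical model: y_ht ~ N(theta_h, 1), theta_h ~ N(tau, 1),
# flat prior on tau. Both conditionals in the two-block scheme are normal.
H, T, tau_true = 50, 5, 2.0
theta_true = rng.normal(tau_true, 1.0, size=H)
y = rng.normal(theta_true[:, None], 1.0, size=(H, T))
ybar = y.mean(axis=1)

tau = 0.0
tau_draws = []
for _ in range(2000):
    # Block 1: theta_h | y_h, tau -- units are conditionally independent,
    # so all H draws are made at once.
    post_var = 1.0 / (T + 1.0)
    theta = rng.normal(post_var * (T * ybar + tau), np.sqrt(post_var))
    # Block 2: tau | {theta_h} -- the theta_h act as "data" for the prior.
    tau = rng.normal(theta.mean(), np.sqrt(1.0 / H))
    tau_draws.append(tau)

print(np.mean(tau_draws[500:]))  # posterior mean of tau, near 2.0
```

The two-block structure is generic: only the form of each conditional changes as the unit-level likelihood and first-stage prior change (e.g. a Metropolis step replaces the first block for a logit likelihood).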

3.2.4 Fixed vs. random effects
In classical approaches, a distinction is made between a "fixed effects" specification, in which there are different parameters for every cross-sectional unit, and random effects models, in which the cross-sectional unit parameters are assumed to be draws from a super-population. Advocates of the fixed effects approach explain that it makes no assumption regarding the form of the distribution of the effects or their independence from the covariates included in the unit-level likelihood. The Bayesian analogue of the classical fixed effects model is an independence prior with no second-stage prior on the random effects parameters, as in (23). The Bayesian hierarchical model is the Bayesian analogue of a random effects model. The hierarchical model assumes that the cross-sectional units are exchangeable (possibly conditional on some observable variables). This means that a key distinction between the models (Bayesian or classical) is what sort of predictions can be made for a new cross-sectional unit. In either the classical or the Bayesian "fixed effects" approach, no predictions can be made about a new member of the cross-section, as there is no model linking units. Under the random effects view, all units are exchangeable, and the predictive distribution for the parameters of a new unit is given by

p(θ_h* | y_1, …, y_H) = ∫ p(θ_h* | τ) p(τ | y_1, …, y_H) dτ    (29)
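Given posterior draws of τ from an MCMC run, this predictive distribution is simulated by drawing θ* from the first-stage prior at each draw of τ. The sketch below fabricates the posterior draws of τ = (θ̄, V_θ) purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical posterior draws of tau = (theta_bar, V_theta) for a
# 2-dimensional theta; in practice these come from the Gibbs sampler.
R = 1000
theta_bar_draws = rng.normal(1.0, 0.1, size=(R, 2))
V_draws = np.array([np.diag(rng.uniform(0.5, 1.5, 2)) for _ in range(R)])

# For each posterior draw of tau, draw theta* from the first-stage prior;
# the collection of theta* draws represents the predictive distribution
# for a brand-new unit with no data.
theta_star = np.array([
    rng.multivariate_normal(m, V) for m, V in zip(theta_bar_draws, V_draws)
])
print(theta_star.mean(axis=0))  # centered near the posterior mean of theta_bar
```

The spread of `theta_star` reflects both heterogeneity (V_θ) and the remaining uncertainty about τ, which is exactly what the integral expresses.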


3.2.5 First stage priors
Normal prior
A straightforward model to implement is a normal first-stage prior with possible covariates:

θ_h = Δ′z_h + v_h,  v_h ∼ N(0, V_θ)    (30)

where z_h is a d × 1 vector of observable characteristics of the cross-sectional unit and Δ is a d × k matrix of coefficients. The specification in (30) allows the mean of each of the elements of θ_h to depend on the z vector. For ease of interpretation, we find it useful to subtract the mean and use an intercept:

z_h = (1, x_h − x̄)

In this formulation, the first row of Δ can be interpreted as the mean of θ_h. (30) specifies a multivariate regression model, and it is convenient, therefore, to use the conjugate prior for the multivariate regression model:

V_θ ∼ IW(V, ν)
δ = vec(Δ) | V_θ ∼ N(δ̄, V_θ ⊗ A⁻¹)    (31)

Here A is a d × d precision matrix, vec(Δ) stacks the columns of the matrix Δ, and ⊗ denotes the Kronecker product. This prior specification allows for direct one-for-one draws of the common parameters, δ and V_θ.

Mixture of normals prior While the normal distribution is flexible, there is no particular reason to assume a normal first-stage prior. For example, if the observed outcomes are choices among products, some of the coefficients might be brand specific intercepts. Heterogeneity in tastes for a product might be more likely to assume the form of clustering by brand. That is, we might find “clusters” of consumers who prefer specific brands over other brands. The distribution of tastes across consumers might then be multi-modal. We might want to shrink different groups of consumers in different ways or shrink to different group means. A multi-modal distribution will achieve this goal. For other coefficients such as a price sensitivity coefficient, we might expect a skewed distribution centered over negative values. Mixtures of multivariate normals are one way of achieving a great deal of flexibility (see, for example, Griffin et al., 2010 and the references therein). Multi-modal, thick-tailed, and skewed distributions are easily achieved from mixtures of a small number of normal components. For larger numbers of components, virtually any joint continuous distribution can be approximated. The mixture of normals model for the first-stage prior is given by θh =  zh + vh vh ∼ N (μind , ind )


ind ∼ MN (π) π is a K × 1 vector of multinomial probabilities. This is a latent version of a mixture of K normals model in which a multinomial mixture variable, denoted here by ind, is used. In the mixture of normal specification, we remove the intercept term from zh and allow vh to have a non-zero mean. This allows the normal mixture components to mix on the means as well as on scale, introducing more flexibility. As before, it is convenient to demean the variables in z. A standard set of conjugate priors can be used for the mixture probabilities and component parameters, coupled with a standard

3 Heterogeneity

conjugate prior on the Δ matrix:

δ = vec(Δ) ∼ N(δ̄, Aδ^−1)
π ∼ D(α)
μk ∼ N(μ̄, Σk ⊗ a^−1)
Σk ∼ IW(V, ν)   (33)
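To make the first-stage prior concrete, here is a small simulation sketch. All settings are hypothetical: a two-component mixture over a brand intercept and a price coefficient, with the Δ′zh covariate term set to zero (as when z contains only a demeaned intercept-free set of covariates).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-component mixture over (brand intercept, price coefficient):
# a brand-loyal segment and a price-sensitive segment.
pi = np.array([0.6, 0.4])                      # multinomial probabilities
mu = np.array([[2.0, -1.0], [-0.5, -3.0]])     # component means
Sigma = np.array([np.diag([0.5, 0.2]), np.diag([0.3, 0.8])])

H = 5000
ind = rng.choice(len(pi), size=H, p=pi)        # ind_h ~ MN(pi)
theta = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in ind])

# The implied marginal distribution of the brand intercept is bimodal and
# the price coefficient is skewed toward large negative values.
print(theta.mean(axis=0))
```

The draws mimic the clustering story above: each unit first draws a segment indicator, then a coefficient vector from that segment's normal component.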


Assessment of these conjugate priors is relatively straightforward for diffuse settings. Given that the θ vector can be of moderately large dimension (>5) and the θh parameters are not directly observed, some care must be exercised in the assessment of prior parameters. In particular, it is customary to assess the Dirichlet portion of the prior by using the interpretation that the K × 1 hyper-parameter vector, α, is an observed classification of a prior sample of size Σk αk into the K components. Typically, all elements of α are assessed to be equal. When a large number of components is used, the elements of α should be scaled down in order to avoid inadvertently specifying an informative prior with equal prior probabilities on a large number of components. We suggest a setting of αk = .5/K (see Rossi, 2014a, Chapter 1 for further discussion).

As in the single-component normal model, we can exploit the fact that, given the H × k matrix Θ whose rows are the θh′ values, and the standard conditionally conjugate priors in (33), the mixture of normals model in (32) is easily handled by a standard unconstrained Gibbs sampler, augmented to include the latent vector of component indicators (see Rossi et al., 2005, Section 5.5.1). The latent draws can be used for clustering as discussed below. We should note that any label-invariant quantity such as a density estimate or clustering is not affected by the "label-switching" identification problem (see Fruhwirth-Schnatter, 2006 for a discussion). In fact, the unconstrained Gibbs sampler is superior to various constrained approaches in terms of mixing.

A tremendous advantage of Bayesian methods when applied to mixtures of normals is that, with proper priors, Bayesian procedures do not overfit the data and provide reasonable and smooth density estimates. In order for a component to obtain appreciable posterior mass, there must be enough structure in the "data" to favor the component in terms of a Bayes factor.
As is standard in Bayesian procedures, the existence of a prior puts an implicit penalty on models with a larger number of components. It should also be noted that the prior for the mixture of normals puts positive probability on models with fewer than K components; in other words, this is really a prior on models of different dimensions. In practice, it is common for the posterior mass to be concentrated on a set of components of much smaller size than K. The posterior distribution of any ordinate of the joint (or marginal) densities of the mixture of normals can be constructed from the posterior draws of the component parameters and mixing probabilities. In particular, a Bayes estimate of a density ordinate



CHAPTER 2 Inference for marketing decisions

can be constructed:

d̂(θ) = (1/R) Σ_{r=1..R} Σ_{k=1..K} πk^r φ(θ | μk^r, Σk^r)   (34)

Here the superscript r refers to an MCMC posterior draw and φ(·) is the k-variate multivariate normal density. If marginals of sub-vectors of θ are required, then we simply compute the required parameters from the draws of the joint parameters.
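As a sketch of this computation, the following averages the mixture density implied by each posterior draw at a point θ. The "posterior draws" here are fabricated by jittering fixed values purely for illustration; in practice they would come from the Gibbs sampler described above.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)

# Illustrative "posterior draws" (r = 1..R) for a K = 2 mixture in d = 2.
R, K, d = 200, 2, 2
mu_true = np.array([[2.0, -1.0], [-0.5, -3.0]])

def density_estimate(theta, draws):
    """Bayes estimate of the density ordinate: average the mixture density
    implied by each posterior draw r over r = 1..R."""
    total = 0.0
    for pi_r, mu_r, Sig_r in draws:
        total += sum(pi_r[k] * multivariate_normal.pdf(theta, mu_r[k], Sig_r[k])
                     for k in range(K))
    return total / len(draws)

draws = []
for _ in range(R):
    pi_r = np.abs(np.array([0.6, 0.4]) + rng.normal(0, 0.02, K))
    pi_r /= pi_r.sum()
    mu_r = mu_true + rng.normal(0, 0.05, (K, d))
    Sig_r = np.stack([np.eye(d) * (0.5 + rng.normal(0, 0.02)) for _ in range(K)])
    draws.append((pi_r, mu_r, Sig_r))

# The estimate is label-invariant: permuting component labels in every
# draw leaves the density ordinate unchanged.
print(density_estimate(mu_true[0], draws))
```

Because the estimate sums over all components within each draw, it is unaffected by the label-switching problem noted above.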

3.2.6 Dirichlet process priors

While it can be argued that a finite mixture of normals is a very flexible prior, the number of components must be pre-specified by the investigator. Given that Bayes methods are being used, a practical approach would be to assume a very large number of components and allow the proper priors and natural parsimony of Bayes inference to produce reasonable density estimates. For large samples, it might be reasonable to increase the number of components in order to accommodate greater flexibility. The Dirichlet Process (DP) approach can, in principle, allow the number of mixture components to be as large as the sample size and potentially increase with the sample size. This allows for a claim that a DP prior can facilitate general non-parametric density estimation. Griffin et al. (2010) provide a discussion of the DP approach to density estimation. We review only that portion of this method necessary to fix notation for use within a hierarchical setting.

Consider a general setting in which each θh is drawn from a possibly different multivariate normal distribution:

θh ∼ N(μh, Σh)

The DP prior is a hierarchical prior on the joint distribution of {(μ1, Σ1), ..., (μH, ΣH)}. The DP prior has the effect of grouping together cross-sectional units with the same value of (μ, Σ) and specifying a prior distribution for these possible "atoms." The DP prior is denoted G(α, G0(λ)). G(·) specifies a distribution over distributions that is centered on the base distribution, G0, with tightness parameter, α. Under the DP prior, G0 is the marginal prior distribution for the parameters of any one cross-sectional unit. α specifies the prior distribution on the clustering of units to a smaller number of unique (μ, Σ) values. Given the normal base distribution for the cross-sectional parameters, it is convenient to use a natural conjugate base prior:

G0(λ):  μh | Σh ∼ N(μ̄, a^−1 Σh),  Σh ∼ IW(V, ν)   (35)

λ is the set of prior parameters in (35): (μ̄, a, ν, V). In our approach to a DP model, we also put priors on the DP process parameters, α and λ. The Polya urn representation of the DP model can be used to motivate the choice of prior distributions on these process parameters. α influences the number


of unique values of (μ, Σ) or the probability that a new set of parameter values will be "proposed" from the base distribution, G0. λ governs the distribution of proposed values. For example, if we set λ to put high prior probability on small values of Σ, then the DP prior will attempt to approximate the density of parameters with normal components with small variance. It is also important that the prior on μ put support on a wide enough range of values to locate normal components at wide enough spacing to capture the structure of the distribution of parameters. On the other hand, if we set very diffuse values of λ, then this will reduce the probability of the "birth" of a new component via the usual Bayes factor argument.

α induces a distribution on the number of distinct values of (μ, Σ), as shown in Antoniak (1974):

Pr(I∗ = k) = Sn^(k) α^k Γ(α) / Γ(n + α)   (36)

The Sn^(k) are Stirling numbers of the first kind. I∗ is the number of clusters or unique values of the parameters in the joint distribution of (μh, Σh), h = 1, ..., H. It is common in the literature to set a Gamma prior on α. Our approach is to propose a simple and interpretable distribution for α:

p(α) ∝ (1 − (α − αl)/(αu − αl))^φ,  α ∈ (αl, αu)

We assess the support of α by setting the expected minimum and maximum number of components, I∗min and I∗max. We then invert to obtain the bounds of support for α. Rossi (2014b), Chapter 2, provides further details including the assessment of the φ hyper-parameter. It should be noted that this device does not restrict the support of the number of components but merely assesses an informative prior that puts most of the mass of the distribution of α on values consistent with the specified range in the number of unique components. A draw from the posterior distribution of α can easily be accomplished: I∗ is sufficient, and we can use a "griddy" Gibbs sampler as this is simply a univariate draw.

Priors on λ in (35) can be implemented by setting μ̄ = 0 and letting V = νvIk. If Σ ∼ IW(νvIk, ν), then this implies mode(Σ) = (ν/(ν + 2)) vIk. This parameterization helps separate the choice of a location for the Σ matrix (governed by v) from the choice of the tightness of the prior on Σ (governed by ν). In this parameterization, there are three scalar parameters that govern the base distribution, (a, v, ν). We take them to be a priori independent with the following distributions:

p(a, v, ν) = p(a) p(v) p(ν)
a ∼ U(al, au)
v ∼ U(vl, vu)
ν = dim(θh) − 1 + exp(z),  z ∼ U(zl, zu),  zl > 0
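The inversion from a target range for I∗ to bounds on α can be sketched using the standard DP identity E[I∗ | α, n] = Σ_{i=0..n−1} α/(α + i) together with simple bisection; the sample size and target component counts below are hypothetical.

```python
import numpy as np

def expected_clusters(alpha, n):
    """E[I*] under a DP prior: sum_{i=0}^{n-1} alpha / (alpha + i)."""
    i = np.arange(n)
    return float(np.sum(alpha / (alpha + i)))

def solve_alpha(target, n, lo=1e-4, hi=1e4, tol=1e-8):
    """Invert E[I*] = target for alpha by bisection
    (E[I*] is increasing in alpha)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if expected_clusters(mid, n) < target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Hypothetical assessment: H = 1000 units, expected number of unique
# components between 2 and 30 -> bounds (alpha_l, alpha_u) for the prior.
H = 1000
alpha_l = solve_alpha(2, H)
alpha_u = solve_alpha(30, H)
print(alpha_l, alpha_u)
```

This mirrors the assessment described above: pick I∗min and I∗max, then back out the corresponding support (αl, αu) for the prior on α.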





It is a simple matter to write down the conditional posterior, given that the unique set of (μ, Σ) values is sufficient. The set of I∗ unique parameter values is denoted {(μj∗, Σj∗), j = 1, ..., I∗}. The conditional posterior is given by

p(a, v, ν | {μi∗, Σi∗}) ∝ Π_{i=1..I∗} |a^−1 Σi∗|^{−1/2} exp(−(a/2) μi∗′ Σi∗^−1 μi∗) × |νvIk|^{ν/2} |Σi∗|^{−(ν+k+1)/2} etr(−(νv/2) Σi∗^−1) · p(a, v, ν)

We note that the conditional posterior factors and that, conditional on Σ∗, a and (ν, v) are independent.
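The clustering behavior implied by the Polya urn representation mentioned above can be simulated in a few lines. This sketch tracks only cluster membership, ignoring the actual (μ, Σ) draws from G0, and all settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def polya_urn_clusters(n, alpha, rng):
    """Sequentially assign n units under the Polya urn scheme: unit i joins
    an existing cluster with probability proportional to its size, or starts
    a new cluster (a fresh draw from G0) with probability alpha / (alpha + i)."""
    counts = []                     # cluster sizes
    for i in range(n):
        probs = np.array(counts + [alpha], dtype=float) / (alpha + i)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)        # "birth" of a new atom from G0
        else:
            counts[k] += 1
    return counts

# Larger alpha -> more unique (mu, Sigma) values among the units.
for alpha in (0.5, 5.0):
    sizes = polya_urn_clusters(500, alpha, rng)
    print(alpha, len(sizes))
```

The simulation makes visible the role of α as a tightness parameter: small α concentrates units on a few atoms, large α proposes many new atoms from the base distribution.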

3.2.7 Discrete first-stage priors

Both the economics (Heckman and Singer, 1984) and marketing literatures (see, for example, references in Allenby and Rossi, 1999) have considered the use of discrete random coefficient or mixing distributions. In this approach, the first-stage prior is a discrete distribution which puts mass on only a set of M unknown mass points. Typically, some sort of model selection criterion (such as BIC or AIC) is used to select the number of mass points. As discussed in Allenby and Rossi (1999), discrete mixtures are poor approximations to continuous mixtures. We do not believe that consumer preferences consist of only a small number of types but, rather, a continuum of preferences which can be represented by a flexible continuous distribution. One can view the discrete approach as a degenerate special case of the mixture of normals approach. Given that it is now feasible to use not only mixtures of normals but also mixtures with a potentially infinite number of components, the usefulness of the discrete approximation has declined. In situations where the model likelihood is extremely costly to evaluate (such as some dynamic models of consumer behavior and some models of search), the discrete approach retains some appeal for pure computational convenience.

3.2.8 Conclusions

In summary, hierarchical models provide a very appealing approach to modeling heterogeneity across units. Today almost all conjoint modeling (for details see Chapter 3) is accomplished using hierarchical Bayesian procedures applied to a unit-level multinomial logit model. With more aggregate data, Bayesian hierarchical models are frequently employed to ensure that high-dimensional systems of sales response equations produce reasonable coefficient estimates. Given that there are a very large number of cross-sectional units in marketing panel data, there is an opportunity to move considerably beyond the standard normal distribution of heterogeneity. The normal distribution might not be expected to approximate the distribution of preferences across consumers. For example, brand preference parameters might be expected to be multi-modal, while marketing mix sensitivity parameters such as a price elasticity or advertising responsiveness may


follow highly skewed and sign-constrained distributions.15 For this reason, mixture-of-normals priors can be very useful (see, for example, Dube et al., 2010).
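The reparameterization in footnote 15 can be illustrated with a quick simulation (the normal prior settings for δ are hypothetical): a symmetric prior on the log scale implies a strictly negative, left-skewed distribution for the price coefficient.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical first-stage prior on the log scale: delta_h ~ N(0.5, 0.5^2).
delta = rng.normal(0.5, 0.5, size=100_000)
beta_p = -np.exp(delta)          # price coefficient, strictly negative

# The implied distribution of beta_p is sign-constrained and left-skewed
# (mean below median) even though delta itself is symmetric.
print(beta_p.max(), np.median(beta_p), beta_p.mean())
```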

3.3 Big data and hierarchical models

It is not uncommon for firms to assemble panel data on millions of customers. Extremely large panels pose problems for the estimation and use of hierarchical models. The basic Gibbs sampler strategy in (28) means alternating between drawing unit-level parameters (θh) and the common second-stage prior parameters (τ). Clearly, the unit-level parameters can be drawn in parallel, which exploits a modern distributed computing environment with many loosely coupled processors. Moreover, the amount of data that has to be sent to each processor undertaking unit-level computations is small. However, the draws of the common parameters require assembling all unit-level parameters, {θh}. The communications overhead of assembling unit-level parameters may be prohibitive.

One way out of this computational bottleneck, while retaining the current distributed computing architecture, is to perform common parameter inference on a subset (but probably a very large subset) of the data. The processor farm could be used to draw unit-level parameters conditional on draws of the common parameters which have already been accomplished and reserved for this purpose. Thus, an initial stage of computation would be to implement the MCMC strategy on a large subset of units. Draws of the common parameters from this analysis would be reserved. In a second stage, common parameter draws would be sent down along with unit data to a potentially huge group of processors that would undertake parallel unit-level computations. This remains an important area for computational research.
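A toy sketch of the two-stage idea, using a conjugate normal-normal unit-level model in place of a logit: given reserved common-parameter draws, each unit-level draw touches only that unit's own data, so the loop below could be farmed out to separate processors. All model settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy conjugate setup: y_ht ~ N(theta_h, 1), theta_h ~ N(delta, V).
H, T = 1000, 20
theta_true = rng.normal(2.0, 1.0, H)
y = theta_true[:, None] + rng.normal(size=(H, T))

def draw_unit(y_h, delta, V, rng):
    """Posterior draw of theta_h given the common parameters (delta, V).
    Needs only this unit's data, so it parallelizes trivially."""
    prec = T / 1.0 + 1.0 / V          # likelihood precision + prior precision
    mean = (y_h.sum() / 1.0 + delta / V) / prec
    return rng.normal(mean, np.sqrt(1.0 / prec))

# Second stage: reuse common-parameter draws obtained from an initial
# MCMC run on a subset (here just fixed illustrative values).
delta_draw, V_draw = 2.0, 1.0
theta_draws = np.array([draw_unit(y[h], delta_draw, V_draw, rng)
                        for h in range(H)])
print(np.corrcoef(theta_draws, theta_true)[0, 1])
```

The list comprehension stands in for the "processor farm": each call depends only on one unit's data plus the reserved common draw, with no cross-unit communication.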

3.4 ML and hierarchical models

The Machine Learning (ML) literature has emphasized flexible models which are evaluated primarily on predictive performance. While the theory of estimation does not discriminate between two equally flexible approaches to estimating non-linear and unknown regression functions, as a practical matter investigators in ML have found that certain functional forms or sets of basis functions appear to do very well in approximating arbitrary regression functions.

One could regard the hierarchical model approach entirely from a predictive point of view. To predict some unit-level outcome variable, we should be able to use any function of the other observations on this unit. That is, to predict yht0, we can use any other data on unit h except yht0 itself. The hierarchical approach also suggests that summaries of the unit data (such as means and variances of unit parameters) might be helpful in predicting yht0. This suggests that we might consider ways of training flexible ML methods to imitate or approximate the predictions that arise from a hierarchical approach, and use these as approximate solutions to fitting hierarchical models when it is impractical to implement the full-blown MCMC apparatus. Again, this might be a fruitful avenue for future research.

15 For example, the price coefficient could be reparameterized as βp = −e^δ and the first-stage prior could be placed on δ. This will require care in the assessment of priors, as the δ parameter is on a log scale while the other parameters are on a normal scale.

4 Causal inference and experimentation

As we have indicated, one fundamental goal of marketing research is to inform the decisions which firms make about the deployment of marketing resources. At the core, all firm decisions regarding marketing involve counterfactual reasoning. For example, we must estimate what a potential customer would do had they not been exposed to a paid search ad in order to "attribute" the correct sales response to this action. Marketing mix models pose a much more difficult problem of valid counterfactual estimates of what would happen to sales and profits if marketing resources were re-allocated in a different manner than observed in the past.

The importance of counterfactual reasoning in any problem related to the optimization of resources raises the ante for any model of customer behavior. Not only must this model match the co-variation of key variables in the historical data, but the model must provide accurate and valid forecasts of sales in a new regime with a different set of actions. This means that we must identify the causal relationship between marketing variables and firm sales/profits, and this causal relationship must be valid over a wide range of possible actions, including actions outside the support of the historical data.

The problem of causal inference has received a great deal of attention in the bio-statistics and economics literatures, but relatively little attention in the marketing literature. Given that marketing is, by its very nature, a decision-theoretic field, this is somewhat surprising. The problems in the bio-statistics and economics applications usually involve evaluating the causal effect of a "treatment" such as a new drug or a job-training program. Typically, the models used in these literatures are simple linear models. Often the goal is to estimate a "local" treatment effect, that is, a treatment effect for those induced by an experiment or other incentives to become treated.
A classic example from this literature is the Angrist and Krueger (1991) paper which starts with the goal of estimating the returns to an additional year of schooling but ends up only estimating (with a great deal of uncertainty) the effect of additional schooling for those induced to complete the 10th grade (instead of leaving school in mid-year). To make any policy decisions regarding investment in education, we would need to know the entire causal function (or at least more than one point) for the relationship between years of education and wages. The analogy in marketing analytics is to estimate the causal relationship between exposures to advertising and sales. In order to optimize the level of advertising, we require the whole function not just a derivative at a point. Much of the highly influential work of Heckman and Vytlacil (2007) has focused on the problem of evaluating job training programs where the decision to enroll in


the program is voluntary. This means that those people who are most likely to benefit from the job training program or who have the least opportunity cost of enrolling (such as the recently unemployed) are more likely to be treated. This raises a host of thorny inference problems. The analogy in marketing analytics is to evaluate the effect of highly targeted advertising. Randomized experimentation offers at least a partial solution to the problems of causal inference. Randomization in assignment to treatment conditions can be exploited as the basis of estimators for causal effects. Both academic researchers and marketing practitioners have long advocated the use of randomized experiments. In the direct marketing and credit card contexts, randomized field experiments have been conducted for decades to optimize direct marketing offers and manage credit card accounts. In advertising, IRI International used randomized experiments implemented through split cable to evaluate TV ad creatives (see Lodish and Abraham, 1995). In the early 1990s, randomized store-level experiments were used to evaluate pricing policies by researchers at the University of Chicago (see Hoch et al., 1994). In economics, the Income-Maintenance experiments of the 1980s stimulated an interest in randomized social experiments. These income maintenance experiments were followed by a host of other social experiments in housing and health care.

4.1 The problem of observational data

In the generic problem of estimating the relationship between sales and marketing inputs, the goal is to make causal inferences so that optimization is possible on the basis of our estimated relationship. The problem is that we often have only observational data on which to base our inferences regarding the causal nexus between marketing variables and sales. There is a general concern that not all of the variation in the marketing input variables can be considered exogenous, or as if the variation were the result of random experimentation. Concerns that some of the variation in the right-hand-side variables is correlated with the error term, or jointly determined with sales, mean that observational data may lead to biased or inconsistent causal inferences. For example, suppose we have aggregate time series data16 on the sales of a product and some measure of advertising exposure:

St = f(At | θ) + εt

Our goal is to infer the function f, which can be interpreted as a causal function; that is, we can use this function to make valid predictions of expected sales for a wide range of possible values of advertising.

16 In the general case, assembling even observational data to fit a market response model can be difficult. At least three or possibly four different sources are required: (1) sales data, (2) pricing and promotional data, (3) digital advertising, and (4) traditional advertising such as TV, print, and outdoor. Typically, these various data sources feature data at different levels of temporal, geographic, and product aggregation. For example, advertising is typically not associated with a specific product but with a line of products and may only be available at the monthly or quarterly level.

In order to consider optimizing advertising,




we require a non-linear function which, at least at some point, exhibits diminishing returns. Given that we wish to identify a non-linear relationship, we will require more extensive variation in A than if we assume a linear approximation. The question from the point of view of causal inference is whether or not we can use the variation in the observed data to make causal inferences. As discussed in Section 2.3, the statistical theory behind any likelihood-based inference procedure for such a model assumes the observed variation in A is as though obtained via random experimentation. In a likelihood-based approach, we make the assumption that the marginal distribution of A is unrelated to the parameters, θ , which drive the conditional mean function. An implication of this assumption is that the conditional mean function is identified only via the effect of changes in A; the levels of A have no role in inference regarding the parameters that govern the derivative of f () with respect to A. In practice, this may not be true. In general, if the firm sets the values of A observed in the data on the basis of the function f (), then the assumption that the marginal distribution of A is not related to θ is violated. In this situation, we may not be able to obtain valid (consistent) estimates of the sales response function parameters.17 Manchanda et al. (2004) explain how a model in which both inputs are chosen jointly can be used to extract causal information from the levels of an advertising input variable. However, this approach requires additional assumptions about how the firm chooses the levels of advertising input. Another possibility is that there is some unobservable variable that influences both advertising and sales. For example, suppose there are advertising campaigns for a competing product that is a close substitute and we, as data scientists, are not aware of or cannot observe this activity. 
It is possible that, when there is intensive competitive advertising, the firm increases the scale of its own advertising to counter or blunt its effects. This means that we no longer estimate the parameters of the sales response function consistently. In general, any time the firm sets A with knowledge of some factor that also affects sales and we do not observe this factor, we will have difficulty recovering the sales response function parameters. In some sense, this is a generic and non-falsifiable critique: how do we know that such an unobservable does not exist? We can't prove it. Typically, the way we might deal with this problem is to include as large a set of covariates as possible in the sales equation as control variables. The problem in sales response model building is that we often do not observe the actions of competing products, or we observe them only imperfectly and possibly at a different time frequency. Thus, one very important set of potential control variables is often not available. Of course, this is not the only possible set of variables observable to the firm but not to the data scientist. There are three possible ways to deal with this problem of "endogeneity."

17 An early example is Bass (1969), in which a model of the simultaneous determination of sales and advertising is calibrated using cigarette data. Bass suggested that ad hoc rules which allocate advertising budgets as some percentage of sales create a feedback loop or simultaneity problem.


1. We might consider using data sampled at a much higher frequency than the frequency at which decisions regarding A are made. For example, if advertising decisions are made only quarterly, we might use weekly data and argue that the lion's share of variation in our data holds the strategic decisions of the firm constant.18
2. We might attempt to partition the variation in A into that which is "clean" or unrelated to factors driving sales and that which is not. This is the logical extension of the conditioning approach of adding more observables to the model. We would then use an estimation method which uses only the "clean" portion of the variation.
3. We could consider experimentation to break whatever dependence there is between advertising and sales.

Each of these ideas will be discussed in detail below. Before we embark on a more detailed discussion of these methods, we will relate our discussion of simultaneity or endogeneity to the literature on causal inference for treatment effects.
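A minimal simulation of the unobserved competitive-advertising story above (all coefficients hypothetical) shows how a regression on observational data misses the true causal effect when A responds to an unobserved factor.

```python
import numpy as np

rng = np.random.default_rng(5)

n = 50_000
competitor = rng.normal(size=n)            # unobserved competitive activity
# The firm raises its own advertising when the competitor is active:
A = 1.0 + 0.8 * competitor + rng.normal(size=n)
# True causal effect of A on sales is 2; competitor activity hurts sales.
sales = 2.0 * A - 1.5 * competitor + rng.normal(size=n)

# "Regression-style" estimate using only the observational (A, sales) data:
b_ols = np.cov(A, sales)[0, 1] / np.var(A)
print(b_ols)   # biased away from 2 because A is correlated with the error
```

Using the estimated slope to "optimize" advertising here would understate the true return to A; no amount of additional data of the same kind removes the bias.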

4.2 The fundamental problem of causal inference

A growing literature (see, for example, Angrist and Pischke, 2009 and Imbens and Rubin, 2014) emphasizes a particular formulation of the problem of causal inference. Much of this literature re-interprets existing econometric methods in light of this paradigm. The basis for this paradigm of causal inference was originally suggested by Neyman (1990), who conceived of the notion of potential outcomes for a treatment. The notation favored by Imbens and Rubin is as follows. Y represents the outcome random variable. In our case, Y will be sales or some sort of event (like a conversion or click) which is on the way toward a final purchase. We seek to evaluate a treatment, denoted D. For now, consider a binary treatment such as exposure to an ad.19 We conceive of there being two potential outcomes:

• Yi(1): potential outcome if unit i is exposed to the treatment.
• Yi(0): potential outcome if unit i is not exposed to the treatment.

We would like to estimate the causal effect of the treatment, which is defined as

Δi = Yi(1) − Yi(0)

The fundamental problem of causal inference is that we only see one of the two potential outcomes for each unit. That is, we only observe Yi(1) for Di = 1 and Yi(0) for Di = 0. Without further assumptions or information, this statistical

18 Here we are assuming that within-period variation in A is exogenous. For example, if promotions or ad campaigns are designed at the quarterly level, then we are assuming that within-quarter variation is execution-based and unrelated to within-quarter demand shocks. The validity of this assumption would have to be assessed in the same way that any argument for exogeneity is made. However, this exploits institutional arrangements that may well be argued to be exogenous.
19 It is a simple matter to extend the potential outcomes framework to more continuous treatment variables, such as in causal inference with respect to the effect of price on demand.




problem is un-identified. Note that we have already simplified the problem greatly by assuming a linear model or restricting our analysis to only one "level" of treatment. Even if we simplify the model by assuming a constant treatment effect, Δi = Δ ∀i, the causal effect is still not identified. To see this problem, let's take the mean difference in Y between those who were treated and not treated and express this in terms of potential outcomes:

E[Yi | Di = 1] − E[Yi | Di = 0] = E[Yi(1) | Di = 1] − E[Yi(0) | Di = 0]
= E[Yi(1) | Di = 1] − E[Yi(0) | Di = 1] + E[Yi(0) | Di = 1] − E[Yi(0) | Di = 0]

This equation simply states that what the data identify is the mean difference in the outcome variable between the treated and untreated, and that this can be expressed as the sum of two terms. The first term is the effect on the treated, E[Yi(1) | Di = 1] − E[Yi(0) | Di = 1], and the second term is called the selection bias, E[Yi(0) | Di = 1] − E[Yi(0) | Di = 0]. Selection bias occurs when the potential outcome for those assigned to the treatment differs in a systematic way from that of those assigned to the "control," i.e., assigned not to be treated. This selection bias is what inspired much of the work of Heckman, Angrist, and Imbens to seek further sources of information.

The classic example of this is the so-called "ability bias" argument in the literature on education. We can't simply compare the wages of college graduates with those who did not graduate from college, because it is likely that college graduates have greater ability even "untreated" with a college education. Those who argue for the "certification" view of higher education are the extreme point of this selection bias argument – they argue that the point of education is not the courses in Greek philosophy but simply the selection effect of finding higher-ability individuals.

It is useful to reflect on what sorts of situations are likely to have large selection bias in the evaluation of marketing actions.
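The decomposition above can be verified numerically. In this hypothetical simulation, higher-"ability" units select into treatment, so the naive difference in means equals the effect on the treated plus a positive selection bias.

```python
import numpy as np

rng = np.random.default_rng(6)

n = 100_000
ability = rng.normal(size=n)
# Potential outcomes with a constant treatment effect of 1.0:
Y0 = ability + rng.normal(size=n)
Y1 = Y0 + 1.0
# Higher-ability units select into treatment:
D = (ability + rng.normal(size=n) > 0.5).astype(int)
Y = np.where(D == 1, Y1, Y0)           # only one outcome is observed

naive = Y[D == 1].mean() - Y[D == 0].mean()
effect_on_treated = (Y1[D == 1] - Y0[D == 1]).mean()
selection_bias = Y0[D == 1].mean() - Y0[D == 0].mean()
print(naive, effect_on_treated, selection_bias)
```

In observed data only `Y` and `D` are available, so the naive comparison cannot separate the two terms; the simulation can, because it keeps both potential outcomes.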
Mass media like TV or print are typically targeted only at a very broad demographic group. For example, advertisers on the Super Bowl are paying a great deal of money to target men aged 25-45. There is year-to-year variation in Super Bowl viewership which, in principle, would allow us to estimate some sort of regression-based model of the effect of exposure to Super Bowl ads. The question is: what is the possible selection bias? It is true that the effectiveness of a beer ad on those who view the Super Bowl versus a random consumer may be very different, but that may not be relevant to the Super Bowl advertiser. The SB advertiser cares more about the effect on the treated, that is, the effect of exposure on those in the target audience who view the SB. Are those who choose not to view the SB in year X different from those who view the SB in year Y? Not necessarily; viewership is probably driven by differences in the popularity of the teams in the SB. Thus, if our interest is the effect on the treated Super Bowl fan, there probably is little selection bias (under the assumption that the demand for beer is similar across the national population of SB fans).

However, selection bias is probably a very serious problem in other situations. Consider a firm like North Face that markets outdoor clothing. This is a highly seasonal industry with two peaks in demand each year: one in the spring as people anticipate summer outdoor activities, and another in the late fall as consumers purchase holiday gifts. North Face is aware of these peaks in demand and typically schedules much of its promotional and advertising activity to coincide with them. This means we can't simply compare sales in periods of high advertising activity to sales in periods of low activity, as we would be confounding the seasonal demand shift with the effect of marketing.

In the example of highly seasonal demand and coordinated marketing, the marketing instruments are still mass or untargeted for the most part (other than demographic and, possibly, geographic targeting rules). However, the problem of selection bias can also be created by various forms of behavioral targeting. The premier example of this is the paid search advertising products that generate much of Google Inc.'s profits. Here the ad is triggered by the consumer's search actions. Clearly, we can't compare the subsequent purchases of someone who uses search keywords related to cars with those of consumers who were not exposed to paid search ads for cars. There is apt to be a huge selection bias, as most of those not exposed to the car keyword search ad are not in the market to purchase a car. Correlational analyses of the impact of paid search ads are apt to show a huge impact that is largely selection bias (see Blake et al., 2015 for an analysis of paid search ads for eBay which concludes that these ads have little effect).

There is no question that targeting ads based on the preferences of customers as revealed in their behavior is apt to become even more prevalent in the future. This means that, for all the talk of "big data," we are creating more and more data that is not amenable to analysis with our standard bag of statistical tricks.

4.3 Randomized experimentation

The problem with observational data is the potential correlation between "treatment" assignment and the potential outcomes. We have seen that this is likely to be a huge problem for highly targeted forms of marketing activity where the targeting is based on customer preferences. More generally, any situation in which some of the variation in the right-hand-side variables is correlated with the error term in the sales response equation will make any "regression-style" method inconsistent in estimating the parameters of the causal function. For example, the classical errors-in-variables model results in a correlation between the measured values of the right-hand-side variables and the error term. In a randomized experiment, the key idea is that assignment to the treatment is random and therefore uncorrelated with any other observable or unobservable variable. In particular, assignment to the treatment is uncorrelated with the potential outcomes. This eliminates the selection bias term:

E[Yi(0) | Di = 1] − E[Yi(0) | Di = 0] = 0.

This means that the difference in means between the treated and untreated populations consistently estimates not only the effect on the treated but also the average effect, that is, the effect on a person chosen at random from the population.
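The elimination of the selection bias term can be illustrated with a small simulation. This is a minimal sketch with hypothetical numbers: the true treatment effect is 2, and "observational" assignment lets high-baseline units opt in, while randomized assignment is a coin flip independent of the potential outcomes.

```python
import random
import statistics

random.seed(0)
N = 100_000

# Potential outcomes: baseline sales Y(0), and Y(1) = Y(0) + 2.
y0 = [random.gauss(10, 3) for _ in range(N)]
y1 = [y + 2 for y in y0]

# Observational ("selected") assignment: high-baseline units opt in.
d_selected = [1 if y > 11 else 0 for y in y0]
# Randomized assignment: independent of the potential outcomes.
d_random = [1 if random.random() < 0.5 else 0 for _ in range(N)]

def diff_in_means(d, y0, y1):
    """Observed treated-minus-control difference in means."""
    treated = [y1[i] for i in range(len(d)) if d[i] == 1]
    control = [y0[i] for i in range(len(d)) if d[i] == 0]
    return statistics.mean(treated) - statistics.mean(control)

naive = diff_in_means(d_selected, y0, y1)       # badly biased upward
experimental = diff_in_means(d_random, y0, y1)  # close to the true effect of 2
```

Under selection, the difference in means confounds the treatment effect with the difference in baselines between opt-in and opt-out units; under randomization the selection bias term is zero in expectation.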



CHAPTER 2 Inference for marketing decisions

However, it is important to understand that when we say a person chosen at random from the "population," we are restricting attention to the population of units eligible for assignment in the experiment. Deaton and Cartwright (2016) call the set of units eligible for assignment to a treatment cell (including the control cell) the trial sample. In many randomized experiments, the trial sample is anything but a random sample of the appropriate population to which we wish to extrapolate the results of the experiment. Most experiments have a very limited domain. For example, if we randomly assign DMAs in the Northeast portion of the US, our population is only that restricted domain. Most of the classic social experiments in economics have very restricted domains or populations to which the results can be extrapolated. Generalizability is the most restrictive aspect of randomized experimentation. Experimentation in marketing applications, such as the "geo" or DMA-based experiments conducted by Google and Facebook, starts to get at experiments which are generalizable to the relevant population (i.e., all US consumers).

Another key weakness of randomization is that it is really a large sample concept. It is of little comfort to the analyst that treatments were randomly assigned if it turns out that randomization "failed" and did not give rise to a random realized sample of treated and untreated units. With finite N, this is a real possibility. In some sense, all we know is that statements based on randomization only work asymptotically. Deaton and Cartwright (2016) make this point as well: only when all other effects actually balance out between the controls and the treated does randomization achieve the desired aim, and this happens only in expectation or in infinite samples. If there are a large number of factors to be "balanced out," then this may require very large N.
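The point that randomization balances covariates only in expectation can be illustrated with a simulation of realized imbalance. The sketch below (hypothetical numbers) measures how far apart the treated and control groups end up, on average, on a single pre-treatment covariate when units are split at random.

```python
import random
import statistics

def realized_imbalance(n, n_sims=500, seed=1):
    """Mean absolute difference in a standard-normal covariate's average
    between randomly assigned treated and control halves of n units."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_sims):
        x = [rng.gauss(0, 1) for _ in range(n)]
        rng.shuffle(x)
        treated, control = x[: n // 2], x[n // 2 :]
        gaps.append(abs(statistics.mean(treated) - statistics.mean(control)))
    return statistics.mean(gaps)

gap_small = realized_imbalance(20)    # noticeable realized imbalance
gap_large = realized_imbalance(2000)  # roughly an order of magnitude smaller
```

With 20 units, the realized groups routinely differ by a third of a standard deviation on this covariate; with 2000 units the typical gap is about ten times smaller. With many covariates to balance, even larger N is needed for all of them to balance simultaneously.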
A practical limitation of experimentation is that there can be situations in which randomization results in samples with low power to resolve causal effects. This can happen when the effects of the variables being tested are small, the sales response model has low explanatory power, and the sales dependent variable is highly variable. A simple case might be an analysis of the effect of an ad using individual data and no other covariates in the sales response model. The standard errors of the causal effect (here just the coefficient on the binary treatment variable) decrease only at rate √N and increase in the standard deviation of the error term. If the effects are small, then the standard deviation of the error term is about the same as the standard deviation of sales. Simple power calculations in these situations can easily result in experimental designs with thousands or even tens of thousands of subjects, a point made recently by Lewis and Rao (2014). Lewis and Rao neglect to say that if there are other explanatory variables (such as price and promotion) included in the model, then even though sales may be highly variable, we still may be able to design experiments with adequate power even with smallish N. If there are explanatory variables included in the response model (in addition to dummy variables corresponding to treatment assignment), then the variance of the error term can be much lower than the variance of the dependent variable (sales). In these situations, the power calculations that lead to pessimistic views regarding the number of experimental subjects could change dramatically, and the conclusions of Lewis and Rao may not apply. It should be emphasized that this is true even though these additional control variables will be (by construction) orthogonal to the treatment regressors. While Lewis and Rao's point regarding the difficulties in estimating ad effects is well taken due to the small size of ad effects, it does not apply to randomized experimentation on pricing. Marketing researchers have long observed that price and promotional effects are often very large (price elasticities exceeding 3 in absolute value and promotional lifts of over 100 percent). This means that randomized experiments may succeed in estimating price and promotional effects with far smaller numbers of subjects than advertising experiments require (Dubé and Misra, 2018 is an example of recent attempts to use experimentation to optimize pricing).

While randomization might seem the panacea for estimation of causal effects, it has severe limitations in situations where a large number, or a continuum, of causal effects is required. For example, consider the situation of two marketing variables and a possibly non-linear causal function: in order to maximize profits over the two variables, we must estimate not just the gradient of this function at some point but the entire function. Clearly, this would require a continuum of experimental conditions. Even if we discretized the values of the variables used in the experiments, the experimental paradigm clearly suffers from the curse of dimensionality as we add variables to the problem. For example, the typical marketing mix model might include at least five or six marketing variables, resulting in experiments with hundreds of cells.
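The power calculations mentioned above can be sketched with the standard normal-approximation sample-size formula for a two-cell experiment. The effect sizes and standard deviations below are illustrative assumptions, not estimates from the studies cited.

```python
import math
from statistics import NormalDist

def n_per_cell(effect, sigma, alpha=0.05, power=0.8):
    """Sample size per cell to detect `effect` in a two-sided test when
    the outcome has residual standard deviation `sigma` (normal approx.)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_a + z_b) * sigma / effect) ** 2)

# A small ad effect against noisy sales requires an enormous sample...
n_ad = n_per_cell(effect=0.02, sigma=1.0)
# ...while a large price/promotion effect needs comparatively few units,
n_promo = n_per_cell(effect=0.5, sigma=1.0)
# ...and control covariates that halve the residual sigma cut N roughly 4x.
n_promo_controls = n_per_cell(effect=0.5, sigma=0.5)
```

This reproduces the qualitative contrast in the text: tens of thousands of subjects for a small ad effect, tens of subjects for a large promotional effect, and a further fourfold reduction when covariates shrink the error variance.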

4.4 Further limitations of randomized experiments

4.4.1 Compliance in marketing applications of RCTs

The history of randomized experiments dates from agricultural experiments in which the "treatment" consists of various agricultural practices such as fertilization. When a plot was assigned to an experimental cell, there were no "compliance" issues – whatever treatment was prescribed was administered. However, in both medical and social experimentation, compliance can be an important problem. Much of Heckman's work centers on the evaluation of labor market interventions such as job training programs. Even with completely randomized selection for treatment, the government cannot compel US citizens to enroll in job training programs. The best we can do is randomize eligibility for treatment or assignment to treatment. Clearly, there can be selection bias in the acceptance of treatment by those assigned to it. Heckman and others have modeled this decision as a rational choice in which people consider the benefits of the job training program as well as their opportunity costs of time. In any event, the "endogeneity" of the actual receipt of treatment means that selection bias would affect the "naive" difference in means (or regression generalizations) approach to effect estimation. There are two ways to tackle this problem: (1) explicit modeling of the decision to accept treatment or (2) use of the treatment assignment as an instrumental variable (see the discussion in Angrist and Pischke, 2009, Section 4.4.3). The instrumental variables estimator, in this case, simply scales the difference in means by the compliance rate.

In some situations, such as geographically based ad experiments or store-level pricing experiments, compliance is not an issue in marketing.20 If we randomize exposure to an ad by DMA, all consumers in the DMA will have the opportunity to be exposed to the ad. Non-compliance would require consumers in the DMA to deliberately avoid exposure to the ad. Note that this applies to any ad delivery mechanism as long as everyone in the geographic area has the opportunity to become exposed. The only threat to the experimental design is "leakage," in which consumers in nearby DMAs are exposed to the stimulus. Store-level pricing or promotional experiments are another example where compliance is assured. However, consider what happens when ad experiments are conducted by assigning individuals to an ad exposure treatment. For example, Amazon may randomly assign customers to be exposed to an ad or a "recommendation" while others are assigned not to be exposed. Not all those assigned to the treatment cell will actually be exposed to the ad; the only way to be exposed is to visit the Amazon site (or mobile app). Those who visit the Amazon website more frequently will have a higher probability of being exposed to the ad than less frequent or less intense users. If the response to the ad is correlated with visit frequency, then the analysis of the RCT for this ad will only reveal the "intent-to-treat" effect and not the average treatment effect on the treated. One way to avoid this problem is to randomize assignment by session and not by user (see Sahni, 2015 for details). The compliance issue in digital advertising becomes even more complicated due to the "ad campaign" optimization algorithms used by advertising platforms to enhance the effect of ads.
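The intent-to-treat versus effect-on-the-treated distinction, and the instrumental-variables fix of scaling by the compliance rate, can be sketched in a simulation of one-sided non-compliance. All parameters are hypothetical, and compliance here is assumed independent of the treatment effect (with selective compliance, the scaled estimate identifies the complier effect instead).

```python
import random
import statistics

rng = random.Random(2)
N = 100_000
true_effect = 1.0   # effect of actual exposure on those exposed
compliance = 0.4    # fraction of the assigned group who visit and see the ad

z = [rng.random() < 0.5 for _ in range(N)]            # random assignment
d = [zi and (rng.random() < compliance) for zi in z]  # actual exposure
y = [rng.gauss(5, 2) + (true_effect if di else 0) for di in d]

mean = statistics.mean
# Intent-to-treat: difference in means by *assignment*, diluted by non-compliance.
itt = (mean(yi for yi, zi in zip(y, z) if zi)
       - mean(yi for yi, zi in zip(y, z) if not zi))
# Observed compliance rate among the assigned.
observed_compliance = mean(1.0 if di else 0.0 for di, zi in zip(d, z) if zi)
# The IV (Wald) estimator: scale ITT by the compliance rate.
wald = itt / observed_compliance
```

The ITT is roughly compliance × effect (about 0.4 here), while the scaled estimator recovers the effect on those actually exposed.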
For example, Facebook has established a randomized experimentation platform (see Gordon and Zettelmeyer, 2017 for an analysis of some Facebook ad experiments). The idea is to allow Facebook advertisers to easily implement and analyze randomized experiments for ad campaigns. Facebook users are randomly assigned to a "control" status in which they will not be exposed to ads. Ads are served by a complicated auction mechanism. For controls, if the ad campaign in question wins an auction for space on a Facebook page that a control is viewing, then the "runner-up" ad is served instead. This ensures one-sided compliance – the controls will never have access to the treatment. However, the "experimental" unit will only have an opportunity to view the ad if they visit Facebook, as we have already pointed out. The problem of compliance is made worse by the Facebook ad campaign "optimization" feature. Any ad campaign runs for some span of time (typically 3-5 weeks on Facebook). Data from exposures to the ad early in the campaign is used to model who should be exposed to the ad in later stages of the campaign, in addition to whatever targeting criteria are embedded in the campaign. Thus, the probability of exposure to the ad can vary for two Facebook users who visit Facebook with the same frequency and intensity. This means that the Facebook experimentation platform can only estimate an intent-to-treat effect of the ad campaign and not the effect on the treated. Johnson et al. (2017) construct a proxy control ad which they term a "Ghost Ad"; they claim this approach avoids some of the unintended consequences of in-campaign optimization and can be implemented at lower cost than the more traditional approach in which the control group is exposed to a "dummy" ad or public service announcement. While the "ghost ad" approach appears promising as a way to reduce costs and deal with ad optimization, it must come at some cost in power which is not yet clear.

20 There are always problems with the implementation of store experiments. That is, the researcher must verify or audit stores to ensure treatments are implemented during the time periods of the experiment. This is not a "compliance" issue, as compliance concerns whether experimental subjects can decide or influence their own exposure to the treatment: consumers are exposed to the properly executed store experiment stimuli. Similarly, geographic ad experiments do not have an explicit compliance problem. There may be a leakage problem across geographies, but this is not a compliance problem.

4.4.2 The Behrens-Fisher problem

We have explained that the simplest randomized experiment consists of only a control and one treatment cell, with the effect of treatment estimated by computing the difference in means. Without extensive control covariates, the difference in means is apt to be a very noisy, but consistent (in the number of consumers in the experiment), estimate of the causal effect of treatment. However, Deaton and Cartwright (2016) point out that inference with standard methods faces what statisticians have long called the "Behrens-Fisher" problem. If the variance of the outcome variable differs between the control and treatment groups, then the distribution of the difference in means will be a function of the variance ratio (there is no simple t-distribution anymore). Since the distribution of the test statistic depends on unknown variance parameters, standard finite sample testing methods cannot be used. Given random assignment to control and treatment groups, any difference in variability must be due to the treatment effect. In a world with heterogeneous treatment effects, we interpret the difference in means between controls and treated as measuring the average effect of the treatment. Thus, the heterogeneous treatment effects enter the error term and create a variance component not present for the controls. For this reason, we might expect the variance of the treated cell to be higher than that of the control cell, leaving us with the Behrens-Fisher inference problem. One could argue that the Behrens-Fisher problem is apt to be minimal in advertising experiments, as the treatment effects are small, so that the variance component introduced in the ad exposure treatment cell would be small. However, in experiments related to pricing actions, the Behrens-Fisher problem could be very consequential.
Many trained in modern econometrics would claim that the Behrens-Fisher problem can be avoided or "solved" by the use of so-called heteroskedasticity-consistent (White) variance-covariance estimators. This is nothing more than saying that, in large samples, the Behrens-Fisher problem "goes away" in the sense that we can consistently recover the different variances and proceed as though we actually know the variances of the two groups. This can also be seen as a special case of the "cluster" variance problem with only two clusters. Again, heteroskedasticity-consistent estimators have long been advocated as a "solution" to the cluster variance problem. However, it is well known that heteroskedasticity-consistent variance estimators can have very substantial finite sample biases (see Imbens and Kolesar, 2016 for explicit simulation studies of the case considered here of two clusters). There appears to be no way out of the Behrens-Fisher problem without additional information regarding the relative size of the two variances.
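The standard approximate (not exact) treatment of the Behrens-Fisher setting is Welch's statistic with Welch-Satterthwaite degrees of freedom, which does not assume a common variance. A minimal sketch, with made-up illustrative data in which the treated cell is far more variable than the control cell:

```python
import statistics

def welch_t(sample1, sample2):
    """Return (t, approximate df) for a two-sample comparison without
    assuming equal variances (Welch-Satterthwaite approximation)."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = statistics.mean(sample1), statistics.mean(sample2)
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)
    se2 = v1 / n1 + v2 / n2
    t = (m1 - m2) / se2 ** 0.5
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

# Hypothetical data: heterogeneous treatment effects inflate the treated
# cell's variance relative to the control cell.
treated = [4.9, 7.1, 3.0, 9.4, 1.2, 8.8, 2.5, 6.6]
control = [5.0, 5.1, 4.9, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0, 5.1]
t_stat, df = welch_t(treated, control)
```

With these numbers the approximate degrees of freedom are about 7, far below the pooled-test value of n1 + n2 − 2 = 16, which is exactly the sense in which treating the variances as known is too optimistic in small samples.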

4.5 Other control methods

We have seen that randomization can be used to consistently estimate causal effects (or eliminate selection bias). In Section 5, we will discuss instrumental variables approaches; one way of viewing an instrumental variable (IV) is as a source of "naturally" occurring randomization that can help solve the fundamental problem of causal inference. Another approach is to add covariates to the analysis in the hope of achieving independence of the treatment exposure conditional on these covariates. If we can find covariates that are highly correlated with the unobservables and add them to the sales response model, then the estimate on the treatment or marketing variables of interest can be "cleaner," or less confounded with selection bias.

4.5.1 Propensity scores

If we have individual level data and are considering a binary treatment such as ad exposure, then conditioning on covariates to achieve approximate independence simplifies to the use of the propensity score as a covariate. The propensity score21 is nothing more than the probability that the individual is exposed to the ad as a function of covariates (typically the fitted probability from a logit/probit model of exposure). For example, suppose we want to measure the effectiveness of a YouTube ad for an electronic device. The ad is shown on a YouTube channel whose theme is electronics. Here the selection bias problem can be severe – those exposed to the ad may be predisposed to purchase the product. The propensity score method attempts to adjust for these biases by modeling the probability of exposure to the ad based on covariates such as demographics and various "techno-graphics" such as browser type and previous viewing of electronics YouTube channels. The propensity score estimate of the treatment or ad exposure effect comes from a response model that includes the treatment variable as well as the propensity score. Typically, effect sizes are reduced by inclusion of the propensity score in the case of positive selection bias.

Of course, the propensity score method is only as good as the set of covariates used to form the propensity score. There is no way to test that a propensity score fully adjusts for selection bias other than confirmation via true randomized experimentation. Goodness-of-fit or statistical significance of the propensity score model is reassuring but not conclusive. There is a long tradition of empirical work in marketing demonstrating that demographic variables are not predictive of brand choice or brand preference.22 This implies that propensity score models built on standard demographics are apt to be of little use in reducing selection bias and obtaining better causal effect estimates.

Another way of understanding the propensity score method is to think about a "synthetic" control population. That is, for each person who is exposed to the ad, we find a "twin" who is identical (in terms of product preferences and ability to buy) but was not exposed to the ad. The difference in means between the exposed (treatment) group and this synthetic control population should be a cleaner estimate of the causal effect. In terms of propensity scores, those with similar propensity scores are considered "twins." In this same spirit, there is a large literature on "matching" estimators that attempt to construct synthetic controls (cf. Imbens and Rubin, 2014, Chapters 15 and 18). Again, any matching estimator is only as good as the variables used in implementing the "matching."

21 See Imbens and Rubin (2014), Chapter 13, for more details on propensity scores.
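A propensity score adjustment can be sketched on simulated data with a single confounder. Everything here is hypothetical: the covariate x stands in for something like "techno-affinity," which raises both ad exposure and baseline purchasing, and the adjustment shown is inverse-probability weighting on a fitted logit (one of several ways to use the score; the chapter describes including it as a covariate in a response model).

```python
import math
import random
import statistics

rng = random.Random(3)
N = 10_000
x = [rng.gauss(0, 1) for _ in range(N)]
# Exposure probability rises with x; outcome rises with x and with exposure.
d = [rng.random() < 1 / (1 + math.exp(-1.5 * xi)) for xi in x]
y = [2.0 * xi + (1.0 if di else 0.0) + rng.gauss(0, 1) for xi, di in zip(x, d)]

def fit_logit(x, d, steps=150, lr=1.0):
    """One-covariate logistic regression fit by gradient ascent."""
    a, b, n = 0.0, 0.0, len(x)
    for _ in range(steps):
        ga = gb = 0.0
        for xi, di in zip(x, d):
            p = 1 / (1 + math.exp(-(a + b * xi)))
            ga += (di - p) / n
            gb += (di - p) * xi / n
        a += lr * ga
        b += lr * gb
    return a, b

a, b = fit_logit(x, d)
p = [1 / (1 + math.exp(-(a + b * xi))) for xi in x]

# Naive difference in means: badly inflated by selection on x.
naive = (statistics.mean(yi for yi, di in zip(y, d) if di)
         - statistics.mean(yi for yi, di in zip(y, d) if not di))
# Inverse-probability-weighted contrast using the fitted propensity score.
ipw = (sum(yi * di / pi for yi, di, pi in zip(y, d, p))
       / sum(di / pi for di, pi in zip(d, p))
       - sum(yi * (1 - di) / (1 - pi) for yi, di, pi in zip(y, d, p))
       / sum((1 - di) / (1 - pi) for di, pi in zip(d, p)))
```

Because x is the only confounder and is included in the score, the weighted estimate lands near the true effect of 1 while the naive contrast is several times too large; with an omitted confounder, the adjustment would fail, which is the text's point about covariate quality.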

4.5.2 Panel data and selection on unobservables

The problem of selection bias and the barriers to causal inference with observational data can also be interpreted as the problem of "selection on unobservables." Suppose our goal is to learn about the income effects on the demand for a class of goods such as private label goods (see, for example, Dubé et al., 2018). We could take a cross-section of households and examine the correlation between income and demand for private label goods. If we are interested in how the business cycle affects demand for private labels, then we want the true causal income effect. It could be that there is some unobservable household trait (such as pursuit of status) that drives both attainment of higher income and lower demand for lower quality private label goods. This unobservable would create a spurious negative correlation between household income and private label demand. Thus, we might be suspicious of cross-sectional results unless we can properly control (by inclusion of the appropriate covariates) for the "unobservables," using proxies for or direct measurement of the unobservables. If we have panel data and we think that the unobservables are time invariant, then we can adopt a "fixed effects" style approach which uses only variation within unit over time to estimate causal effects. The only assumption required here is that the unobservables are time invariant. Given that marketing data sets seldom span more than a few years, this time invariance assumption seems eminently reasonable. It should be noted that as the time span increases, a host of non-stationarities arise, such as the introduction of new products, entry of competitors, etc. In sum, it is not clear that we would want to use a long time series of data without modeling the evolution of the industry we are studying. Of course, as pointed out in Section 3.1 above, the fixed effects approach only works with linear models.

22 See, for example, Fennell et al. (2003).
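The "within" (fixed effects) idea can be sketched on simulated panel data. The numbers are hypothetical: a time-invariant unit effect is correlated with the regressor, so pooled OLS is biased, while demeaning within each unit removes the unobservable and recovers the true slope.

```python
import random
import statistics

rng = random.Random(4)
units, periods, beta = 200, 8, 0.5

x, y = [], []
for i in range(units):
    alpha_i = rng.gauss(0, 2)            # time-invariant unit unobservable
    for t in range(periods):
        x_it = alpha_i + rng.gauss(0, 1)  # regressor correlated with alpha_i
        x.append(x_it)
        y.append(alpha_i + beta * x_it + rng.gauss(0, 1))

def ols_slope(x, y):
    """Simple one-regressor OLS slope."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

pooled = ols_slope(x, y)  # biased: picks up alpha_i's influence

# Within transformation: demean x and y inside each unit, then run OLS.
xd, yd = [], []
for i in range(units):
    xs = x[i * periods:(i + 1) * periods]
    ys = y[i * periods:(i + 1) * periods]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    xd += [v - mx for v in xs]
    yd += [v - my for v in ys]

within = ols_slope(xd, yd)  # close to the true beta of 0.5
```

The within estimator uses only variation inside each unit over time, exactly as described above; this also illustrates why the trick is tied to linear models, where the unit effect enters additively and drops out under demeaning.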




Consider the example of estimating the effect of a Super Bowl ad. Aggregate time series data may have insufficient variation in exposure to estimate ad effects. Pure cross-sectional variation confounds regional preferences for products with truly useful variation in ad exposure. Panel data, on the other hand, might be very useful for isolating Super Bowl ad effects. Klapper and Hartmann (2018) exploit a short panel of six years of data across about 50 different DMAs to estimate the effects of CPG ads. They find that there is a great deal of year-to-year variation in SB viewership within the same DMA. It is hard to believe that preferences for these products vary from year to year in a way that is correlated with the popularity of the SB broadcast. Far more plausible is that this variation depends on the extent to which the SB is judged to be interesting at the DMA level. This could be because a home team is in the SB, or it could just be due to the national or regional reputation of the contestants. Klapper and Hartmann estimate linear models with brand-DMA fixed effects (intercepts) and find a large and statistically significant effect of SB ads by beer and soft drink advertisers. This is quite an achievement given the cynicism in the empirical advertising literature about the ability to measure advertising effects with sufficient power absent experimental variation. Many, if not most, of the marketing mix models estimated today are estimated on aggregate or regional time series data. The success of Klapper and Hartmann in estimating effects using more disaggregate panel data is an important source of hope for the future of marketing analytics. It is well known, however, that the idea of using fixed effects or unit-specific intercepts does not generalize to non-linear models.
If we want to optimize the selection of marketing variables, then we will have to use more computationally intensive hierarchical modeling approaches that allow response parameters to vary over cross-sectional units. Advocates of the fixed effects approach argue that fixed effects require neither distributional assumptions nor the assumption that unit parameters are independent of the right-hand-side variables. Given that it is possible to construct hierarchical models with a general distributional form, as well as to allow unit characteristics to affect these distributions, the time seems ripe to move to hierarchical approaches for marketing analytics with non-linear response models.

4.5.3 Geographically based controls

In the area of advertising research, some have exploited a control strategy that depends on historical institutional artifacts in the purchase of TV advertising. In the days when local TV stations were limited by the reach of their signal strength, it made sense to purchase local TV advertising on the basis of a definition of media market that reflected the boundaries of the TV signal. There are 204 such "Designated Market Areas" (DMAs) in the US, and local TV advertising is purchased by DMA. This means that there are "pairs" of counties on opposite sides of a DMA boundary, one of which receives the ad exposure while the other does not. Geographical proximity also serves as a kind of "control" for other factors influencing ad exposure or ad response. Shapiro (2018) uses this strategy to estimate the effect of direct-to-consumer ads for various anti-depressant drugs. Instead of using all variation in viewership across counties and across time, Shapiro limits variation to a relatively small number of "paired" DMAs. Differences in viewership between these "bordering" DMAs are used to identify ad effects. Shapiro finds only small differences between ad effects estimated with his "border strategy" and without it. The idea of exploiting institutional artifacts in the way advertising is purchased is a general one that might be applied in other ways, although the demise of broadcast or even subscription TV in favor of streaming will likely render this particular "border strategy" increasingly irrelevant. The broader idea of exploiting discreteness in the allocation or exposure rule used by firms is a case of what is called a regression discontinuity design, discussed below.

4.6 Regression discontinuity designs

Many promotional activities in marketing are conducted via some sort of threshold rule or discretized into various "buckets." For example, consider the loyalty program of a gambling casino. The coin of the realm in this industry is the expected win for each customer, which is simply a function of the volume of gambling and the type of game. The typical loyalty program encourages customers to gamble more and come back to the casino by establishing a set of thresholds. As customers increase their expected win, they "move" from one tier or "bucket" of the program to the next. In the higher tiers, the customer receives various benefits like complimentary rooms or meals. The key is that there is a discrete jump in benefits by design of the loyalty program. On the other hand, it is hard to believe that the customer's response to the level of complimentary benefits is non-smooth or discontinuous. Thus, it would seem that we can "select" on the observables by comparing those customers whose volume of play is just on either side of each discontinuity in the reward program. As Hartmann et al. (2011) point out, as long as the customer is not aware of the threshold, or the benefits from "selecting in" (moving to the next tier) are small relative to the cost of greater play, this constitutes a valid Regression Discontinuity (RD) design. Other examples in marketing include direct mail activity (those who receive offers and/or contact are a discontinuous function of past order history) and geographic targeting (it is unlikely people will move to get the better offer). But if consumers are aware that there are large promotions or rebates for a product and they can change their behavior (such as purchase timing), then an RD approach is likely to be invalid. Regression discontinuity analysis has received a great deal of attention in economics as well (see Imbens and Lemieux, 2008).
The key assumption is that the response function is continuous in the neighborhood of the discontinuity in the assignment of the treatment. There are both parametric and non-parametric forms of analysis, reflecting the importance of estimating the response function without bias that would adversely affect the RD estimates. Parametric approaches require a great deal of flexibility, which may compromise power, while non-parametric methods rest on the promise to narrow the window of responses used in the vicinity of the threshold(s) as the sample size increases. This is not much comfort to the analyst with one finite sample. Non-parametric RD methods are profligate with data, as ultimately most of the data is not used in forming treatment effect estimates. RD designs result in only local estimates of the derivative of the response function. For this reason, unless the ultimate treatment is really discrete, RD designs do not offer a solution to the marketing analytics problem of optimization. RD designs may be helpful to corroborate estimates based on response models fit to the entire dataset (the RD estimate and the derivative of the response function at the threshold should be comparable).
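A sharp RD estimate of the kind described above can be sketched on simulated loyalty-program data. Everything is hypothetical: the response is smooth in the running variable (play volume), the perk adds a discrete jump of 2 at the threshold, and only observations within a bandwidth of the cutoff are used, illustrating both the local nature of the estimate and how profligate the method is with data.

```python
import random

rng = random.Random(5)
cutoff, jump = 0.0, 2.0
data = []
for _ in range(50_000):
    r = rng.uniform(-1, 1)                   # running variable (play volume)
    treated = r >= cutoff                    # perk granted at the threshold
    y = 3.0 + 1.5 * r + (jump if treated else 0) + rng.gauss(0, 1)
    data.append((r, y))

def intercept_at_zero(points):
    """OLS intercept (fitted value at r = 0) for the given (r, y) points."""
    n = len(points)
    mr = sum(r for r, _ in points) / n
    my = sum(y for _, y in points) / n
    srr = sum((r - mr) ** 2 for r, _ in points)
    sry = sum((r - mr) * (y - my) for r, y in points)
    slope = sry / srr
    return my - slope * mr

h = 0.2  # bandwidth: only points within h of the cutoff are used
left = [(r, y) for r, y in data if -h <= r < cutoff]
right = [(r, y) for r, y in data if cutoff <= r <= h]
rd_estimate = intercept_at_zero(right) - intercept_at_zero(left)
```

Only a fifth of the observations enter the estimate, and the result says nothing about the response away from the threshold, which is the sense in which RD delivers only a local answer.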

4.7 Randomized experimentation vs. control strategies

Recent work in marketing compares inferences based on various control strategies (including propensity scores and various "matching" or synthetic control approaches) with the results of large scale randomized experiments performed at Facebook. Gordon and Zettelmeyer (2017) find that advertising effects estimated from observational methods do not agree very closely with those based on randomized experimentation in the context of ad campaigns evaluated on the Facebook ad platform. If various control strategies are inadequate, then we might expect ad effects estimated from observational data to be larger than estimates based on randomized experimentation (at least up to sampling variation). Gordon and Zettelmeyer do not find any consistent pattern of this sort. They find estimates based on observational data to be, in some cases, smaller than those based on experimentation, with non-overlapping confidence intervals. This result is difficult to understand; it implies that there are important unobservables which are positively related to ad exposure and negatively related to ad effects. However, it is pretty clear that the jury is still out on the efficacy of observational methods, as Eckles and Bakshy (2017) find that observational methods (propensity scores) produce effect estimates close to those obtained from randomized experimentation in a similar context involving estimation of peer effects with Facebook data. It is possible that Facebook ad campaign "optimization" makes the comparison between observational effect estimates and randomized trial results less direct than Gordon and Zettelmeyer imply.

4.8 Moving beyond average effects

We live in a world of heterogeneous treatment effects in which each consumer, for example, has a different response to the same ad campaign. In the past, the emphasis in economics has been on estimating some sort of average treatment effect, which is thought to be adequate for policy evaluation. Clearly, the distributional effects of policies are also important and, while the randomized experiment does identify the average treatment effect with minimal assumptions, randomized experimentation does not identify the distribution of effects without imposing additional assumptions.

In marketing applications, heterogeneity assumes even greater importance than in economic policy evaluation. This is because policies in marketing are not applied uniformly to a subset of consumers but, rather, include the possibility of targeting policies based on individual treatment effects. A classic example is the direct marketing problem of deciding to whom a catalog or offer should be sent from the very large set of customers whose characteristics are summarized in the "house" file of past order and response behavior. Classically, direct marketers built standard marketing response models in which order response to a catalog or offer is modeled as a function of the huge set of variables that can be constructed from the house data file. This raises two inference problems. First, the model-builder must have a way of selecting from a set of variables that may be even larger than the number of observations. Second, the model-builder should recognize that there may be unobservables that create the classic selection bias problem. The selection bias problem can be particularly severe when the variables used as "controls" are simply summaries of past response behavior, as they must be, by construction, with house file data. How then does randomization help the model-builder? If there is a data set in which exposure to the marketing action is purely random, then there are no selection bias problems and there is nothing wrong with using regression-like methods to estimate or predict response to the new offering (i.e., "optimal targeting"). The problem then becomes a more standard non-parametric modeling problem of selecting the most efficient summaries of past behavior to be included as controls in the response model. Hitsch and Misra (2018) compare a number of different methods for estimating heterogeneous treatment effects based on a randomized trial and evaluate the various estimators with respect to their potential profitability.
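The logic of targeting on heterogeneous effects estimated from a randomized trial can be sketched in a toy example. The segments, effect sizes, and mailing-cost rule are all hypothetical; the point is only that with random assignment, per-segment differences in means are unbiased conditional effect estimates that can drive the targeting decision.

```python
import random
import statistics

rng = random.Random(6)
true_effects = {"lapsed": 2.0, "recent": 0.0}   # hypothetical effect by segment
records = []
for _ in range(40_000):
    seg = rng.choice(["lapsed", "recent"])
    treat = rng.random() < 0.5                  # randomized offer
    y = 5.0 + (true_effects[seg] if treat else 0) + rng.gauss(0, 2)
    records.append((seg, treat, y))

def segment_effect(records, seg):
    """Difference in means within a segment (valid under randomization)."""
    t = [y for s, d, y in records if s == seg and d]
    c = [y for s, d, y in records if s == seg and not d]
    return statistics.mean(t) - statistics.mean(c)

cate = {seg: segment_effect(records, seg) for seg in true_effects}
mail_cost = 0.5
target = [seg for seg, eff in cate.items() if eff > mail_cost]  # whom to mail
```

With observational (selected) exposure, the same per-segment contrasts would be contaminated by selection bias, which is why the randomized trial is the natural input to the targeting exercise described above.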

5 Instruments and endogeneity23
The problem of causal inference and the potential outcomes framework has recently assumed greater importance in the economics literature, but that is not to say that the problem of causal inference has only recently been addressed. The original concern of the Cowles Commission was to obtain consistent estimates of "structural" parameters using only observational data, as well as the recognition that methods that assume right-hand-side variables are exogenous may not be appropriate in many applications. In most applications, the "selection" bias or "selection on unobservables" interpretation is appropriate, and econometricians have dubbed this the "endogeneity" problem. One basic approach to dealing with this problem is to find some way of partitioning the variation in the right-hand-side variable so that some of the variation can be viewed as though it were random. This involves selection of an instrument. In this section, we provide a detailed discussion of the instrumental variables approach.

23 This section was adapted in large part from Rossi (2014b).



CHAPTER 2 Inference for marketing decisions

As we have indicated, Instrumental Variable (IV) methods do not use all of the variation in the data to identify causal effects, but instead partition the variation into that which can be regarded as “clean” or as though generated via experimental methods and that which is “contaminated” and could result in endogeneity bias. “Endogeneity bias” is almost always defined as the asymptotic bias for an estimator which uses all of the variation in the data. IV methods are only asymptotically unbiased if the instruments are valid instruments. Validity is an unverifiable assumption. Even if valid, IV estimators can have poor sampling properties including fat tails, high RMSE, and bias. While most empirical researchers may recall that the validity assumption is important from their econometrics training, the poor sampling properties of IV estimators are not well appreciated. Careful empirical researchers are aware of some of these limitations of IV methods and, therefore, sometimes view the IV method as a form of sensitivity analysis. That is, estimates of causal effects using standard regression methods are compared with estimates based on IV procedures. If the estimates are not appreciably different, then some conclude that endogeneity bias is not a problem. While this procedure is certainly more sensible than abandoning regression methods altogether, it is based on the implicit assumption that the IV method uses valid instruments. If the instruments are not valid, then the differences between standard regression style estimates and IV estimates don’t have any bearing on the existence or extent of endogeneity bias. Closely related to the problem of endogeneity bias is the problem of omitted variables in cross-sectional analyses or pooled analyses of panel data. Many contend that there may exist unobservable variables that a set of control variables, no matter how exhaustive, cannot control for. 
For this reason, researchers often use a Fixed Effects (hereafter FE) approach in which cross-sectional unit specific intercepts are included in the analysis. In a FE approach, the slope coefficients on variables of interest are identified using only "within" variation in the data. Cross-sectional variation is thrown out. Advocates for the FE approach argue that, in contrast to IV methods, the FE approach does not require any assumptions beyond those already used by the standard linear regression analysis. The validity of the FE approach depends critically on the assumption of a linear model and the absence of measurement error in the independent variables.24 If there is measurement error in the independent variables, then the FE approach will generally magnify the errors-in-variables bias.

24 If lagged dependent variables are included in the model, then the standard fixed effects approach is invalid; see Narayanan and Nair (2013); Nickell (1981).

5.1 The omitted variables interpretation of "endogeneity" bias
In marketing applications, the omitted variable interpretation of endogeneity bias provides a very useful intuition. In this section, we will briefly review the standard omitted variables analysis and relate this to endogeneity bias. For those familiar with the omitted variables problem, this section will simply serve to set notation and provide a very brief review (see also treatments in Section 4.3 of Wooldridge, 2010 or Section 3.2.2 of Angrist and Pischke, 2009). Consider a linear model with one independent variable (note: the intercept is removed for notational simplicity):

$$y_i = \beta x_i + \varepsilon_i \tag{40}$$


The least squares estimator from a regression of y on x will consistently estimate the parameters of the conditional expectation of y given x under the restriction that the conditional expectation is linear in x. However, the least squares estimator will converge to β only if E[ε|x] = 0 (or cov(x, ε) = 0):

$$\operatorname{plim}\frac{x'y}{x'x} = \beta + \left(\operatorname{plim}\frac{x'x}{N}\right)^{-1}\operatorname{plim}\frac{x'\varepsilon}{N} = \beta + Q \times \operatorname{cov}(x,\varepsilon)$$

Here Q⁻¹ = plim x′x/N. Thus, least squares will consistently estimate the "structural" parameter β only if (40) can be considered a valid regression equation (with an error term that has a conditional expectation of zero). If E[ε|x] ≠ 0, then least squares will not be a consistent estimator of β. This situation can arise if there is an omitted variable in the equation. Suppose there exists another variable, w, which belongs in the equation in the sense that the multiple regression of y on x and w is a valid equation:

$$y_i = \beta x_i + \gamma w_i + \varepsilon_i,\qquad E[\varepsilon|x,w] = 0$$

The least squares regression of y on x alone will consistently recover the parameters of the conditional expectation of y given x, which will not necessarily be β:

$$E[y|x] = \beta x + E[\gamma w + \varepsilon|x] = \beta x + \gamma E[w|x] = \beta x + \gamma\pi x = \delta x$$

Here π is the coefficient of w in the conditional expectation of w given x. If π ≠ 0, then the least squares estimator will not consistently recover β (sometimes called the structural parameter) but instead will recover δ. The intuition is that, in the simple regression of y on x, least squares estimates the effect of x without controlling for w. This estimate confounds two effects: (1) the direct effect of x (β) and (2) the indirect effect of x (γπ). The indirect effect (which is non-zero whenever x and w are correlated) also has a very straightforward interpretation: for each unit change in x, w will change by π units and this will, in turn, change y (on average) by γπ units. In situations where δ ≠ β, there is an omitted variable bias. The solution, which is feasible only if w is observable, is to run the multiple regression of y on x and w. Of course, the multiple regression does not use all of the variation in x to estimate the multiple regression coefficient – only that part of the variation in x which is uncorrelated with w. Thus, we can see that a multiple regression method is more




demanding of the data in the sense that only part of the variation of x is used. In a true randomized experiment, there is no omitted variable bias because the values of x are assigned randomly and, therefore, are uncorrelated by definition with any other variable (observable or not). In the case of the randomized experiment, the only motivation for bringing in other covariates is to reduce the size of the residual standard error which can improve the precision of estimation. However, if the simple regression model produces statistically significant results, there is no reason for adding covariates. The standard recommendation for limiting omitted variable bias is to include as many “control” variables or covariates as possible. For example, suppose that we observe demand for a given product across a cross-section of markets. If we regress quantity demanded on price across these markets, a possible omitted variable bias is that there are some markets where there is a higher demand for the product than others and that price is set higher in those markets with higher demand. This is a form of omitted variable bias where the omitted variable is some sort of indicator of market demand conditions. To avoid omitted variable bias, the careful researcher would add covariates (such as average income or wealth measures) which seek to “control” or proxy for the omitted demand variable and use a multiple regression. There is a concern that these control or proxy variables are only imperfectly related to true underlying demand conditions which are never perfectly predicted or “observable.”
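The omitted variable algebra above is easy to verify by simulation: the short regression of y on x recovers δ = β + γπ, while the long regression that controls for w recovers β. The parameter values below are arbitrary illustrations.

```python
# Numerical check of the omitted-variable result: regressing y on x alone
# recovers delta = beta + gamma*pi, not the structural beta.
import numpy as np

rng = np.random.default_rng(1)
n = 200000
beta, gamma, pi = 2.0, 1.5, 0.8

x = rng.normal(size=n)
w = pi * x + rng.normal(size=n)          # E[w|x] = pi * x
y = beta * x + gamma * w + rng.normal(size=n)

b_short = (x @ y) / (x @ x)              # simple regression of y on x
# Multiple regression of y on x and w (the "long" regression)
X = np.column_stack([x, w])
b_long = np.linalg.lstsq(X, y, rcond=None)[0]

print(b_short)        # close to beta + gamma*pi = 3.2
print(b_long[0])      # close to beta = 2.0
```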

5.2 Endogeneity and omitted variable bias
Most applied empirical researchers will identify "endogeneity bias" as arising from correlation between independent variables and error terms in a regression. This is to describe a cold by its symptoms. To develop a strong intuitive understanding, it is helpful to give an omitted variables interpretation. Assume that there is an unobservable variable, ν, which is related to both y and x:

$$y_i = \beta x_i + \alpha_y \nu_i + \varepsilon_{y,i} \tag{41}$$
$$x_i = \alpha_x \nu_i + \varepsilon_{x,i} \tag{42}$$

Here both ε_x and ε_y have zero conditional mean given ν and are assumed to be independent. In our example of demand in a cross-section of markets, ν represents some unknown demand shifter variable that allows some markets to have a higher level of demand for any given price than others. Thus, ν is an omitted variable and has the potential to cause omitted variable bias if ν is correlated with x. The model in (42) builds this correlation in by constructing x from ν and another exogenous error term. The idea here is that prices are set partially as a function of this underlying demand characteristic, which is observable to the firm but not observable to the researcher. In the regression of y on x, the error term is now α_y ν_i + ε_{y,i}, which is correlated with x. This form of omitted variable bias is called endogeneity bias. The term "endogeneity" comes from the notion that x is no longer determined "exogenously" (as if via an experiment) but is jointly determined along with y.


We can easily calculate the endogeneity bias by taking conditional expectations (or linear projections) of y given x:

$$E[y|x] = \beta x + E[\alpha_y \nu + \varepsilon_y|x] = \beta x + \alpha_y\left(\frac{\alpha_x \sigma_\nu^2}{\alpha_x^2 \sigma_\nu^2 + \sigma_{\varepsilon_x}^2}\right)x \tag{43}$$

The term α_y(α_x σ_ν²/(α_x² σ_ν² + σ_{ε_x}²)) is simply the regression coefficient from a regression of the composite error term (including the unobservable) on x. The endogeneity bias is thus the coefficient on x in the second term of (43). Whenever the unobservable has variation which comprises a large fraction of the total variation in x, and the unobservable has a large effect on y, the endogeneity bias will be large. If we go back to our example of price endogeneity in a cross-section of markets, this would mean that the demand differences across markets would have to be large relative to other influences that shift price. In addition, the influence of the unobservable demand shifter on demand (y) must be large.
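A quick simulation confirms the bias expression: with x built from the unobservable ν as in (41)-(42), the OLS coefficient converges to β plus the endogeneity bias term. All parameter values below are illustrative.

```python
# Check of the endogeneity-bias expression in (43): OLS of y on x converges
# to beta + alpha_y*alpha_x*sigma_nu^2 / (alpha_x^2*sigma_nu^2 + sigma_ex^2).
import numpy as np

rng = np.random.default_rng(2)
n = 500000
beta, a_y, a_x = 1.0, 2.0, 1.5
s_nu, s_ex = 1.0, 0.5

nu = rng.normal(scale=s_nu, size=n)      # unobservable demand shifter
x = a_x * nu + rng.normal(scale=s_ex, size=n)
y = beta * x + a_y * nu + rng.normal(size=n)

b_ols = (x @ y) / (x @ x)
bias = a_y * a_x * s_nu**2 / (a_x**2 * s_nu**2 + s_ex**2)
print(b_ols, beta + bias)                # the two values nearly coincide
```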

5.3 IV methods
As we have seen, the "endogeneity" problem is best understood as arising from an unobservable that is correlated both with the error in the "structural" equation and with one or more of the right-hand-side variables in this equation. Regression methods were originally designed for experimental data where the x variable was chosen by the investigator as part of the experimental design. For observational data, this is not true and there is always the danger that there exists some unobservable variable which has been omitted from the structural equation. This makes a concern for endogeneity a generic criticism which can always be applied. The ideal solution to the endogeneity problem would be to conduct an experiment in which the x variable is, by construction, uncorrelated via randomization with any unobservable. Short of this ideal, researchers opt to partition the variation in the x variable25 into two parts: (1) variation that is "exogenous" or unrelated to the structural equation error term and (2) variation that might be correlated with the error term. Of course, this partition always exists; the only question is whether or not the partition can be accessed by the use of observable variables. If such an observable variable exists, then it must be correlated with the x variable but it must not enter the structural equation. Such a variable is termed an "instrumental variable." The idea of an instrument is that this variable moves around x but does not affect y in a direct way, only indirectly via x. Of course, there can be many instrumental variables.

25 For simplicity, I will consider the case of only one right-hand-side endogenous variable. There is no additional insight gained from the multiple rhs variable case, and the great majority of applied work only considers endogeneity in one variable.




5.3.1 The linear case
The case of a linear structural equation and linear instrumental variable model provides the intuition for the general case and also includes many of the empirical applications of IV methods. However, it should be noted that due to the widespread use of choice models in marketing applications, there is a much higher incidence of the use of nonlinear models. We consider nonlinear choice models in Section 5.7. Eqs. (44) and (45) constitute the linear IV model:

$$y = \beta x + \gamma' w + \varepsilon_y \tag{44}$$
$$x = \delta' z + \varepsilon_x \tag{45}$$

(44) is the structural equation. The focus is on estimation of the "structural" parameter, β, avoiding endogeneity bias. There is the possibility that there are other variables in the "structural" equation which are exogenous in the sense that we assume that E[ε_y|w] = 0. If these variables are comprehensive enough, meaning that almost all of the variation in the unobservable that is at the heart of the endogeneity problem can be explained by w, then the "endogeneity" problem ceases to be an issue. The regression methods will only use the variation in x that is independent of w and, under the assumption that the w controls are complete, there should be no endogeneity problem. For the purpose of this exposition, we will assume that E[ε_y|x, w] = f(x) ≠ 0, or that we still have an endogeneity problem. The second equation (45) is just a linear projection of x on the set of instrumental variables and is often called the instruments or "first-stage" equation. In a linear model, the statement E[ε_y|x, w] = 0 is equivalent to corr(ε_x, ε_y) = 0. In the omitted variable interpretation, this correlation in the equation errors is brought about by a common unobservable. As the correlation between the errors increases, the "endogeneity bias" becomes more severe. The critical assumption in the linear IV model is that the instrumental variables, z, do not enter into the structural equation. This means that the instruments only have an indirect effect on y via movement in x but no direct effect. This restriction is often called the exclusion restriction or sometimes the over-identification restriction. Unfortunately, there is no way to "test" the exclusion restriction because the model in which the z variables enter both equations is not identified.26

26 The so-called "Hausman" test requires at least one instrument for which the investigator must assume the exclusion restriction holds.

5.3.2 Method of moments and 2SLS
There are a number of ways to motivate inference for the linear IV model in (44)-(45). The most popular is the method of moments approach. For the sake of brevity and notational simplicity, consider the linear IV model with only one instrument and no other "exogenous" variables in the structural equation. The method of moments estimator exploits the assumption that z (now just a scalar r.v.) is uncorrelated or orthogonal to the structural equation error. This is called a moment condition and involves an assumption about the population or data generating model that E[ε_y z] = 0. The method of moments principle defines an estimator by minimizing the discrepancy between the population and sample moments:

$$\hat{\beta}_{MM} = \arg\min_\beta \left| E[\varepsilon_y z] - \frac{(y - \beta x)'z}{N} \right| = \frac{z'y}{z'x} \tag{46}$$
Here y, x, z are N × 1 vectors of the observations. It is easy to see that this estimator is consistent (because we assume E[ε_y z] = 0, so that plim z′ε_y/N = 0) and asymptotically normal. If the structural equation errors are uncorrelated and homoskedastic, it can be shown (see, for example, Hayashi, 2000, Section 3.8) that the particular method of moments estimator in (46) is the optimal Generalized Method of Moments estimator. If the structural equation errors are conditionally heteroskedastic and/or autocorrelated, then the estimator above is no longer optimal and can be improved upon. It should be emphasized that when econometricians say that an estimator is optimal, this only means that the estimator has an asymptotic distribution with variance not exceeding that of any other estimator. This does not mean that, in finite samples, the method of moments estimator has better sampling properties than any other estimator. In particular, even estimators with asymptotic bias such as least squares can have lower mean-squared error than IV estimators. Another way of motivating the IV estimator for the simple linear IV model is the principle of Two Stage Least Squares (2SLS). The idea of Two Stage Least Squares is much the same as how it is possible to perform a multiple regression via a sequence of simple regressions. The "problem" with the least squares estimator is that some of the variation in x is not exogenous and is correlated with the structural equation error. The instrumental variables can be used to purge x of any correlation with the error term. The fitted values from a regression of x on z will be uncorrelated with the structural equation errors. Thus, we can use the fitted values from a "first-stage" regression of x on z and regress y on the fitted values from this first stage (this is the second-stage regression):

$$x = \hat{x} + e_x = \hat{\delta}'z + e_x \tag{47}$$
$$y = \hat{\beta}_{2SLS}\,\hat{x} + e_y$$

This procedure yields the identical estimator as the MM estimator in (46). If there is more than one instrument, more than one rhs endogenous variable, or if we include a matrix of exogenous variables in the structural equation, then both procedures generalize, but the principle of utilizing the assumption that there exists a valid set of instruments, and of using only that portion of the rhs endogenous variable that is accounted for by the instruments, remains the same.
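In the just-identified case with no intercepts, the numerical equivalence of the moment estimator (46) and the two-stage procedure can be checked directly; the sketch below uses an illustrative data-generating process with a common unobservable.

```python
# Sketch: the method-of-moments IV estimator z'y/z'x and the two-stage
# least squares procedure coincide with one instrument and no intercepts.
# The data-generating values are illustrative.
import numpy as np

rng = np.random.default_rng(3)
n = 5000
z = rng.uniform(size=n)                       # instrument
nu = rng.normal(size=n)                       # common unobservable
x = 2.0 * z + nu + rng.normal(size=n)         # first stage
y = -2.0 * x + nu + rng.normal(size=n)        # structural equation, beta = -2

b_mm = (z @ y) / (z @ x)                      # z'y / z'x

# 2SLS without intercepts: first-stage fitted values, then second stage
xhat = z * ((z @ x) / (z @ z))                # fitted values from x on z
b_2sls = (xhat @ y) / (xhat @ xhat)

print(b_mm, b_2sls)                           # numerically identical
```

Note that OLS of y on x would be badly biased here because the unobservable `nu` enters both equations; the instrument isolates the variation in x that comes from z alone.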

5.4 Control functions as a general approach
A very useful way of viewing the 2SLS estimator is as a special case of the "control function" approach to obtaining an IV estimator. The control function interpretation of 2SLS comes from the fact that the multiple regression coefficient on x is estimated using only that portion of the variation of x which is uncorrelated with the other variables in the equation. If we put a regressor in the structural equation which contains only that part of x which is potentially correlated with ε_y, then the multiple regression estimator would be a valid IV estimator. In fact, the 2SLS estimator can also be obtained by regressing y on x as well as the residual from the first-stage IV regression:

$$y = \hat{\beta}_{2SLS}\,x + c\,e_x$$


e_x is the residual from (47). Petrin and Train (2010) observe that the same idea can be applied to "control" for, or eliminate (at least asymptotically), endogeneity bias in a demand model with a potentially endogenous variable. For example, the control function approach can work even if the demand model is a nonlinear model such as a choice model. If x is a choice characteristic that might be considered potentially endogenous, then one can construct "control" functions from valid instruments and achieve the effects of an IV estimator simply by adding these control functions to the nonlinear model. Since, in non-linear models, the key assumption is not a zero correlation but conditional independence, it is necessary not just to project x on a linear function of the instruments, but to estimate the conditional mean function, E[x|z] = f(z). The conditional mean function is of unspecified form and this means that we need to choose functions of the instruments that can approximate any smooth function. Typically, polynomials in the instruments of high order should be sufficient. The residual, e = x − f̂(z), is created and can be interpreted as that portion of x which is independent of the instruments. The controls included in the nonlinear model must also allow for arbitrary flexibility in the way in which the residual enters. Again, polynomials in the residual (or any valid set of basis functions) should work, at least for large enough samples, if we allow the polynomial order to increase with the sample size. The control function approach has a lot of appeal for applied workers, as all we have to do is a first-stage linear regression on polynomials in the instruments and simply add polynomials in the residual from this first stage to the nonlinear model.
For linear index models like choice models, this simply means that we can do one auxiliary regression and then use any standard method to fit the choice model, but with constructed independent variables. The ease of use of the control function approach makes it convenient for checking to see whether an instrumental variables analysis produces estimates that are much different. However, inference in the control function approach requires additional computations, as the standard errors produced by standard non-linear model software will be incorrect: they don't take into account that some of the variables are "constructed." It is not clear, from an inference point of view, that the control function approach offers any advantages over using the general GMM method, which computes valid asymptotic standard errors. The control function approach requires a number of assumptions to show consistency. However, there is some evidence that it will closely approximate the IV solution in the choice model situation.
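The control function logic is easiest to see in the linear case: adding the first-stage residual to the structural regression reproduces the IV estimate of β exactly. The sketch below uses an illustrative linear design; for a nonlinear choice model one would instead add polynomials in the residual, as described above.

```python
# Sketch of the control-function idea in the linear case: regressing y on x
# and the first-stage residual e_x gives the same beta as the IV estimator.
# The design values are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n = 5000
z = rng.uniform(size=n)
nu = rng.normal(size=n)
x = 2.0 * z + nu + rng.normal(size=n)
y = -2.0 * x + nu + rng.normal(size=n)       # structural beta = -2

# First stage (no intercept, to match the simple MM estimator z'y/z'x)
e_x = x - z * ((z @ x) / (z @ z))            # first-stage residual

# "Structural" regression of y on x and the control function e_x
X = np.column_stack([x, e_x])
b = np.linalg.lstsq(X, y, rcond=None)[0]

b_iv = (z @ y) / (z @ x)
print(b[0], b_iv)                            # coefficient on x equals the IV estimate
```

The residual e_x absorbs exactly the part of x that is correlated with the structural error, so the coefficient on x is estimated from "clean" variation only.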


5.5 Sampling distributions
The OLS estimator (conditional on the X matrix) is a linear estimator with a sampling distribution derived from the distribution of β̂_OLS − β = (X′X)⁻¹X′ε. If the error terms are homoskedastic and normal, then the finite sample distribution of the OLS sampling error is also normal. However, all IV estimators are fundamentally non-linear functions of the data. For example, the simple Method of Moments estimator (46) is a nonlinear function of the random variables. The proper way of viewing the linear IV problem is that, given a matrix of instruments, Z, the model provides the joint distribution of both y and x. Since x enters non-linearly, via the term (z′x)⁻¹, we cannot provide an analytical expression for the finite sample distribution of the IV estimator even if we make assumptions regarding the distribution of the error terms in the linear IV model (44)-(45). The sampling distribution of the IV estimator is approximated by asymptotic methods. This is done by normalizing by √N and applying a Central Limit Theorem:

$$\sqrt{N}\left(\hat{\beta}_{MM} - \beta\right) = \left(\frac{z'x}{N}\right)^{-1}\sqrt{N}\,\frac{z'\varepsilon_y}{N}$$

As N approaches infinity, the denominator of the MM estimator, z′x/N, converges to a constant by the Law of Large Numbers. The asymptotic distribution is entirely driven by the numerator, which has been expressed as √N times a weighted average of the error terms in the structural equation. The asymptotic distribution is then derived by applying a Central Limit Theorem to this average. Depending on whether the error terms are conditionally heteroskedastic or autocorrelated (in the case of time series data), a different CLT is used. However, the basic asymptotic normality results are derived by assuming that the sample is large enough that we can simply ignore the contribution of the denominator to the sampling distribution. While asymptotics greatly simplify the derivation of a sampling distribution, there is very good reason to believe that this standard method of deriving the asymptotic distribution is apt to be highly inaccurate under the conditions in which the IV estimator is often applied. The finite sampling distribution can deviate from the asymptotic approximation in two important respects: (1) there can be substantial bias in the sampling distribution of the IV estimator even if the model assumptions hold, and (2) the asymptotic approximation can be very poor and can dramatically understate the true sampling variability in the estimator. The simple Method of Moments estimator is a ratio of a weighted average of y to a weighted average of x:

$$\hat{\beta}_{MM} = \frac{z'y/N}{z'x/N}$$

The distribution of a ratio of random variables is very different from the distribution of a linear combination of random variables (the distribution of OLS). Even if the error terms in the linear IV model are homoskedastic and normal, the distribution of




FIGURE 1 Distribution of a ratio of normals.

the Method of Moments IV estimator is non-normal. The denominator is the sample covariance between z and x. If this sample covariance is small, then the ratio can assume large positive and negative values. More precisely, if the distribution of the denominator puts appreciable mass near zero, then the distribution of the ratio will have extremely fat tails. The asymptotic distribution uses a normal distribution to approximate a distribution which has much fatter tails than the normal distribution. This means that the normal asymptotic approximation can dramatically understate the true sampling variability. To illustrate how ratios of normals can fail to have a normal distribution, consider the distribution of a ratio of an N(1, .5) to an N(.1, 1) random variable.27 The distribution is shown by the magenta histogram in Fig. 1 and is revealed to be bimodal, with the positive mode having slightly more mass. This distribution exhibits massive outliers and the figure only shows the histogram of the data trimmed to remove the top and bottom 1 per cent of the observations. The thick left and right tails are generated by draws from the denominator normal distribution which are close to the origin. The standard asymptotic approximation to the distribution of IV estimators simply ignores the denominator, which is supposed to converge to a constant. The asymptotic approximation is shown by the green density curve in the figure. Clearly, this is a poor approximation that ignores the other mode and under-estimates variability. The dashed light yellow line in the figure represents a normal approximation based on the actual sample moments of the ratio of normals. The fact that this approximation is so spread out is another way of emphasizing that the ratio of normals has very fat tails. The only reasonable normal approximation is shown by the medium dark orange curve, which is fit to the observed interquartile range. Even this approximation misses the bi-modality of the actual distribution. Of course, the approximation based on the IQ range is not available via asymptotic calculations. The degree to which the ratio of normals can be well-approximated by a normal distribution depends on both the location and spread of the distribution. Obviously, if the denominator is tightly distributed around a non-zero value, then the normal approximation can be highly accurate. The intuition that we have established is that when the denominator has a spread-out distribution and/or places mass near zero, the standard asymptotic approximation will fail for IV estimators. This can happen in two conditions: (1) in small samples and (2) where the instruments are "weak" in the sense that they explain only a small portion of the variation in x. Both cases are really about lack of information. The sampling distributions of IV estimators become very spread out with fat tails when there is little information about the true causal effect in the data. "Information" should be properly measured by the total covariance of the instruments with x. This total covariation can be "small" even in what appear to be "large" samples when instruments have only weak explanatory power. In the next section, we will explore the boundaries of the "weak" instrument problem.

27 Here the second argument in the normal distribution is the standard deviation.
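A short simulation of the ratio shown in Fig. 1 makes the fat-tail point concrete; the moment and quantile summaries below are illustrative diagnostics, not from the chapter.

```python
# Draws of N(1, .5) / N(.1, 1): the sample standard deviation is dominated
# by draws where the denominator is near zero, while the interquartile
# range stays moderate -- the signature of extremely fat tails.
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
num = rng.normal(1.0, 0.5, size=n)   # numerator: N(1, .5)
den = rng.normal(0.1, 1.0, size=n)   # denominator: N(.1, 1), mass near zero
ratio = num / den

iqr = np.quantile(ratio, 0.75) - np.quantile(ratio, 0.25)
print(np.std(ratio), iqr)            # std vastly exceeds any IQR-based scale
```

For a normal distribution the standard deviation is about 0.74 times the interquartile range; here the two scales differ by orders of magnitude, which is why a normal asymptotic approximation badly understates the tails.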

5.6 Instrument validity
One point that is absent from the econometrics literature is that the sampling distributions of IV estimators are only considered conditional on the validity of the instruments. This is an untestable assumption which is certainly violated in many datasets. This form of mis-specification is much more troubling than other forms of model mis-specification such as non-normality of the error terms, conditional heteroskedasticity, or non-linearity. For each of these mis-specification problems, we have tests for mis-specification and alternative estimators. There are also methods to provide inference (i.e. standard errors and confidence intervals) which are robust to model mis-specification for conditionally heteroskedastic, auto-correlated, and non-normal errors. There are no methods which are robust to the use of invalid instruments. To illustrate this point, consider the sampling distribution of an IV estimator based on an invalid instrument. We simulate data from the following model:

$$y = -2x - z + \varepsilon_y,\qquad x = 2z + \varepsilon_x$$
$$\begin{pmatrix}\varepsilon_x\\ \varepsilon_y\end{pmatrix} \sim N\left(0,\ \begin{bmatrix}1 & .25\\ .25 & 1\end{bmatrix}\right);\qquad z_i \sim \text{Unif}(0,1)$$





FIGURE 2 Sampling distributions of estimators with invalid instruments.

This is a situation with a relatively strong instrument (the population R-squared for the regression of x on z is about .25). Here N = 500, which is a large sample in many cross-sectional contexts. The instrument is invalid but with a smaller direct effect, −1, than indirect effect, −4. Moreover, the structural parameter is also larger in magnitude than the direct effect. Fig. 2 shows the sampling distribution of the method of moments estimator and the standard OLS estimator. Both estimators are biased and inconsistent. Moreover, the IV estimator has inferior sampling properties, with a root mean-squared error of more than seven times that of the OLS estimator. Since we can't know if the instruments are valid, the argument that the IV estimator should be preferred because it is consistent conditional on validity is not persuasive. Conley et al. (2012) consider the problem of validity of instruments and use the term "plausibly exogenous." That is to say, except for true random variation, it is impossible to prove that an instrument is valid. In most situations, the best that can be said is that the instrument is approximately valid. Conley et al. (2012) define this as an instrument which does not exactly satisfy an exclusion restriction (i.e. the assumption of no direct effect on the response variable) but which has a small direct effect relative to the indirect effect. From both sampling and Bayesian points of view, Conley et al. (2012) argue that a sensitivity analysis with respect to the exclusion restriction can be useful. For example, if minor (i.e. small) violations of the exclusion restriction do not fundamentally change inferences regarding the key effects, then Conley et al. (2012) consider the analysis robust to violations of instrument validity within that range. For marketing applications, this seems to be a very useful framework. We do not expect any instrument to exactly satisfy the exclusion restriction, but we might expect the instrument to be "plausibly exogenous" in the sense of a small violation. The robustness or sensitivity checks developed by Conley et al. (2012) help assess if the findings are critically sensitive to violations of exogeneity in a plausible range. This provides a useful way of discussing and evaluating the instrument validity issue that was absent in the econometrics literature and is of great relevance to researchers in marketing, who rarely have instruments for which there are air-tight exogeneity arguments.
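The invalid-instrument experiment above can be approximated as follows. The replication details (number of replications, random seed) are my own, so the exact RMSE ratio will differ from the figure, but the qualitative ranking should emerge: both estimators are inconsistent, and the IV estimator is the worse of the two.

```python
# Illustrative replication of the invalid-instrument design:
# y = -2x - z + e_y, x = 2z + e_x, corr(e_x, e_y) = .25, z ~ Unif(0,1), N = 500.
import numpy as np

rng = np.random.default_rng(6)
n, reps = 500, 2000
cov = np.array([[1.0, 0.25], [0.25, 1.0]])
L = np.linalg.cholesky(cov)

b_iv, b_ols = [], []
for _ in range(reps):
    z = rng.uniform(size=n)
    e = rng.normal(size=(n, 2)) @ L.T        # correlated (e_x, e_y) pairs
    x = 2.0 * z + e[:, 0]
    y = -2.0 * x - z + e[:, 1]               # invalid: z enters y directly
    b_iv.append((z @ y) / (z @ x))
    b_ols.append((x @ y) / (x @ x))

rmse = lambda b: np.sqrt(np.mean((np.array(b) + 2.0) ** 2))
print(rmse(b_iv), rmse(b_ols))               # IV RMSE exceeds OLS RMSE
```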

5.7 The weak instruments problem

5.7.1 Linear models
Not only are instruments potentially invalid, there is a serious problem when instruments are "weak," that is, when they explain only a small portion of the variation in the rhs endogenous variable. In situations of low information regarding causal effects (either because of small samples or weak instruments or both), standard asymptotic distribution theory begins to break down. Asymptotic standard errors are no longer valid and are generally too small. Thus, confidence intervals constructed from asymptotic standard errors typically have much lower coverage rates than their nominal coverage probability. This phenomenon has spawned a large sub-literature in econometrics on the so-called weak or many instruments problem. In marketing applications, we typically do not encounter the "many" instruments problem in the sense that we don't have more than a handful of potential instruments.

There is a view among some applied econometricians that failure of standard asymptotics occurs only for very small values of the first-stage R-squared or when the F-stat for the first stage is less than 10. This view comes from a misreading of the excellent survey of Stock et al. (2002). The condition requiring the first-stage F-stat be > 10 arises in the problem with only one instrument (in general, the "concentration parameter," roughly K times the first-stage F-stat, should be used). Moreover, the results summarized in Stock et al. (2002) simply state that the "average asymptotic bias" will be less than 15 per cent in that case. This does not mean that confidence intervals constructed using the standard asymptotics will have good actual coverage properties (i.e., actual coverage close to nominal coverage). Nor does this result imply that there aren't finite sample biases of an even greater magnitude than these asymptotic biases.
The poor sampling properties of the IV estimator28 can easily be shown even in cases where the instruments have a modest but not small first-stage R-squared. We

28 Here I focus on the sampling properties of the estimator rather than the size of a test procedure. Simulations found in Hansen et al. (2008) show that coverage probabilities and bias can be very large even in situations where the concentration ratio is considerably more than 10.



CHAPTER 2 Inference for marketing decisions

FIGURE 3 “Weak” instruments sampling distribution: p = 1.

simulate from the following system:

y = −2x + εy
x = Zδ + εx

(εx, εy)′ ∼ N(0, Σ),  Σ = [ 1  .25 ; .25  1 ]

N = 100. Z is an N × p matrix of iid Unif(0, 1) draws. The δ vector is made up of p identical elements chosen to make the population first-stage R-squared equal to .10, using the formula δj = √(12ρ²/(p(1 − ρ²))), where ρ² is the desired R-squared value.

Fig. 3 shows the sampling distribution of the IV and OLS estimators in this situation with p = 1. The method of moments IV estimator has huge tails, causing it to have a much larger RMSE than OLS. OLS is slightly biased but without the fat tails of the IV estimator. Fig. 4 provides the distribution of the first-stage F statistics for 2000 replications. The vertical line in the figure is the median: more than 50 per cent of these simulated samples had F-stats greater than 10, showing the fallacy of this rule of thumb. Thus, for this case of only a moderately weak (but valid!) instrument,


FIGURE 4 Distribution of first-stage F-statistics.

the IV estimator would require a sample size approximately four29 times larger than that of the OLS estimator to deliver the same RMSE level. Lest the reader form the false impression that the IV estimator doesn't have appreciable bias, consider the case where there are 10 instruments instead of one but where all other parameters are held constant. Fig. 5 shows the sampling distributions in this case. The IV estimator now has both fat tails and finite sample bias.

The weak instruments literature seeks to improve on standard asymptotic approximations to the sampling distribution of the IV estimator. The literature focuses exclusively on improving inference, which is defined as obtaining testing and confidence interval procedures which have correct size. That is, the weak instruments literature assumes that the researcher has decided to employ an IV method and just wants a test or confidence interval with the proper size. This literature does not propose new estimators with improved sampling properties but merely seeks to develop improved asymptotic approximation methods. This literature is very large and has made considerable progress on obtaining test procedures with actual size very close to nominal size under a variety of assumptions. There are two major variants in this literature. One variant starts from what is called the Conditional Likelihood Ratio (CLR) statistic and builds a testing theory which is exact in the homoskedastic, normal case (conditional on the error covariance matrix) (see Moreira, 2003 as an example). The other variant uses the GMM approach to define a test statistic which is consistent in the presence of heteroskedasticity and does not rely on normal errors (see Stock and Wright, 2000). The GMM variant will never deliver exact results but is potentially more robust. Both the CLR and GMM methods will work very well when the averages of y and x used in the IV estimator


29 (.5/.24)² ≈ 4, the squared ratio of the IV RMSE (roughly .5) to the OLS RMSE (roughly .24).
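The weak-instrument simulation described above can be replicated with a short Monte Carlo; this is my sketch of the stated design (N = 100, one instrument, error correlation .25, first-stage R-squared .10), not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo sketch of the design in the text: N = 100, p = 1 instrument,
# error correlation .25, population first-stage R-squared .10, true
# structural coefficient -2.
N, p, R2, reps = 100, 1, 0.10, 2000
delta = np.sqrt(12 * R2 / (p * (1 - R2)))       # formula from the text
C = np.linalg.cholesky(np.array([[1.0, 0.25], [0.25, 1.0]]))

iv, ols, F = np.empty(reps), np.empty(reps), np.empty(reps)
for r in range(reps):
    Z = rng.uniform(size=(N, p))
    e = rng.normal(size=(N, 2)) @ C.T           # columns: (eps_x, eps_y)
    x = Z @ np.full(p, delta) + e[:, 0]
    y = -2.0 * x + e[:, 1]
    z = Z[:, 0]
    iv[r] = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]   # method-of-moments IV
    ols[r] = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    r2 = np.corrcoef(z, x)[0, 1] ** 2                 # first-stage fit
    F[r] = r2 * (N - 2) / (1 - r2)                    # first-stage F-stat

rmse = lambda est: np.sqrt(np.mean((est - (-2.0)) ** 2))
print("RMSE IV:", rmse(iv), "RMSE OLS:", rmse(ols), "median F:", np.median(F))
```

The IV sampling distribution exhibits the fat tails discussed in the text, the OLS estimator is slightly biased but much more concentrated, and the median first-stage F exceeds 10 even though the instrument is weak by any practical standard.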




FIGURE 5 “Weak” instruments sampling distribution: p = 10.

(see, for example, (46)) are approximately normal. This happens, of course, when the CLT sets in quickly. The performance of these methods is truly impressive even in small samples in the sense that the nominal and actual coverage of confidence intervals are very close. However, the intervals produced by the improved methods simply expose the weakness of the IV estimators in the first place; that is, the intervals can be very large (in fact, they can be of infinite length). The fact that proper size intervals are very large simply says that if you properly measure sampling error, it can be very large for IV estimators. This reflects the fact that an IV approach uses only a portion of the total sample variability or information to estimate the structural parameter.

5.7.2 Choice models
Much of the applied econometrics done in marketing is done in the context of a logit choice model of demand rather than a linear structural model. Much of the intuition regarding the problems with IV methods in linear models carries over to the nonlinear case. For example, the basic exclusion restriction that underlies the validity of an instrument also applies to a non-linear model. The idea that the instruments partition the variability of the endogenous rhs variable still applies. The major difference is that the GMM estimator is now motivated not just by the assumption that the structural errors are uncorrelated with the instruments but by a more fundamental notion that


the instruments and the structural errors are conditionally independent. Replacing zero conditional correlation with conditional independence means that the moment conditions used to develop the GMM approach can be generated by assuming that the error terms are orthogonal not just to the instruments but also to any function of the instruments. This allows a greater flexibility than in the linear IV model. In the linear IV model, we need as many (or more) instruments as there are included rhs endogenous variables to identify the model. However, in a nonlinear model such as the choice model, any function of the instruments is also a valid instrument and can be used to identify model parameters. To make this concrete, consider a very simple homogeneous logit model.

Pr(j|t) = exp(α′cj,t + β′mj,t + ξj,t) / Σj′ exp(α′cj′,t + β′mj′,t + ξj′,t)   (52)

Here ξj,t is an unobservable, mj,t are the marketing mix variables for alternative j observed at time t, and cj,t are the characteristics of choice alternative j. The endogeneity problem comes from the possibility that firms set the marketing mix variables with partial knowledge of the unobservable demand shock and, therefore, the marketing mix variables are possibly a function of the ξj,t. Since the choice model is a linear index model, this is the same as suggesting that the unobservables are correlated with the marketing mix variables. The IV estimator would be based on the assumption that there exists a matrix, Z, of observations on valid instruments which are conditionally independent of ξj,t.

E[ξt g(zt)] = 0 for any measurable function g(·)

where zt is the vector of the p instrumental variables. As a practical matter, this means that one can use as valid instruments any polynomial function of the z variables and interactions between the instruments, greatly expanding the number of instruments. However, the identification achieved by expanding the set of instruments in this manner derives primarily from the model functional form. To illustrate the problem with IV estimators for the standard choice model, we will consider the choice model in (52) along with a standard linear IV "first-stage" equation.

mt = δ′zt + νt

νt and ξt are correlated, giving rise to the classic omitted variable interpretation of the endogeneity problem. To examine the sampling properties of the IV estimator for this situation, we will consider the special case where there is only one endogenous marketing mix variable, there is only one instrument, and the choice model is a binary choice model. To generate the data, we will assume that the unobserved demand shocks are joint normal with the errors in the IV equation.

Pr(1) = exp(α + βm + ξ) / (1 + exp(α + βm + ξ))




FIGURE 6 Sampling distributions for logit models.

m = δ′z + ν

(ξ, ν)′ ∼ N(0, Σ),  Σ = [ 1  ρ ; ρ  1 ]

Here we have arbitrarily written the probability of choice alternative 1. We used the same parameters as in the simulation of the weak instruments problem for the linear IV with one instrument: N = 100, ρ = .25, and δ is set so that the first-stage R-squared is 0.10. Fig. 6 shows the sampling distribution of the standard ML estimator which ignores the endogeneity (shown by the darker "magenta" histogram). We used a control-function approach to compute the IV estimator for this problem under the assumption of a linear first stage: we regressed the endogenous rhs variable, m, on the instruments and used the residuals from this regression to create additional explanatory variables which were included in the logit model. In particular, we used the residual, the residual-squared, the residual-cubed, and the exp of the residual as control functions. The sampling distribution of this estimator is shown by the yellow histogram. The sampling performance of the IV estimator is considerably inferior to that of the MLE which ignores the endogeneity problem. The fat tails of the IV estimator contribute to an RMSE of about twice that of the MLE. The IV estimator appears to be approximately unbiased; however, this property goes away quickly if you increase the number of instruments, while the RMSE remains high.


5.8 Conclusions regarding the statistical properties of IV estimators
We have seen that an IV estimator can exhibit substantial finite sample bias and tremendous variability, particularly in the case of small samples, non-normal errors, and weak to moderate instruments. The failure of standard asymptotic theory applies not just to the extreme case of very weak instruments, but also to cases of moderate strength instruments. All of these results assume that the instruments used are valid. If there are even "small" violations of the exclusion restriction (i.e., the instruments have a direct effect on y), then the statistical performance of the IV estimator degrades even further.

The emphasis in the recent econometrics literature on instruments is on improved testing and confidence interval construction. This emphasis is motivated by a "theory-testing" mentality. That is, researchers want to test hypotheses regarding whether or not a causal effect exists. The emphasis is not on predicting y conditional on a change in x. This exposes an important difference between economics and marketing applications. In many (but not all) marketing applications, we are more interested in conditional prediction than in testing a hypothesis. If our goal is to help the firm make better decisions, then the first step is to help the firm make better predictions of the effects of changes in marketing mix variables. One may actually prefer estimators which do not attempt to adjust for endogeneity (such as OLS) for this purpose; OLS can have a much lower RMSE than an IV method. In sum, IV methods are costly to apply and prone to specification errors. This serves to underscore the need for caution and the requirement that arguments in support of potential endogeneity bias and instrument validity must be strong.

5.9 Endogeneity in models of consumer demand
Much empirical research in marketing is directed toward calibrating models of product demand (see, for example, the Chintagunta and Nair, 2011 survey and Chapter 1 of this volume). In particular, there has been a great deal of emphasis on discrete choice models of demand for differentiated products (for an overview, see pp. 4178–4204 of Ackerberg et al., 2007). Typically, these are simple logit models which allow marketing mix variables to influence demand and account for heterogeneity. An innovation of Berry et al. (1995) was to include a market-wide error term in this logit structure so that the aggregate demand system is not a deterministic function of product characteristics and marketing mix variables.

MS(j|t) = ∫ [ exp(α′cj + β′mjt + ξjt) / Σj′=1..J exp(α′cj′ + β′mj′t + ξj′t) ] p(α, β) dα dβ   (53)

There are J products observed either in T time periods or in a cross section of T markets. cj is a vector of characteristics of the jth product, mjt is a vector of marketing mix variables such as price and promotion for the jth product, and ξjt represents an error term which is often described as a "demand shock." The fact that consumers are heterogeneous is reflected by integrating the logit choice probabilities over a distribution of parameters which represents the distribution of preferences in the market. This basic model represents a sort of intersection between marketing and I/O and provides a useful framework to catalog the sorts of instruments used in the literature.

5.9.1 Price endogeneity
(53) provides a natural motivation for concerns regarding endogeneity using an omitted variables interpretation. If we could observe the ξjt variable, then we would simply include this variable in the model and we would be able to estimate the β parameters which represent the sensitivity to marketing mix variables. However, researchers do not observe ξjt, and it is entirely possible that firms have information regarding ξjt and set marketing mix variables accordingly. One of the strongest arguments made for endogeneity is the argument of Berry et al. (1995) that if ξjt represents an unobserved product characteristic (such as some sort of product quality), then we would expect that firms would set price as a function of ξjt as well as of the observed characteristics. This is a very strong argument when applied in marketing applications, as the observed characteristics of many consumer products are often limited to packaging, package size, and a subset of ingredients. For consumer durable goods, the observed characteristics are also limited, as it is difficult to quantify design, aesthetic, and performance characteristics. We might expect that price and unobserved quality are positively correlated, giving rise to a classic downward endogeneity bias in price sensitivity. This would result in what appear to be sub-optimal prices.

There are many possible interpretations of the ξjt terms other than the interpretation as an unobserved product characteristic. If demand is observed in a cross-section of markets, we might interpret the ξjt as unobserved market characteristics that make particular brands more or less attractive in that market. If the t index is time, then others have argued that the ξjt represent some sort of unobserved promotional or advertising shock. These arguments for endogeneity bias in the price coefficient have led to the search for valid instruments for the price variable.
The obvious place to look for instruments is the supply side, which consists of cost and competitive information. The idea here is that costs do not affect demand and therefore serve to push around price (via some sort of mark-up equation) but are uncorrelated with the unobserved demand shock, ξjt. Similarly, the structure of competition should be a driver of price but not of demand. If a particular product lies in a crowded portion of the product characteristics space, then we might expect smaller mark-ups than for a product that is more isolated. The problem with cost-based instruments is lack of variability and observability. For some types of consumer products, input costs such as raw material costs may be observable and variable, but other parts of marginal cost may be very difficult to measure. For example, labor costs, measured by the Bureau of Labor Statistics, are based on surveys with potentially high measurement error. Typically, those costs that are observable do not vary by product, so that input costs are not usable as instruments for the highly differentiated product categories studied in marketing applications.

If the data represent a panel of markets observed over time, then the suggestion of Hausman (1996) can be useful. The idea here is that the demand shocks are not


correlated across markets but that costs are.30 If this is the case, then the prices of products in other markets would be valid instruments. Hausman introduced this idea to get around the problem that observable input costs don't vary by product. To evaluate the usefulness and validity of the Hausman approach, one must take a critical view of what the demand shocks represent. If these error terms represent unobservable market level demand characteristics which do not vary over time, then simply including market fixed effects would eliminate the need for instruments. One has to argue that the unobserved demand shocks vary both by market and by time period in the panel. For this interpretation, authors often point to unobserved promotional efforts such as advertising and coupon drops. If these promotional efforts have a much lower frequency than the sampling frequency of the data (e.g., feature promotions are planned quarterly but we observe weekly demand data), then it is highly unlikely that these unobservables explain much of the variation in demand, and this source of endogeneity concern is weak.

For products with few observable characteristics and for cross-sectional data, Berry et al. (1995) make a strong argument for price endogeneity. However, their arguments for the use of characteristics of other products as potential instruments are not persuasive for marketing applications. Their argument is that the characteristics of competing products will influence mark-ups independent of demand shocks. This may be reasonable. However, the assumption that the observed characteristics are exogenous and set independently of the unobservable characteristic is very likely to be incorrect. Firms set the bundle of both observed and unobserved characteristics jointly. Thus, the instruments proposed by Berry et al. (1995) are unlikely to be valid.
With panel data, there is no need to use instruments, as simple product-specific fixed effects would be sufficient to remove the "endogeneity" bias problem as long as the unobserved product characteristics do not vary across time.

5.9.2 Conclusions regarding price endogeneity
Price endogeneity has received a great deal of attention in the recent marketing literature. There is no particular reason to single out price as the one variable in the marketing mix which has potentially the greatest problems of endogeneity bias. In fact, the source of variation in prices in most marketing datasets consists of cost variation (including wholesale price variation) and the ubiquitous practice of temporary price promotions or sales. Within the same market over time, it is hard to imagine what the unobservable demand shocks are that vary so much over time and by brand. Retailers set prices using mark-up rules and other heuristics that do not depend on market wide variables. Cost variables are natural price instruments but lack variation over time and by brand. Wholesale prices, if used as instruments, will confuse long and short run price effects. We are not aware of any economic arguments which can justify the use of lagged prices as instruments. In summary, we believe that, with panel or time-series data, the emphasis on price endogeneity has been misplaced.

30 The fact that, for many products, there are national advertising and promotional campaigns suggests that the Hausman idea will only work if advertising expenditure variables are included in the model.




5.10 Advertising, promotion, and other non-price variables
While the I/O literature has focused heavily on the possibility of price endogeneity, there is no reason to believe, a priori, that the endogeneity problem is confined to price. In packaged goods, demand is stimulated by various "promotional" activities which can include what amount to local forms of advertising: display signage, direct mail, and newspaper inserts. In the pharmaceutical and health care products industry, large and highly compensated sales forces "call on" doctors and other health care professionals to promote products (this is often called "detailing"). In many product categories, there is very little price variation but a tremendous expenditure of effort on promotional activities such as detailing. This means that, for many product categories, the advertising/promotional variables are more important than price.

An equally compelling argument can be made that these non-price marketing mix variables are subject to the standard "omitted" variable endogeneity bias problem. For example, advertising would seem to be a natural variable that is chosen as a function of demand unobservables. Others have suggested that advertising is determined simultaneously along with sales, as firms set advertising expenditures as a function of the level of sales. In fact, the classic article of Bass (1969) uses linear simultaneous equations models to capture this "feedback" mechanism for advertising. The standard omitted variables arguments apply no less forcefully to non-price marketing mix variables. This motivates a search for valid instruments for advertising and promotion. Other than the costs of advertising and promotion, there is no set of instruments that naturally emerges as valid and strong. Even the cost variables are unlikely to be brand- or product-specific and may vary only slowly over time, exacerbating the "weak" instruments problem.
We have observed that researchers have argued persuasively that some kinds of panel data can be used to infer causal effects of advertising, by using fixed effects to control for concerns that advertising allocations change over time or that specific markets receive allocations that depend on possible responsiveness to the ad or campaign in question (see, for example, Klapper and Hartmann, 2018). In the panel setting, researchers limit the source of variation so as to reduce concerns for endogeneity bias. The IV approach, in contrast, is to affirmatively seek out "clean" or exogenous variation. In the same context of measuring the return to Super Bowl ads, Stephens-Davidowitz et al. (2015) use whether or not the home team is in the Super Bowl as an explicit instrument (exposure to the ad changes because viewership increases when the home team is in the Super Bowl), which they argue is genuinely unpredictable or random at the point at which advertisers bid on Super Bowl slots. This is clearly a valid instrument in the same way as Angrist's Vietnam draft lottery is a valid instrument, and no further proof is required. However, such truly random instruments are extremely rare.

5.11 Model evaluation
The purpose of causal inference in marketing applications is to inform firm decisions. As we have argued, in order to optimize the actions of the firm, we must consider counterfactual scenarios. This means that the causal model must predict well in conditions that can be different from those observed in the data. The model evaluation exercise must validate the model's predictions across a wide range of different policy regimes. If we validate the model under a policy regime that is the same as or similar to the observational data, then that validation exercise will be uninformative or even misleading.

To see this point clearly, consider the problem of making causal inferences regarding a price elasticity. The object of causal inference is the true price elasticity in a simple log-log approximation.

ln Qt = α + η ln Pt + εt

Imagine that there is an "endogeneity" problem in the observational data in which the firm has been setting price with partial knowledge of the demand shocks which are in the error term. Suppose further that the firm raises price when it anticipates a positive demand shock. This means that an OLS estimate of the elasticity will be too small and we might conclude, erroneously, that the firm should raise its price even if the firm is setting prices optimally. Suppose we reserve a portion of our observational data for out-of-sample validation. That is, we will fit the log-log regression on observations 1, 2, . . . , T0, reserving observations T0 + 1, . . . , T for validation. If we were to compare the performance of the inconsistent and biased OLS estimator of the price elasticity with any valid causal estimate using our "validation" data, we would conclude that OLS is superior on anything like the MSE metric. This is because OLS is a projection-based estimator that seeks to minimize mean squared error. The only reason OLS would fare poorly in prediction in this sort of exercise is if the OLS model were highly over-parameterized so that the OLS procedure over-fits the data. However, the OLS estimator will yield non-profit-maximizing prices if used in a price optimization exercise because it is inconsistent for the true causal elasticity parameter.
Thus, we must devise a different validation exercise in evaluating causal estimates. We must either find different policy regimes in our observational data or we must conduct a validation experiment.
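The argument above is easy to verify numerically. In the sketch below (all parameter values are illustrative assumptions), prices respond positively to demand shocks; the biased OLS fit then beats a predictor built on the true elasticity in a held-out MSE comparison, even though its elasticity estimate is badly attenuated:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative log-log demand with endogenous pricing: the firm raises
# price when it anticipates a positive demand shock eps.
T, eta = 400, -2.0
eps = rng.normal(scale=0.5, size=T)                # demand shocks
lnP = 0.5 * eps + rng.normal(scale=0.2, size=T)    # price responds to shocks
lnQ = 1.0 + eta * lnP + eps

train, test = slice(0, 300), slice(300, T)
X = np.column_stack([np.ones(T), lnP])
b_ols = np.linalg.lstsq(X[train], lnQ[train], rcond=None)[0]

# A "valid causal estimate": the true elasticity, with the intercept refit.
b_true = np.array([np.mean(lnQ[train] - eta * lnP[train]), eta])

mse = lambda b: np.mean((lnQ[test] - X[test] @ b) ** 2)
print("OLS slope:", b_ols[1])
print("held-out MSE, OLS:", mse(b_ols), "vs true elasticity:", mse(b_true))
```

Because the validation data come from the same policy regime, the projection-based OLS fit wins the MSE comparison while delivering an elasticity far from −2, which is exactly why same-regime validation is uninformative about causal quality.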

6 Conclusions
Marketing is an applied field where the emphasis is on providing advice to firms on how to optimize their interaction with customers. The decision-oriented emphasis motivates an interest both in inference paradigms compatible with decision making and in response functions which are nonlinear. Much of the recent emphasis in econometrics is focused on estimating effects of a policy variable via linear approximations or even via a "local average treatment effect" (LATE). Unfortunately, local treatment effects are only a beginning of an understanding of policy effects and policy optimization. This means that, in marketing, we will have to impose something like a parametric model (which can be achieved via strong priors and non-parametric approaches) in order to make progress on the problem of optimizing marketing policies.




On the positive side, marketing researchers have access to an unprecedented amount of detailed observational data regarding the reactions of customers to an ever-increasing variety of firm actions (for example, the many types of advertising or marketing communications possible today). Large scale randomized experimentation holds out the possibility of valid causal inferences regarding the effects of marketing actions. Thus, the amount of both observational and experimental data at the disposal of a marketing researcher is truly impressive and growing as new sources of spatial (via phone or mobile device pinging) and other types of data become possible. Indeed, some data ecosystems such as Google/Facebook/Amazon in the U.S. and Alibaba/JD in China aim to amass data on purchases, search and product consideration, advertising exposure, and social interaction into one large dataset. At the same time, firms are becoming increasingly sophisticated in targeting customers based on information regarding preferences and responsiveness to marketing actions. This means that the evaluation of targeted marketing actions may be difficult or even impossible with even the richest observational data and may require experimentation.

We must use observational data to the greatest extent possible, as it is impossible to fully optimize marketing actions solely on the basis of experimentation. The emphasis in the econometrics literature has been on using only a subset of the variation in observational data so as to avoid concerns that invalidate causal inference. We can ill afford a completely purist point of view in which we use only a tiny fraction of the variation in our data to estimate causal effects or optimize marketing policies. Instead, our view is that we should restrict variation in observational data only when there are strong prior reasons to expect that use of a given dimension of variation will produce inconsistent estimates of marketing effects.
Experimentation must be combined with observational data to achieve the goals of marketing. It is highly unlikely that randomized experimentation will ever completely replace inferences based on observational data. Many problems in marketing are not characterized by attempts to infer about an average effect size but, rather, to optimize firm actions over a wide range of possibilities. Optimization cannot be achieved in any realistic situation only by experimental means. It is likely, therefore, that experiments should play a role in estimating some of the critical effects but models calibrated on observational data will still be required to make firm policy recommendations. Experiments could also be used to test key assumptions in the model such as functional form or exogeneity assumptions without requiring that policy optimization be the direct result of experimentation.

References
Ackerberg, D., Benkard, C.L., Berry, S., Pakes, A., 2007. Econometric tools for analyzing market outcomes. In: Handbook of Econometrics. Elsevier Science, pp. 4172–4271 (Chap. 63).
Allenby, G.M., Rossi, P.E., 1999. Marketing models of consumer heterogeneity. Journal of Econometrics 89, 57–78.
Angrist, J.D., Krueger, A.B., 1991. Does compulsory school attendance affect schooling and earnings? The Quarterly Journal of Economics 106, 979–1014.


Angrist, J.D., Pischke, J.-S., 2009. Mostly Harmless Econometrics. Princeton University Press, Princeton, NJ, USA.
Antoniak, C.E., 1974. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics 2 (6), 1152–1174.
Bass, F.M., 1969. A simultaneous equation study of advertising and sales of cigarettes. Journal of Marketing Research 6 (3), 291–300.
Bernardo, J.M., Smith, A.F.M., 1994. Bayesian Theory. John Wiley & Sons.
Berry, S., Levinsohn, J., Pakes, A., 1995. Automobile prices in market equilibrium. Econometrica 63 (4), 841–890.
Blake, T., Nosko, C., Tadelis, S., 2015. Consumer heterogeneity and paid search effectiveness: a large scale field experiment. Econometrica 83, 155–174.
Chen, Y., Yang, S., 2007. Estimating disaggregate models using aggregate data through augmentation of individual choice. Journal of Marketing Research 44, 613–621.
Chintagunta, P.K., Nair, H., 2011. Discrete-choice models of consumer demand in marketing. Marketing Science 30 (6), 977–996.
Conley, T.G., Hansen, C.B., Rossi, P.E., 2012. Plausibly exogenous. Review of Economics and Statistics 94 (1), 260–272.
Deaton, A., Cartwright, N., 2016. Understanding and Misunderstanding Randomized Controlled Trials. Discussion Paper 22595. NBER.
Dubé, J.-P., Hitsch, G., Rossi, P.E., 2010. State dependence and alternative explanations for consumer inertia. The Rand Journal of Economics 41 (3), 417–445.
Dubé, J.-P., Fox, J.T., Su, C.-L., 2012. Improving the numerical performance of static and dynamic aggregate discrete choice random coefficients demand estimation. Econometrica 80 (5), 2231–2267.
Dubé, J.-P., Hitsch, G., Rossi, P.E., 2018. Income and wealth effects in the demand for private label goods. Marketing Science 37, 22–53.
Dubé, J.-P., Misra, S., 2018. Scalable Price Targeting. Discussion Paper. Booth School of Business, University of Chicago.
Eckles, D., Bakshy, E., 2017. Bias and High-Dimensional Adjustment in Observational Studies of Peer Effects. Discussion Paper. MIT.
Fennell, G., Allenby, G.M., Yang, S., Edwards, Y., 2003. The effectiveness of demographic and psychographic variables for explaining brand and product use. Quantitative Marketing and Economics 1, 223–244.
Fruhwirth-Schnatter, S., 2006. Finite Mixture and Markov Switching Models. Springer.
Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B., 2004. Bayesian Data Analysis. Chapman and Hall.
George, E.I., McCulloch, R.E., 1997. Approaches for Bayesian variable selection. Statistica Sinica 7, 339–373.
Gilbride, T.J., Allenby, G.M., 2004. A choice model with conjunctive, disjunctive, and compensatory screening rules. Marketing Science 23 (3), 391–406.
Gordon, B.R., Zettelmeyer, F., 2017. A Comparison of Approaches to Advertising Measurement. Discussion Paper. Northwestern University.
Griffin, J., Quintana, F., Steel, M.F.J., 2010. Flexible and nonparametric modelling. In: Geweke, J., Koop, G., Dijk, H.V. (Eds.), Handbook of Bayesian Econometrics. Oxford University Press.
Hansen, C., Hausman, J., Newey, W., 2008. Estimation with many instrumental variables. Journal of Business and Economic Statistics 26 (4), 398–422.
Hansen, L.P., 1982. Large sample properties of generalized method of moments estimators. Econometrica 50 (4), 1029–1054.
Hartmann, W.R., Nair, H.S., Narayanan, S., 2011. Identifying causal marketing mix effects using a regression discontinuity design. Marketing Science 30, 1079–1097.
Hausman, J., 1996. The valuation of new goods under perfect and imperfect competition. In: Bresnahan, T., Gordon, R. (Eds.), The Economics of New Goods, vol. 58. University of Chicago, pp. 209–237.
Hayashi, F., 2000. Econometrics. Princeton University Press.
Heckman, J., Singer, B., 1984. A method for minimizing the impact of distributional assumptions in econometric models. Econometrica 52 (2), 271–320.



CHAPTER 2 Inference for marketing decisions

Heckman, J., Vytlacil, E.J., 2007. Econometric evaluation of social programs. In: Heckman, J., Leamer, E. (Eds.), Handbook of Econometrics, vol. 6B. Elsevier, pp. 4779–4874.
Hitsch, G., Misra, S., 2018. Heterogeneous Treatment Effects and Optimal Targeting Policy Evaluation. Discussion Paper. Booth School of Business, University of Chicago.
Hoch, S.J., Dreze, X., Purk, M.E., 1994. EDLP, Hi-Lo, and margin arithmetic. Journal of Marketing 58 (4), 16–27.
Imbens, G.W., Kolesar, M., 2016. Robust standard errors in small samples: some practical advice. Review of Economics and Statistics 98 (4), 701–712.
Imbens, G.W., Lemieux, T., 2008. Regression discontinuity designs: a guide to practice. Journal of Econometrics 142, 807–828.
Imbens, G.W., Rubin, D.B., 2014. Causal Inference. Cambridge University Press.
Jiang, R., Manchanda, P., Rossi, P.E., 2009. Bayesian analysis of random coefficient logit models using aggregate data. Journal of Econometrics 149, 136–148.
Johnson, G.A., Lewis, R.A., Nubbemeyer, E.I., 2017. Ghost ads: improving the economics of measuring online ad effectiveness. Journal of Marketing Research 54, 867–884.
Klapper, D., Hartmann, W.R., 2018. Super bowl ads. Marketing Science 37, 78–96.
Lewis, R.A., Rao, J.M., 2014. The Unfavorable Economics of Measuring the Returns to Advertising. Discussion Paper. NBER.
Lodish, L., Abraham, M., 1995. How T.V. advertising works: a meta-analysis of 389 real world split cable T.V. advertising experiments. Journal of Marketing Research 32 (2), 125–139.
Manchanda, P., Rossi, P.E., Chintagunta, P.K., 2004. Response modeling with nonrandom marketing-mix variables. Journal of Marketing Research 41, 467–478.
McFadden, D.L., Train, K.E., 2000. Mixed MNL models for discrete response. Journal of Applied Econometrics 15, 447–470.
Moreira, M.J., 2003. A conditional likelihood ratio test for structural models. Econometrica 71, 1027–1048.
Musalem, A., Bradlow, E.T., Raju, J.S., 2009. Bayesian estimation of random-coefficients choice models using aggregate data. Journal of Applied Econometrics 24, 490–516.
Narayanan, S., Nair, H., 2013. Estimating causal installed-base effects: a bias-correction approach. Journal of Marketing Research 50 (1), 70–94.
Neyman, J., 1990. On the application of probability theory to agricultural experiments: essay on principles. Statistical Science 5, 465–480.
Nickell, S., 1981. Biases in dynamic models with fixed effects. Econometrica 49 (6), 1417–1426.
Park, T., Casella, G., 2008. The Bayesian lasso. Journal of the American Statistical Association 103 (482), 681–686.
Petrin, A., Train, K., 2010. Control function corrections for unobserved factors in differentiated product models. Journal of Marketing Research 47 (1), 3–13.
Robert, C.P., Casella, G., 2004. Monte Carlo Statistical Methods, second ed. Springer.
Rossi, P.E., 2014a. Bayesian Non- and Semi-Parametric Methods and Applications. The Econometric and Tinbergen Institutes Lectures. Princeton University Press, Princeton, NJ, USA.
Rossi, P.E., 2014b. Even the rich can make themselves poor: a critical examination of IV methods in marketing applications. Marketing Science 33 (5), 655–672.
Rossi, P.E., Allenby, G.M., McCulloch, R.E., 2005. Bayesian Statistics and Marketing. John Wiley & Sons.
Sahni, N., 2015. Effect of temporal spacing between advertising exposures: evidence from online field experiments. Quantitative Marketing and Economics 13 (3), 203–247.
Scott, S.L., 2014. Multi-Armed Bandit Experiments in the Online Service Economy. Discussion Paper. Google Inc.
Shapiro, B., 2018. Positive spillovers and free riding in advertising of pharmaceuticals: the case of antidepressants. Journal of Political Economy 126 (1).
Stephens-Davidowitz, S.H., Varian, H., Smith, M.D., 2015. Super Returns to Super Bowl Ads? Discussion Paper. Google Inc.
Stock, J.H., Wright, J.H., 2000. GMM with weak identification. Econometrica 68 (5), 1055–1096.


Stock, J.H., Wright, J.H., Yogo, M., 2002. A survey of weak instruments and weak identification in generalized method of moments. Journal of Business and Economic Statistics 20 (4), 518–529.
Wooldridge, J.M., 2010. Econometric Analysis of Cross Section and Panel Data. MIT Press, Cambridge, MA, USA.



Economic foundations of conjoint analysis


Greg M. Allenbyᵃ, Nino Hardtᵃ, Peter E. Rossiᵇ,*
ᵃ Fisher College of Business, Ohio State University, Columbus, OH, United States
ᵇ Anderson School of Management, University of California at Los Angeles, Los Angeles, CA, United States
* Corresponding author: e-mail address: [email protected]

Contents

1 Introduction
2 Conjoint analysis
  2.1 Discrete choices
  2.2 Volumetric choices
  2.3 Computing expected demand
  2.4 Heterogeneity
  2.5 Market-level predictions
  2.6 Indirect utility function
3 Measures of economic value
  3.1 Willingness to pay (WTP)
    3.1.1 WTP for discrete choice
    3.1.2 WTP for volumetric choice
  3.2 Willingness to buy (WTB)
    3.2.1 WTB for discrete choice
    3.2.2 WTB for volumetric choice
  3.3 Economic price premium (EPP)
4 Considerations in conjoint study design
  4.1 Demographic and screening questions
  4.2 Behavioral correlates
  4.3 Establishing representativeness
  4.4 Glossary
  4.5 Choice tasks
  4.6 Timing data
  4.7 Sample size
5 Practices that compromise statistical and economic validity
  5.1 Statistical validity
    5.1.1 Consistency
    5.1.2 Using improper procedures to impose constraints on partworths
  5.2 Economic validity
    5.2.1 Non-economic conjoint specifications
    5.2.2 Self-explicated conjoint
    5.2.3 Comparing raw part-worths across respondents

Handbook of the Economics of Marketing, Volume 1, ISSN 2452-2619, https://doi.org/10.1016/bs.hem.2019.04.002
Copyright © 2019 Elsevier B.V. All rights reserved.



CHAPTER 3 Economic foundations of conjoint analysis

    5.2.4 Combining conjoint with other data
6 Comparing conjoint and transaction data
  6.1 Preference estimates
  6.2 Marketplace predictions
  6.3 Comparison of willingness-to-pay (WTP)
7 Concluding remarks
Technical appendix: Computing expected demand for volumetric conjoint
References

1 Introduction

Conjoint analysis is a survey-based method used in marketing and economics to estimate demand in situations where products can be represented by a collection of features and characteristics. An estimated 14,000 conjoint studies are conducted yearly by firms with the goal of valuing product features and predicting the effects of changes in formulation, price, advertising, or method of distribution (Orme, 2010). Conjoint analysis is often the only practical solution to the problem of predicting demand for new products or for new features of products that are not present in the marketplace.

The economic foundations of conjoint analysis can be traced to seminal papers on utility measurement in economics (Becker et al., 1964) and mathematical psychology (Luce and Tukey, 1964). Conjoint analysis was introduced to the marketing literature by Green and Rao (1971) as an approach for taking ranked input data and estimating utility part-worths for product attributes and their levels (e.g., amount of computer memory). The use of dummy variables to represent the utility of attribute levels provided a flexible approach to measuring preferences without imposing unnecessary assumptions on the form of the utility function. Later, nonparametric and semiparametric models of choice were developed to avoid incorrect assumptions about the distribution of the error term (Matzkin, 1991, 1993). This earlier work does not lend itself to models of heterogeneity that employ a random-effect specification across respondents (Allenby and Rossi, 1998), which have become popular in marketing.

Virtually all conjoint studies done today are in the form of a "choice-based" conjoint, in which survey respondents are offered the choice between sets of products represented by combinations of well-defined product attributes or features. For many durable goods, the assumption of mutually exclusive unit demand may be appropriate. In these cases, respondents can only choose one of the product offerings (or the "outside" alternative) with unit demand. In other situations, it may be more reasonable to allow respondents to choose a subset of products and to consume continuous quantities. This form of conjoint analysis is called "volumetric" conjoint and is not practiced extensively, owing to the lack of software for the analysis of volumetric conjoint data. As a companion to this chapter, we hope to promote the use of volumetric conjoint designs by providing general-purpose software in an R package.

Both standard and volumetric conjoint analysis can best be viewed as a simulation of market conditions set up via an experimental design. As such, choice-based conjoint (if executed well) is closer to revealed than to stated preference. Virtually


all researchers in marketing accept the premise that choice-based conjoint studies offer better recovery of consumer preferences than a pure stated preference method in which direct elicitation of preferences is attempted. However, there is remarkably little research in the conjoint literature that attempts to compare preferences estimated from choice-based conjoint with preferences estimated from consumer panel observational data (cf. Louviere and Hensher, 1983).

Even with existing products for which marketplace data are observed, there are many situations in which it is not possible to identify consumer preferences. In many markets, there is very little price variation (at least in the short run, where preferences might be assumed to be stable), and the set of existing products represents only a very sparse set of points in the product characteristic space. In these situations, conjoint has a natural appeal. In still other situations, only aggregate demand data are available. The considerable influence of the so-called BLP (Berry et al., 1995) approach notwithstanding, many recognize that it is very difficult to estimate both preferences and the distribution of preferences over consumers from aggregate data alone.

Econometricians have focused considerable attention on the problem of "endogeneity" in demand estimation. With unobservable product characteristics or other drivers of demand, not all price variation may be usable for preference inference. In these situations, econometricians advocate a variety of instrumental variable and other techniques that further restrict the usable portion of price variation. In contrast, conjoint data has all product characteristics well-specified, and the levels of price and characteristics are chosen by experimental design. Therefore, there is no "endogeneity" problem in conjoint analysis: all variation in both product attributes and price is exogenous and usable in preference estimation.

This is both a great virtue and a limitation of conjoint analysis. Conjoint analysis avoids the problems that plague valid demand inference with observational data, but at the cost of requiring that all product attributes be well-specified and describable to survey respondents in a meaningful way. Conjoint researchers have shown great inventiveness in describing product features but always face the limitation that the analysis is limited to the features included. Respondents are typically instructed to assume that all unspecified features are constant across the choice alternatives presented in the conjoint survey. In some situations, this assumption can seem strained.

It should be emphasized that conjoint analysis can only estimate demand. Conjoint survey data, alone, cannot be used to compute market equilibrium outcomes such as market prices or equilibrium product positioning in characteristics space. Clearly, supply assumptions and cost information must be added to conjoint data in order to compute equilibrium quantities. Conjoint practitioners have chosen the unfortunate term "market simulation" to describe demand predictions based on conjoint analyses. These exercises are not simulations of any market outcome and must be understood in light of their limitations, as we discuss below.

The purpose of this chapter in the Handbook of the Economics of Marketing is to provide an introduction to the economic foundations of modern conjoint analysis. We begin in Section 2 with a discussion of the economic justification for conjoint analysis, examining economic models of choice that can be adapted for conjoint analysis.




Section 3 discusses economic measures of value that can be derived from a conjoint study. Section 4 discusses survey requirements for conducting valid conjoint analysis, covering topics such as the use of screening questions and the task of selecting and describing product attributes. Section 5 considers aspects of current practice in conjoint analysis that are not supported by economic theory and should be avoided. Section 6 provides evidence that demand data and conjoint data provide similar estimates of feature value. Section 7 offers concluding remarks.

2 Conjoint analysis

Conducting a valid conjoint analysis requires both a valid procedure for collecting conjoint data and a valid model for analysis. Conjoint data is not the same as stated preference data, in that respondents in conjoint surveys are not simply declaring what they know about their utility or, equivalently, their willingness to pay for products and their features. Instead, respondents react to hypothetical purchase scenarios involving products, which may not currently exist in the marketplace, that are specifically described and priced. Respondents provide expected demand quantities across a variety of buying scenarios in which product attributes and prices change. Well-executed conjoint surveys can approximate the revealed preference data available from sources like shopper panels by instructing respondents to think about a specific buying context and to focus on a subset of product alternatives.

The economic foundation of conjoint analysis rests on using valid economic models for analyzing these data. In this chapter we discuss two models based on direct utility maximization: the discrete choice model based on extreme value errors that leads to the standard logit specification, and a volumetric demand model that allows for both corner and interior solutions as well as continuous demand quantities. In both specifications, the coefficients associated with the marginal utility of an offering are parameterized in terms of product characteristics.

2.1 Discrete choices

A simple model of choice involves respondents selecting just one choice alternative. Examples include demand for durable goods, such as a car, where only one good is demanded. Models for the selection of one good are referred to as discrete choice models (Manski et al., 1981) and can be motivated by a linear¹ direct utility function:

$$u(x, z) = \sum_k \psi_k x_k + \psi_z z$$

¹ The linearity assumption is only an approximation, in that we expect preferences to exhibit satiation. However, since demand quantities are restricted to unit demand, satiation is not a consideration.


where x_k denotes the demand quantity, constrained to be either zero or one, for choice alternative k; z is an outside good; and ψ_k is the marginal utility of consuming the kth good. Respondents are assumed to make choices to maximize their utility u(x, z) subject to the budget constraint:

$$\sum_k p_k x_k + z = E \tag{2}$$

where the price of the outside good is set to $1.00. The outside good represents the decision to allocate some or all of the budgetary allotment E outside of the k goods in the choice task, including the option to delay a purchase. The budgetary allotment is the respondent's upper limit of expenditure for the goods under study and is not meant to indicate his or her annual income.

An additive error term is introduced into the model specification for each good to allow for factors affecting choice that are known to the respondent but not observed by the researcher. Assuming distributional support for the error term on the real line allows the model to rationalize any pattern of respondent choices: there are some error realizations, however unlikely, that lead to an observed choice being utility maximizing. The utility for selecting good j is therefore equal to:

$$u(x_j = 1, z = E - p_j) = \psi_j + \psi_z (E - p_j) + \varepsilon_j \tag{3}$$

and the utility for selecting the 'no-choice' option is:

$$u(x = 0, z = E) = \psi_z E + \varepsilon_z$$

Choice probabilities are obtained from the utility expressions by integrating over the regions of the error space that coincide with a particular choice having highest utility. Assuming extreme value EV(0,1)² errors leads to the familiar logit choice probability:

$$\begin{aligned}
\Pr(j) &= \Pr\left(\psi_j + \psi_z(E - p_j) + \varepsilon_j > \psi_k + \psi_z(E - p_k) + \varepsilon_k \ \text{for all } k \neq j\right) \\
&= \frac{\exp\left(\psi_j + \psi_z(E - p_j)\right)}{\exp(\psi_z E) + \sum_k \exp\left(\psi_k + \psi_z(E - p_k)\right)} \\
&= \frac{\exp\left(\psi_j - \psi_z p_j\right)}{1 + \sum_k \exp\left(\psi_k - \psi_z p_k\right)}
\end{aligned} \tag{5}$$

The discrete choice model is used extensively in conjoint applications because of its computational simplicity. Observed demand is restricted to two points {0, 1} for each of the inside goods (x), and the constant marginal utility assumption leads to the

² The cdf of the EV(0,1) error term is F(x) = exp[−exp(−x)], with location parameter equal to zero and scale parameter equal to one.




budgetary allotment dropping out of the expression for the choice probability.³ The marginal utility for the outside good, ψ_z, is interpreted as a price coefficient and measures the disutility of paying a higher price. Choice alternatives with prices larger than the budgetary allotment are screened out of the logit choice probability; that is, only goods for which p_k ≤ E are included in the logit probability expression (see Pachali et al., 2017). Finally, expected demand is equal to the choice probability, which is convenient for making demand predictions.

Conjoint analysis uses the discrete choice model by assuming marginal utility, ψ_j, is a linear function of brand attributes:

$$\psi_j = a_j'\beta \tag{6}$$

where a_j denotes the attributes of good j. The attributes are coded either with a dummy-variable specification, in which one of the attribute levels is selected as the null level of the attribute, or with effects coding, which constrains the coefficients of an attribute to sum to zero (Hardy, 1993). For either specification, conjoint analysis measures the value of changes among the levels of an attribute. Since the valuations of the attribute levels are jointly determined through Eq. (6), the marginal utilities for a respondent are comparable across the attributes and features included in the analysis. The marginal utility of a product feature can therefore be compared to the utility of changes in the levels of other attributes, including price.
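As an illustration, the choice probability in Eq. (5) combined with the attribute parameterization in Eq. (6) can be computed directly. This is a minimal sketch with made-up attribute levels and part-worths, not code from the chapter:

```python
import numpy as np

def choice_probs(A, beta, psi_z, prices):
    """Logit probabilities of Eq. (5), with psi_j = a_j' beta as in Eq. (6).

    A: K x p dummy-coded attribute matrix; beta: part-worths;
    psi_z: outside-good (price) coefficient; prices: K prices.
    Returns [Pr(no-choice), Pr(good 1), ..., Pr(good K)].
    """
    psi = A @ beta                    # Eq. (6)
    v = psi - psi_z * prices          # the allotment E drops out of Eq. (5)
    expv = np.exp(np.concatenate(([0.0], v)))  # no-choice utility normalized to 0
    return expv / expv.sum()

# Hypothetical example: two goods that differ in one binary feature
A = np.array([[1.0], [0.0]])
p = choice_probs(A, beta=np.array([1.0]), psi_z=0.8,
                 prices=np.array([1.5, 1.0]))
```

As Eq. (5) prescribes, the probability of the feature-carrying good rises with its part-worth and falls with its price.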

2.2 Volumetric choices

A more general model for conjoint analysis is one that introduces non-linearities into the utility function (Allenby et al., 2017):

$$u(x, z) = \sum_k \frac{\psi_k}{\gamma} \ln(\gamma x_k + 1) + \ln(z) \tag{7}$$

where γ is a parameter that governs the rate of satiation of the good. The marginal utilities for the inside and outside goods are:

$$u_j = \frac{\partial u(x,z)}{\partial x_j} = \frac{\psi_j}{\gamma x_j + 1}, \qquad u_z = \frac{\partial u(x,z)}{\partial z} = \frac{1}{z}$$

The marginal utility of the good is equal to ψ_j when x_j = 0 and decreases as demand increases (x_j > 0) and as the satiation parameter (γ) increases. The general solution to maximizing the utility in (7) subject to the budget constraint in (2) is to employ the

³ Constant marginal utility, however, is not required to obtain a discrete choice model, as shown by Allenby and Rossi (1991), wherein the budgetary allotment does not drop out.


Kuhn-Tucker (KT) optimality conditions:

$$\text{if } x_k > 0 \text{ and } x_j > 0, \text{ then } \lambda = \frac{u_k}{p_k} = \frac{u_j}{p_j} \quad \text{for all } k \text{ and } j$$
$$\text{if } x_k > 0 \text{ and } x_j = 0, \text{ then } \lambda = \frac{u_k}{p_k} > \frac{u_j}{p_j} \quad \text{for all } k \text{ and } j$$

Assuming that ψ_j = exp[a_j'β + ε_j] and solving for ε_j leads to the following expression for the KT conditions:

$$\varepsilon_j = g_j \quad \text{if } x_j > 0 \tag{10}$$
$$\varepsilon_j < g_j \quad \text{if } x_j = 0 \tag{11}$$

where

$$g_j = -a_j'\beta + \ln(\gamma x_j + 1) + \ln\left(\frac{p_j}{E - \mathbf{p}'\mathbf{x}}\right)$$

The assumption of i.i.d. extreme-value errors, i.e., EV(0, σ),⁴ results in a closed-form expression for the probability that R of N goods are chosen. The error scale (σ) is identified in this model because price enters the specification without a separate price coefficient. We assume there are N choice alternatives and R items are chosen:

$$x_1, x_2, \ldots, x_R > 0, \qquad x_{R+1}, x_{R+2}, \ldots, x_N = 0.$$

The likelihood ℓ(θ) of the model parameters is proportional to the probability of observing n₁ chosen goods (n₁ = 1, ..., R) and n₂ goods with zero demand (n₂ = R + 1, ..., N). The contribution to the likelihood of the chosen goods is in the form of a probability density because of the equality condition in (10), while the goods not chosen contribute as a probability mass because of the inequality condition in (11). We obtain the likelihood by evaluating the joint density of the model errors at g_j for the chosen goods and integrating the joint density up to g_i for the goods that are not chosen:

$$\begin{aligned}
\ell(\theta) &\propto p(x_{n_1} > 0, x_{n_2} = 0 \mid \theta) \\
&= |J_R| \int_{-\infty}^{g_{R+1}} \cdots \int_{-\infty}^{g_N} f(g_1, \ldots, g_R, \varepsilon_{R+1}, \ldots, \varepsilon_N)\, d\varepsilon_{R+1} \cdots d\varepsilon_N \\
&= |J_R| \left\{\prod_{j=1}^{R} \frac{\exp(-g_j/\sigma)}{\sigma} \exp\!\left(-e^{-g_j/\sigma}\right)\right\}\left\{\prod_{i=R+1}^{N} \exp\!\left(-e^{-g_i/\sigma}\right)\right\} \\
&= |J_R| \left\{\prod_{j=1}^{R} \frac{\exp(-g_j/\sigma)}{\sigma}\right\} \exp\!\left(-\sum_{i=1}^{N} \exp(-g_i/\sigma)\right)
\end{aligned}$$

⁴ F(x) = exp[−e^{−x/σ}].





where f(·) is the joint density for ε, and |J_R| is the Jacobian of the transformation from the random-utility errors (ε) to the likelihood of the observed data (x), i.e., |J_R| = |∂ε_i/∂x_j|. For this model, the Jacobian is equal to:

$$|J_R| = \left[\prod_{k=1}^{R} \frac{\gamma}{\gamma x_k + 1}\right]\left[\sum_{k=1}^{R} \frac{\gamma x_k + 1}{\gamma}\cdot\frac{p_k}{E - \mathbf{p}'\mathbf{x}} + 1\right]$$

The expression for the likelihood of the observed demand vector x_t is seen to be the product of R "logit" expressions multiplied by the Jacobian, where the purchased quantity x_j is part of the value (g_j) of the choice alternative. To obtain the standard logit model of discrete choice, we set R = 1, set the scale of the error term to one (σ = 1), and allow the expression for g_j to include a price coefficient (i.e., g_j = −a_j'β − ψ_z p_j). The Jacobian equals one for a discrete choice model because demand (x) enters the KT conditions only through the corner solutions, which correspond to mass points. Variation in the specification of the choice model's utility function and budget constraint leads to different values of g_j and the Jacobian |J_R|, but not to a different general form of the likelihood, i.e.,

$$p(x|\theta) = |J_R| \left\{\prod_{j=1}^{R} f(g_j)\right\}\left\{\prod_{i=R+1}^{N} F(g_i)\right\} \tag{13}$$

The analysis of conjoint data arising from either discrete or volumetric choices proceeds by relating the utility parameters (ψ_j) to product attributes (a_j) as in Eq. (6). It is also possible to specify non-linear mappings from attributes to aspects of the utility function as discussed, for example, in Kim et al. (2016).
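The pieces above (g_j, the Jacobian, and Eq. (13)) can be assembled into a log-likelihood evaluation for a single observed demand vector. The sketch below assumes EV(0, σ) errors and the utility in Eq. (7); all numerical inputs are hypothetical:

```python
import numpy as np

def log_likelihood(x, A, beta, gamma, sigma, prices, E):
    """Log of Eq. (13) for one observation of the volumetric model."""
    z = E - prices @ x                               # outside-good expenditure
    g = -(A @ beta) + np.log(gamma * x + 1) + np.log(prices / z)
    chosen = x > 0
    # Jacobian of the transformation from errors to observed demand
    jac = np.prod(gamma / (gamma * x[chosen] + 1)) * (
        np.sum((gamma * x[chosen] + 1) / gamma * prices[chosen] / z) + 1)
    ll = np.log(jac)
    ll += np.sum(-g[chosen] / sigma - np.log(sigma))   # density terms for f(g_j)
    ll += -np.sum(np.exp(-g / sigma))                  # shared exp(-sum e^{-g/sigma})
    return ll

# Hypothetical data: good 1 bought (2 units), good 2 not bought
x = np.array([2.0, 0.0])
A = np.array([[1.0, 0.0], [0.0, 1.0]])
ll = log_likelihood(x, A, beta=np.array([0.5, 0.2]), gamma=1.0,
                    sigma=1.0, prices=np.array([1.0, 1.5]), E=10.0)
```

Because log f(g_j) = −g_j/σ − log σ − e^{−g_j/σ} and log F(g_i) = −e^{−g_i/σ}, the e^{−g/σ} terms for all N goods collapse into the single sum on the last line of the function.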

2.3 Computing expected demand

Demand (x_ht) for subject h at time t is a function of the parameters of the demand model (θ_h), a realization of the vector of error terms (ε_ht), the characteristics of the available choice set (A_t), and prices (p_t). We need expected demand quantities in order to derive various measures of economic value. For the discrete choice model, expected demand is expressed as a choice probability, while expected demand for the volumetric demand model does not have a closed-form solution. We therefore introduce D, the demand function for one realization of the model parameters θ_h and one realization of the error term ε_ht, given the characteristics of the set of alternatives A_t and corresponding prices p_t:

$$x_{ht} = D(\theta_h, \varepsilon_{ht} \mid A_t, p_t)$$

Here, θ_h = {β_h, ψ_{h,z} = β_{h,p}} for the discrete choice model, and θ_h = {β_h, γ_h, E_h, σ_h} for the volumetric model. Expected demand is obtained by integrating out the error


term and the posterior distribution of the model parameters θ_h. First, consider integrating over the error term:

$$E(x_{ht}|\theta_h) = \int_{\varepsilon_{ht}} D(\theta_h, \varepsilon_{ht} \mid A_t, p_t)\, p(\varepsilon_{ht})\, d\varepsilon_{ht} \tag{15}$$

For volumetric choices, there is no closed form solution for integrating out εht and numerical methods are used to obtain expected demand by simulating a large number of realizations of ε and computing D for each. Expected demand is the average of D values. One way to compute D is to use a general-purpose algorithm for maximization, such as constrOptim in R. A more efficient algorithm is provided in the appendix.
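A Monte Carlo sketch of Eq. (15) is given below. Instead of a general-purpose optimizer, the inner maximization exploits the KT conditions of Section 2.2, which for utility (7) admit a closed-form solution once the set of purchased goods is known; all function names and parameter values are illustrative, and this is only a simple stand-in for the more efficient algorithm described in the appendix:

```python
import numpy as np

def demand_D(psi, gamma, prices, E):
    # Add goods in decreasing order of psi/p; for a candidate set, the KT
    # conditions give z = (E + sum(p)/gamma) / (1 + sum(psi)/gamma) and
    # x_j = (psi_j * z / p_j - 1) / gamma for the purchased goods.
    order = np.argsort(-psi / prices)
    chosen, z = [], E
    for j in order:
        trial = chosen + [j]
        z_trial = (E + prices[trial].sum() / gamma) / (1 + psi[trial].sum() / gamma)
        if psi[j] / prices[j] > 1.0 / z_trial:   # good j is worth buying
            chosen, z = trial, z_trial
        else:
            break
    x = np.zeros(len(psi))
    for j in chosen:
        x[j] = (psi[j] * z / prices[j] - 1.0) / gamma
    return x

def expected_demand(A, beta, gamma, prices, E, sigma=1.0, n_draws=200, seed=0):
    # Eq. (15): average D over simulated EV(0, sigma) error realizations
    rng = np.random.default_rng(seed)
    total = np.zeros(len(prices))
    for _ in range(n_draws):
        eps = rng.gumbel(0.0, sigma, size=len(prices))
        psi = np.exp(A @ beta + eps)             # psi_j = exp(a_j' beta + eps_j)
        total += demand_D(psi, gamma, prices, E)
    return total / n_draws

# Hypothetical example: two goods described by one dummy attribute each
xbar = expected_demand(A=np.eye(2), beta=np.array([0.5, 0.2]), gamma=1.0,
                       prices=np.array([1.0, 2.0]), E=10.0)
```

Each simulated demand vector satisfies the budget constraint by construction, so the averaged demand does as well.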

2.4 Heterogeneity

Data from a conjoint study can be characterized as wide and shallow: wide in the sense that there may be 1000 or more respondents included in the study, and shallow in the sense that each respondent provides, at most, about 8-16 responses to the choice tasks. This type of panel structure is commonly found in marketing applications involving consumer choices. We therefore employ hierarchical models, or a random-effect specification, to deal with the lack of data at the individual respondent level. The lower level of the hierarchical model applies the direct utility model to a specific respondent's choice data, and the upper level of the model incorporates heterogeneity in respondent coefficients.

Respondent heterogeneity can be incorporated into conjoint analysis using a variety of random-effect models. The simplest and most widely used is a Normal model of heterogeneity. Denoting all individual-level parameters θ_h for respondent h, we have:

$$\theta_h \sim \text{Normal}(\bar\theta, \Sigma)$$

where θ_h = {β_h, γ_h, E_h, σ_h} for the volumetric model, and θ_h = {β_h, β_{h,p}} for the discrete choice model. The price coefficient β_{h,p} is referred to as ψ_z in Eq. (5). Estimation of this model is easily done using modern Bayesian Markov chain Monte Carlo methods, as discussed in Rossi et al. (2005). More complicated models of heterogeneity involve specifying more flexible distributions. One option is to specify a mixture of Normal distributions:

$$\theta_h \sim \sum_k \varphi_k\, \text{Normal}(\bar\theta_k, \Sigma_k) \tag{17}$$

(here φ_k are the mixture probabilities) or to add covariates (z) to the mean of the heterogeneity distribution, as in a regression model:

$$\theta_h \sim \text{Normal}(\Delta z_h, \Sigma)$$





The parameters θ̄, Σ, and Δ are referred to as hyper-parameters (τ) because they describe the distribution of other parameters. Covariates in the above expression might include demographic variables or other variables collected as part of the conjoint survey, e.g., variables describing reasons to purchase. Alternatively, one could use a non-parametric distribution of heterogeneity, as discussed in Rossi (2014). The advantage of employing Bayesian estimators for models of heterogeneity is that they provide access to the individual-level parameters θ_h in addition to the hyper-parameters (τ) that describe their distribution. It should be emphasized that the hyper-parameters τ can be shown to be consistently estimated as the sample size, or number of respondents, in the conjoint survey increases. However, the individual-level estimates of θ_h cannot be shown to be consistent because the number of conjoint questions for any one respondent is constrained to be small. Therefore, inferences from conjoint analysis should always be based on the hyper-parameters and not on the individual-level estimates.
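For intuition, respondent-level parameters can be simulated from the mixture-of-Normals heterogeneity distribution in Eq. (17). The component means, covariances, and weights below are made up for illustration:

```python
import numpy as np

def draw_theta(n_resp, means, covs, probs, seed=0):
    """Draw theta_h for n_resp respondents from a mixture of Normals (Eq. (17))."""
    rng = np.random.default_rng(seed)
    comp = rng.choice(len(probs), size=n_resp, p=probs)   # mixture component labels
    return np.array([rng.multivariate_normal(means[c], covs[c]) for c in comp])

# Hypothetical two-component mixture over (feature part-worth, price coefficient)
means = [np.array([1.0, -2.0]), np.array([0.2, -0.5])]
covs = [0.1 * np.eye(2), 0.2 * np.eye(2)]
theta = draw_theta(1000, means, covs, probs=[0.6, 0.4])
```

In a full Bayesian analysis, these component parameters would themselves be posterior draws produced by the MCMC sampler rather than fixed values.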

2.5 Market-level predictions In order to make statements about a population of customers, we need to aggregate the results from the hierarchical model by integrating over the distribution heterogeneity. These statements involve quantities of interest Z, such as expected demand (xt ), or derived quantities based on the demand model such as willingness-to-pay, price elasticities, and associated confidence intervals. We can approximate the posterior distribution of Z by integrating out the hyper-parameters of the distribution of heterogeneity (τ ) and model error (εht ):

p(Z|Data) = ∫τ ∫θh ∫εht p(Z|θh, εht) p(θh|τ) p(τ|Data) p(εht) dεht dθh dτ    (19)



The integration is usually done numerically. Posterior realizations of τ are available when using a Markov chain Monte Carlo estimation algorithm, and can be used to generate individual-level draws of θh . These draws, along with draws of the model error term, can then be used to evaluate posterior estimates of Z. The distribution of evaluations of Z can be viewed as its posterior distribution.
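A minimal sketch of this numerical integration, assuming a normal heterogeneity distribution and using hypothetical arrays standing in for MCMC output; here the quantity of interest Z is the aggregate logit choice share, with the model error ε integrated out analytically by the logit probability:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for MCMC output: R posterior draws of the
# hyper-parameters (mean and covariance factor of the heterogeneity
# distribution) for a K=3 part-worth vector and J=4 alternatives.
R, K, J = 200, 3, 4
tau_mean = rng.normal(size=(R, K))
tau_chol = np.tile(0.3 * np.eye(K), (R, 1, 1))

X = rng.normal(size=(J, K))   # design matrix of the choice alternatives

def posterior_share_draws(n_resp=100):
    """For each posterior draw of tau, simulate respondent-level theta_h,
    compute logit choice probabilities, and average over respondents --
    yielding draws from the posterior distribution of the market share Z."""
    Z = np.empty((R, J))
    for r in range(R):
        theta = tau_mean[r] + rng.normal(size=(n_resp, K)) @ tau_chol[r].T
        v = theta @ X.T                            # deterministic utility
        p = np.exp(v - v.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)          # logit choice probabilities
        Z[r] = p.mean(axis=0)                      # aggregate over respondents
    return Z

Z = posterior_share_draws()
print(Z.mean(axis=0))   # posterior mean market shares (sum to one)
```

The spread of the rows of `Z` across posterior draws gives the posterior uncertainty in the market-level prediction.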

2.6 Indirect utility function
The indirect utility function is the value function of the maximum attainable utility of a utility maximization problem. It is useful to define the indirect utility function for the volumetric demand case, as it is the basis for computing economic measures such as willingness-to-pay. Conditional on the characteristics of the choice alternatives A, parameters θh, and a realization of the vector of error terms εht, the indirect utility function is defined in terms of optimal demand (x∗ht, z∗ht):

V(pt, E|A, θh, εht) = u(x∗ht, z∗ht | pt, θh, εht) = u(D(θh, εht | A, pt))    (20)



Eq. (20) can be evaluated by first determining optimal demand and then computing the corresponding value of the direct utility function (Eq. (7)). There is no closed-form solution for the value function since there is no closed-form solution to D. Moreover, the error term εht and individual-level parameters (θh) need to be integrated out to obtain an expression for expected indirect utility. We do this by using the structure of the heterogeneity distribution, where uncertainty in the hyper-parameters (τ) induces variation in the distribution of individual-level parameters (θh):

V̄(pt, E|At) = ∫τ ∫θh ∫εht u(D(θh, εht | At, pt)) p(θh|τ) p(τ) p(εht) dεht dθh dτ    (21)



3 Measures of economic value
The economic value of a product feature requires an assessment of consumer welfare and its effect on marketplace demand and prices. Measuring the value of a product feature often begins by expressing part-worths (βh) in monetary terms by dividing by the price coefficient, i.e., βh/βp. This monetization of utility is useful when summarizing results across respondents because utility, by itself, is not comparable across consumers. A monetary conversion of the part-worth of a product feature, however, is not sufficient for measuring economic value because it does not consider the effects of competitors in the market. A criticism of the simple monetization of utility is that consumers never have to pay the maximum they are willing to pay in the marketplace. A billionaire may be willing to pay a large amount of money to attend a sporting event, but does not end up doing so because of the availability of tickets sold by those with lower WTP. Firms in a competitive market are therefore not able to capture all of the economic surplus of consumers. Consumers can switch to other providers and settle for alternatives that are less attractive but still worth buying. Below we investigate three measures of economic value that incorporate the effects of competing products. In theory, WTP can be the monetary value of an entire option or of a feature change (or improvement) to a given product (Trajtenberg, 1989). As discussed below, WTP for an entire choice option involves the addition of a new model error term for the new product, while a new error term is not added when assessing the effects of a feature change. The perspective in this chapter focuses on the feature change/improvement.

3.1 Willingness to pay (WTP)
WTP is a demand-based estimate of monetary value. The simple monetization of utility to a dollar measure (e.g., βh/βhp) does not correspond to a measure of consumer welfare unless there is only one good in the market and consumers are forced to select it. The presence of alternatives with non-zero choice probabilities means




that the maximum attainable utility from a transaction is affected by more than one good. Increasing the number of available choice alternatives increases the expected maximum utility a consumer can derive from a marketplace transaction, and ignoring the effect of competitive products leads to a misstatement of consumer welfare. Any measurement of the economic value of a product feature cannot be made in isolation of the set of available alternatives because it is not known, a priori, which product will be chosen (Lancsar and Savage, 2004). The evaluation of consumer welfare needs to account for private information held by the consumer at the time of choice. This information is represented as the error term in the model, whose value is not realized until the respondent is confronted with a choice. Consumer welfare is determined by the maximum attainable utility of a transaction and cannot be based on the likelihood, or probability, that an alternative is utility maximizing. Choice probabilities do not indicate the value, or utility, arising from a transaction. Welfare should be measured in terms of the expected maximum attainable utility, E[max(u(·))], where the maximization operator is taken over the choice alternatives and the expectation operator is taken over error realizations. The effect of competitive offers can be taken into account by considering respondent choices among all the choice alternatives in the conjoint study. Let A be a J × K matrix that defines the set of products in a choice set, where J is the number of choice alternatives and K is the number of product features under study. The rows of the choice set matrix, aj, indicate the features of the jth product in the choice set. Similarly, let A∗ be a choice matrix similar to A except that one of its rows is different, indicating a different set of features for one of the products.
Typically, just one element in the row differs when comparing A to A∗ because the WTP measure focuses on what respondents are willing to pay for an enhanced version of one of the attributes. The maximum attainable utility for a given choice set is defined in terms of the indirect utility function:

V(p, E|A) = max u(x, z|A)  subject to  p′x ≤ E    (22)


WTP is defined as the compensating value required to make the utility derived from the feature-poor set, A, equal to the utility derived from the feature-rich set, A∗:

V(p, E + WTP|A) = V(p, E|A∗)    (23)


3.1.1 WTP for discrete choice
The expected maximum attainable monetized utility, or consumer welfare, for a logit demand model can be shown to be equal to (Small and Rosen, 1981; Manski et al., 1981; Allenby et al., 2014b):

W(A, p, θh = {βh, βhp}) = E + (1/βhp) ln[ ∑j=1..J exp(β′h aj − βhp pj) ]    (24)

This expression measures the benefit of spending some portion of the budgetary allotment, E, on one of the choice alternatives. As the attractiveness of the inside goods


declines, the consumer is more likely to select the outside good and save his or her money. Thus, the lower bound of the welfare of a consumer confronted with an exchange is their budgetary allotment E. The improvement in welfare provided by a feature enhancement can be obtained as the difference of the maximum attainable utility of the enriched and original sets of alternatives:

WTP = (1/βhp) [ ln ∑j=1..J exp(β′h a∗j − βhp pj) − ln ∑j=1..J exp(β′h aj − βhp pj) ]    (25)
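For the logit model, the welfare expression in Eq. (24) and the WTP difference in Eq. (25) are available in closed form. The sketch below computes WTP for a single hypothetical respondent; the part-worths, prices, and feature designs are illustrative values, not taken from the chapter:

```python
import numpy as np

def consumer_welfare(beta, beta_p, A, p, E=0.0):
    """Expected maximum monetized utility, Eq. (24), for one respondent
    with part-worths beta and (positive) price coefficient beta_p."""
    v = A @ beta - beta_p * p              # deterministic utility of each option
    return E + np.log(np.exp(v).sum()) / beta_p

def wtp(beta, beta_p, A, A_star, p):
    """WTP of Eq. (25): welfare of the feature-rich set A_star minus the
    welfare of the feature-poor set A (the budget E cancels)."""
    return consumer_welfare(beta, beta_p, A_star, p) \
         - consumer_welfare(beta, beta_p, A, p)

# Hypothetical example: 3 products, 2 binary features; enhance product 0.
beta, beta_p = np.array([1.0, 0.5]), 2.0
p = np.array([3.0, 3.5, 2.5])
A = np.array([[1, 0], [0, 1], [0, 0]], dtype=float)
A_star = A.copy(); A_star[0, 1] = 1.0    # add the second feature to product 0
print(round(wtp(beta, beta_p, A, A_star, p), 4))
```

Because the enhanced set stochastically dominates the original one, the computed WTP is strictly positive but smaller than the simple monetized part-worth β2/βp, reflecting the competing alternatives.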

Ofek and Srinivasan (2002) define an alternative measure, which they refer to as the market value of an attribute improvement (MVAI), based on the change in price needed to restore the aggregate choice share of a good in a feature-rich (A∗) relative to a feature-poor (A) choice set. Their calculation, however, does not monetize the change in consumer welfare, or utility, gained by the feature enhancement because choice probabilities do not provide a measure of either. Moreover, the MVAI measure is not a “market” measure of value as there is no market equilibrium calculation. MVAI also only applies to continuous attributes. The gain in utility is a function of both the observable product features and the unobserved error realization, which are jointly maximized in a marketplace transaction. This requires consideration of the alternative choices available to a consumer because, prior to observing the choice, the values of the error realizations are not known. Our proposed WTP measure monetizes the expected improvement in the maximized utility that comes from feature-enhanced choices.

3.1.2 WTP for volumetric choice
For the volumetric choice model, WTP is the additional budgetary allotment that is necessary for restoring the indirect utility of the feature-rich set given the feature-poor set. Conditional on realizations of εht and θh, WTP can be obtained by numerically solving Eq. (26) for WTP:

V(p, E + WTP|A, θh, εht) − V(p, E|A∗, θh, εht) = 0    (26)

Computation of the indirect utility V(·) was described in Section 2.6. In the volumetric model, subjects are compensated for the loss in utility in a demand space that extends beyond unit demand. WTP therefore depends on purchase quantities, and is expected to be larger when the affected products have larger purchase quantities.
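Because Eq. (26) has no closed form, WTP must be found by root-finding. The sketch below substitutes a toy stand-in for the indirect utility (a concave function of the budget shifted by a hypothetical "quality" term) purely to show the numerical step; in an actual application V would be evaluated by solving for optimal volumetric demand as in Section 2.6:

```python
import math

# Toy stand-in for the indirect utility V(p, E | A): increasing and concave
# in the budget E, shifted up by the quality of the available choice set.
# (In practice V is computed by solving the demand problem numerically.)
def V(E, quality):
    return quality + math.log(E)

E, q_poor, q_rich = 100.0, 1.0, 1.25   # hypothetical budget and qualities

def solve_wtp(lo=0.0, hi=1e4, tol=1e-10):
    """Solve V(E + WTP | A) - V(E | A*) = 0 for WTP (Eq. (26)) by bisection."""
    f = lambda w: V(E + w, q_poor) - V(E, q_rich)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

wtp = solve_wtp()
print(round(wtp, 2))   # budget compensation restoring the feature-rich utility
```

In the full model this root-find would be repeated over draws of εht and θh, and the resulting WTP values averaged.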

3.2 Willingness to buy (WTB)
WTB is an alternative measure of economic value based on the expected increase in demand for an enhanced offering. It is similar to the measure proposed by Ofek and Srinivasan (2002) but does not attempt to monetize the change in share due to a feature enhancement. Instead, economic value is determined by calculating the expected




increase in revenue or profit due to a feature enhancement, using WTB as an input to that calculation. The increase in demand due to the improved feature is calculated for one offering, holding fixed all of the other offerings in the market.

3.2.1 WTB for discrete choice
In a discrete choice model, WTB is defined in terms of the change in market share that can be achieved by moving from a diminished to an improved feature set:

WTB = MS(j|p, A∗) − MS(j|p, A)    (27)


It is computed as the increase in choice probability for each respondent in the survey, which is then averaged to produce an estimate of the change in aggregate market share. As mentioned in Section 2.5, statements about market-level behavior require integration over uncertainty in the model hyper-parameters (τ), the resulting distribution of individual-level coefficients (θh), and the model error (ε).
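A sketch of this calculation for the logit model, using a set of simulated respondents in place of posterior draws of θh; all parameter values and the feature designs below are hypothetical:

```python
import numpy as np

def logit_share(beta, beta_p, A, p):
    """Logit choice probabilities for one respondent over the rows of A."""
    v = A @ beta - beta_p * p
    e = np.exp(v - v.max())
    return e / e.sum()

def wtb(thetas, A, A_star, p, j=0):
    """WTB of Eq. (27): increase in the choice probability of product j
    under the enhanced set A_star, averaged over respondents."""
    diffs = [logit_share(b, bp, A_star, p)[j] - logit_share(b, bp, A, p)[j]
             for b, bp in thetas]
    return float(np.mean(diffs))

# Hypothetical respondents: (part-worth vector, price coefficient) pairs.
rng = np.random.default_rng(2)
thetas = [(rng.normal([1.0, 0.5], 0.3), 2.0) for _ in range(200)]
p = np.array([3.0, 3.5, 2.5])
A = np.array([[1, 0], [0, 1], [0, 0]], dtype=float)
A_star = A.copy(); A_star[0, 1] = 1.0    # enhance product 0
print(round(wtb(thetas, A, A_star, p), 3))   # expected gain in share of product 0
```

The resulting share change can then feed a revenue or profit calculation at given prices, as described above.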

3.2.2 WTB for volumetric choice
We can express WTB as a change in absolute sales or as a change in market share. Since there is no closed form for demand, D, we suggest first simulating a set of realizations of εht, then computing demand for the initial feature set A and the changed feature set A∗ conditional on the same set of εht and θh realizations. As shown in Section 2.3, we do this by numerically solving for D and evaluating the change in demand for the two feature sets:

WTBsales = ∑ε [ Dj(θ, ε|A∗, p) − Dj(θ, ε|A, p) ]    (28)

WTBshare = ∑ε [ Dj(θ, ε|A∗, p) ⁄ ∑j′ Dj′(θ, ε|A∗, p) ] − ∑ε [ Dj(θ, ε|A, p) ⁄ ∑j′ Dj′(θ, ε|A, p) ]    (29)

3.3 Economic price premium (EPP)
The EPP is a measure of feature value that allows for competitive price reaction to a feature enhancement or introduction. An equilibrium is defined as a set of prices and accompanying market shares which satisfy the conditions specified by a particular equilibrium concept. In our discussion below, we employ a standard Nash equilibrium concept for differentiated products using a discrete choice model of demand, not the volumetric version of conjoint analysis. The calculation of an equilibrium price premium requires additional assumptions beyond those employed in a traditional, discrete choice conjoint study:
• The demand specification is a standard heterogeneous logit that is linear in the attributes, including prices.
• Marginal cost is constant for each product.
• Firms are single-product firms, i.e., each firm has just one offering.
• Firms cannot enter or exit the market after the product enhancement takes place.
• Firms engage in static Nash price competition.


The first assumption can easily be replaced by any valid demand system, including the volumetric demand model discussed earlier. One can also consider multi-product firms. The economic value of a product feature enhancement to a firm is the incremental profit that it will generate:

Δπ = π(p^eq, m^eq | A∗) − π(p^eq, m^eq | A)    (30)


where π denotes the profits associated with the equilibrium prices and shares given a set of competing products defined by the attribute matrix A. The EPP is the increase in the profit-maximizing price of an enhanced product given assumptions about costs and competitive offerings. Each product provider, one at a time, determines its optimal price given the prices of the other products and the cost assumptions. This optimization is repeated for each provider until an equilibrium is reached where it is not in anyone's interest to change prices any further. Equilibrium prices are computed for the offerings with features set at the lower level for all the attributes (A), and then recomputed with the focal feature set to its higher level (A∗). EPP introduces price competition into the valuation of product features, assuming static Nash price competition. In a discrete choice setting, firm profit is

π(pj|p−j) = MS(j|pj, p−j, A)(pj − cj)    (31)


where pj is the price of good j, p−j are the prices of the other goods, and cj is the marginal cost of good j. The first-order condition of firm j is:

∂π/∂pj = [∂MS(j|pj, p−j, A)/∂pj](pj − cj) + MS(j|pj, p−j, A)    (32)


The Nash equilibrium is a root of the system of equations defined by the first-order conditions for all J firms. If we define

h(p) = [h1(p), h2(p), …, hJ(p)]′,  where hj(p) = ∂πj/∂pj,    (33)

then the equilibrium price vector p∗ is a zero of the function h(p).
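For a homogeneous logit demand (a simplification of the heterogeneous model assumed in the text), the first-order conditions of single-product firms reduce to the familiar markup rule pj = cj + 1/(βp(1 − sj)), and the zero of h(p) can be found by fixed-point iteration. A sketch with hypothetical mean utilities and marginal costs:

```python
import numpy as np

def logit_share(delta, beta_p, p):
    """Homogeneous logit shares with the outside good normalized to zero."""
    e = np.exp(delta - beta_p * p)
    return e / (1.0 + e.sum())

def nash_prices(delta, beta_p, c, tol=1e-10, max_iter=1000):
    """Static Nash prices for single-product logit firms, found by iterating
    the first-order condition p_j = c_j + 1/(beta_p * (1 - s_j))."""
    p = c + 1.0 / beta_p                      # starting guess
    for _ in range(max_iter):
        s = logit_share(delta, beta_p, p)
        p_new = c + 1.0 / (beta_p * (1.0 - s))
        if np.max(np.abs(p_new - p)) < tol:
            break
        p = p_new
    return p

delta = np.array([2.0, 1.5, 1.0])    # hypothetical mean utilities (attributes)
c = np.array([1.0, 1.0, 0.8])        # hypothetical constant marginal costs
beta_p = 1.5
p_eq = nash_prices(delta, beta_p, c)
s_eq = logit_share(delta, beta_p, p_eq)
print(np.round(p_eq, 3), np.round(s_eq, 3))
```

Computing `p_eq` once with the feature-poor utilities (A) and once with the feature-rich utilities (A∗) gives the equilibrium price change behind the EPP.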

4 Considerations in conjoint study design
Conjoint studies rely on general principles of survey design (Diamond, 2000) in collecting data for analysis. This includes the consideration of an appropriate sampling frame, screening questions, and conjoint design to provide valid data for analysis. Of




critical concern in designing a conjoint survey is to ensure that questions are posed in an unambiguous and easy-to-understand manner. Specifically, the conjoint attributes and their levels should be specified so as not to bias preference measurement or introduce unnecessary measurement error due to confusion or lack of understanding. Typically, conjoint design is informed by qualitative research prior to the drafting of the conjoint survey questionnaires. Questionnaires should also be pre-tested to determine if the survey is too long and if the information intended to be collected in the survey questions is understandable to survey respondents. For example, a survey designed to understand consumer demand for smartphones might include questions about product usage in the form of data and text charges. Not all respondents may have a clear understanding of the term ‘data,’ and this question might need to be phrased in terms of wording used in current advertisements (e.g., ‘talk’ and ‘text’). Records should be kept of both the qualitative and pre-testing phases of the survey design. In particular, revisions to the questionnaire should be documented. Modern conjoint studies are often administered using Internet panels where respondents are recruited to join a survey. Internet panel providers such as Survey Sampling International (https://www.surveysampling.com) or GfK (https://www.gfk.com) maintain a population of potential respondents who are emailed invitations to participate in a survey. Respondents usually do not know the exact nature of the survey (e.g., a survey about laundry detergent), which helps to reduce self-selection bias. Participants are then asked to complete a survey that typically begins with a series of demographic and screening questions to help establish the representativeness of the initial sample and to exclude respondents who do not qualify for inclusion in the survey.
As discussed below, the general population is usually not the intended target of the survey because not all people have the interest or ability to answer the survey questions. Next is a series of questions that document attitudes, opinions, and behaviors related to the focal product category. A glossary section follows in which the attributes and attribute levels of the conjoint study are defined, followed by the conjoint choice tasks. Additional classification variables for the analysis of subpopulations of interest are included at the end of the survey. We consider each of these survey components in detail below.

4.1 Demographic and screening questions
The incidence of product usage varies widely across product categories, and screening questions help ensure that only qualified respondents are surveyed. Respondents in a conjoint study should represent a target market of potential buyers who have pre-existing interests in the product category, often identified as people who have recently purchased in the product category and those who report they are considering a purchase in the near future. These individuals are referred to as “prospects” in marketing textbooks, defined as individuals with the willingness and ability to make purchases (Kotler, 2012). The purpose of the screening questions is to remove respondents who either lack the expertise to provide meaningful answers to the survey questions, or who are


not planning on making a purchase in the near future. The presence of screening questions creates some difficulty in assessing the representativeness of the survey sample. Demographic variables are available to profile the general population, but not prospects in a specific product category. Some product categories appeal to younger consumers and others to older consumers. Claiming that the resulting sample is demographically representative therefore relies on obtaining a representative sample prior to screening respondents out of the survey. There is a recent conjoint literature on incentive-alignment techniques used to induce greater effort among respondents so that their responses provide a ‘truer’ reflection of their actual preferences (Ding et al., 2005; Ding, 2007; Yang et al., 2018). The idea behind these techniques is that respondents will be more likely to provide thoughtful responses when offered an incentive that is derived from their choices, such as being awarded some version of the product under study that is predicted to give them high utility. In this literature, it is common to apply conjoint analysis to student samples in a lab setting. Improvements in predicting hold-out choices of these respondents are offered as evidence of increased external validity. However, a test of external validity needs to relate choice experiments to actual marketplace behavior. Shoppers in an actual marketplace setting may face time constraints and distractions after a busy workday that differ from a lab setting. Motivating lab subjects through incentive alignment does not necessarily lead to more realistic marketplace predictions and inferences. In industry-grade conjoint studies, respondents are screened so that conjoint surveys are relevant. Some panel providers can screen respondents based on past purchases from a given category to ensure familiarity with the product category.
Respondents thus have the opportunity to help improve products relevant to them, so there is little reason to assume they would receive utility from lying. Moreover, it is often not possible to conduct incentive-aligned studies because Internet panel providers do not allow additional incentives to be offered to panel members. At worst, respondents may respond in a careless manner. Respondents with low in-sample model fit can be screened out of the sample as discussed in Allenby et al. (2014b), Section 6. We show below that conjoint estimates can be closely aligned with marketplace behavior without the use of incentive-alignment by screening for people who actively purchase in the product category.

4.2 Behavioral correlates
Respondents participating in a conjoint study are often queried about their attitudes and opinions related to the product category. A study of smartphones would include questions about current payment plans, levels of activity (voice, text, Internet), and other electronic devices owned by the respondent. Questions might also involve details of their last purchase, competitive products that were considered, and specific product features that the respondent found attractive and/or frustrating to use. The purpose of collecting this data is two-fold: i) it encourages the respondent to think about the product and recall aspects of a recent or intended purchase that




are important to them; and ii) it provides an opportunity to establish the representativeness of the sample using data other than demographics. The behavioral correlates serve to ‘warm up’ the respondent so that they engage in the choice tasks with a specific frame of reference. It also provides information useful in exploring antecedents and consequences of product purchase through their relationship to estimates of the conjoint model. Behavioral covariates are often reported for products by various media, and these reports can be used as benchmarks for assessing sample representativeness. For example, trade associations and government agencies report on the number of products or devices owned by households, and the amount of time people spend engaged in various activities. There are also syndicated suppliers of data that can be used to assess product penetration, product market shares, and other relevant information to demand in the product category under study. Behavioral covariates are likely to be more predictive of product and attribute preferences than simple demographic information and, therefore, may be of great value in establishing representativeness.

4.3 Establishing representativeness
In many instances, a survey is done for the purpose of projecting results to a larger target population. In particular, a conjoint survey is often designed to project demand estimates based on the survey sample to the larger population of prospective buyers. In addition, many of the inference methods used in the analysis of survey data are based on the assumption that the sample is representative or, at least, acquired via probability sampling methods. All sampling methods start with a target population, a sample frame (a particular enumeration of the set of possible respondents to be sampled from), and a sampling procedure. One way of achieving representativeness is to use sampling procedures that ensure representativeness by construction. For example, if we want to construct a representative sample of dentists in the US, we would obtain a list of licensed dentists (the sample frame) and use probability sampling methods to obtain our sample. In particular, simple random sampling (or equal probability of selection) would produce a representative sample with high probability. The only way in which random sampling would not work is if the sample size was small. Clearly, this approach is ideal for populations for which there are readily available sample frames. The only barrier to representativeness for random samples is potential non-response bias. In the example of sampling dentists (regarding a new dental product, for example), it would be relatively easy to construct a random sample but the survey response rate could be very low (less than 50 per cent). In these situations, there is the possibility of a non-response bias, i.e., that those who respond to the survey have different preferences than those who do not respond. There is no way to assess the magnitude of non-response bias except to field a survey with a higher response rate.
This puts a premium on well-designed and short surveys and careful design of adequate incentive payments to reduce non-response. However, it is not always possible to employ probability-based sampling methods. Consider the problem of estimating the demand for a mid-level SUV prototype.


Here we use conjoint because this prototype is not yet in the market. Enumerating the target population of prospective customers is very difficult. One approach would be to start with a random sample of the US adult population and then screen this sample to only those who are in the market for a new SUV. The sample prior to screening is sometimes called the “inbound” sample. If this sample is representative, then clearly any resulting screened sample will also be representative, unless there is a high non-response rate to the screening questions. Of course, this approach relies on starting with a representative sample of the US adult population. There are no sample frames for this population. Instead, modern conjoint practitioners use internet panels maintained by various suppliers. These are not random samples but, instead, represent the outcome of a continuous operation by the supplier to “harvest” the email addresses of those who are willing to take surveys. Internet panel providers often tout the “quality” of their panels but, frequently, this is not a statement about the representativeness of the sample but merely that the provider undertakes precautions to prevent fraud of various types, including respondents who are bots or who live outside of the US. Statisticians will recognize that the internet panels offered commercially are what are called “convenience” samples, and there is no assurance of representativeness. This means that it is incumbent on the researcher who uses an internet panel to provide affirmative evidence that their sample is representative. It is our view that, with adequate affirmative evidence, samples that are based on internet panels can be used as representative. Internet panel providers are aware of the problem of establishing representativeness and have adopted a variety of approaches and arguments. The first argument is that their panel may be very large (exceeding 1 million).
The argument here is that size makes it more difficult for the internet panel to be skewed toward any particular subpopulation. This is not a very strong argument given that there are some 250 million US adults. Internet panels tend to over-represent older adults and under-represent the extremes of the income and education distributions. To adjust for possible non-representativeness, internet panel providers use “click-balancing.” Internet panel members are surveyed at regular intervals regarding their basic demographic (and many other) characteristics. The practice of “click-balancing” is used to ensure that the “inbound” sample is representative by establishing quotas. For example, if census data establishes that the US adult population is 51 per cent female and 49 per cent male, then the internet panel provider establishes quotas of male and female respondents. Once over quota, the internet provider rejects potential respondents. Typically, click-balancing is only used to impose quotas for age, sex, and region, even though many internet providers have a wealth of other information which could be used to implement click-balancing. Statisticians will recognize this approach as quota sampling. Quota sampling cannot establish representativeness unless the quantities that are measured by the survey are highly correlated with whatever variables are used to balance. If we click-balance only on age and gender, our conjoint demand estimates could be very non-representative unless product and attribute preferences are highly correlated with age and gender. This is unlikely in any real-world application. Our view is that one should measure a set of demographic variables that are most likely to be related to preference but also measure a set of behavioral correlates. For example, we might want to include ownership of current make and model cars to establish a rough correspondence between our inbound sample and the overall market share by make or type of car available from sources such as JD Power. We might also look at residence type, family composition, and recreational activities as potential behavioral correlates for the SUV prototype sample. We could use our conjoint survey constrained to only existing products to simulate market shares, which should be similar to the actual market shares for the set of products in the simulation. In short, we recognize that, for some general consumer products, probability samples are difficult to implement and that we must resort to the use of internet panels. However, we do not believe that click-balancing on a handful of demographic variables is sufficient to assert representativeness.

4.4 Glossary
The glossary portion of a survey introduces the respondent to the product features and levels included in the choice tasks. It is important that product attributes are described factually, in simple terms, and not in terms of the benefits that might accrue to a consumer. Doing so could educate the respondent about uses of the product that were not previously known and threaten the validity of the study. For example, a conjoint study of digital point-and-shoot cameras might include the attribute “WiFi enabled.” Benefits associated with this attribute include easy downloading and storage of pictures, and the ability to quickly post photos on social media platforms such as Facebook or Instagram. However, not all respondents in a conjoint survey use social media, and some may not make the connection between the attribute “WiFi enabled” and the benefits that accrue from its use. Moreover, there are many potential benefits of posting photos on social media, including telling others that you are safe, that you care about them, or that the photograph represents something you value. Including such a message is problematic for a conjoint study because the attribute description extends beyond a description of the product and introduces an instrumentation bias (Campbell and Stanley, 1963) into the survey instrument. The utility that respondents have for the attributes and features of a product depends on their level of knowledge of the attributes and how they can be useful to them. In some cases, firms may anticipate an advertising campaign to inform and educate consumers about the advantages of a particular attribute, and may want to include information on the benefits that can accrue from its use. However, incorporating such analysis into a conjoint study is problematic because it assumes that consumer attention and learning in the study is calibrated to levels of attention and learning in the marketplace.
A challenge in constructing an effective glossary is getting respondents to understand differences among the levels of an attribute being investigated. A study of


FIGURE 1 Screenshot of luxury car seat glossary.

automotive luxury car seats, for example, may include upgrades such as power head restraints, upper-back support, and independent thigh support (Kim et al., 2017). Respondents need to pay careful attention to the glossary to understand exactly how these product features work and avoid substituting their own definitions. An effective method of accomplishing this is to produce a video in which the attributes are defined, and requiring respondents to watch the video before proceeding to the choice task. A screenshot explaining upper-back support is provided in Fig. 1.

4.5 Choice tasks
The simplest case of conjoint analysis involves just two product features – brand and price – because every marketplace transaction involves these features. A brand-price analysis displays an array of choice alternatives (e.g., varieties of yogurt, different 12-packs of beer) and prices, and asks respondents to select the alternative that is most preferred. The purpose of a brand-price conjoint study is to understand the strength of brand preferences and their relationship to prices. That is, a brand-price analysis allows for share predictions of existing offerings at different prices. It should be emphasized that a conjoint design with only brand and price is likely to produce unrealistic results, as there are few products that can be characterized by only two features, however important. The inclusion of non-price attributes allows a conjoint study to expand beyond brands and prices. An example of a choice task used to study features of digital cameras is provided in Fig. 2 (Allenby et al., 2014a,b). Product attributes are listed



CHAPTER 3 Economic foundations of conjoint analysis

FIGURE 2 Example choice task.

on the left side of the figure, and attribute levels are provided in the cells of the feature grid. Brand and price are present along with the other product features. Also included, on the right side of the grid, is the 'no-choice' option, with which respondents can indicate that they would purchase none of the products described.

The choice task in Fig. 2 illustrates a number of aspects of conjoint analysis. First, the choice task does not need to include all brands and offerings available in the marketplace. Just four brands of digital cameras are included in the digital camera choice task, with the remaining cameras available for purchase represented by the no-choice option. Second, the choice task also does not need to include all the product features that are present in offerings. The brand name serves as a proxy for the unmentioned attributes of a brand in a conjoint study, and consumer knowledge of these features is what gives the brand name its value. For example, in a brand-price conjoint study, the brand name 'Budweiser' stands for a large number of taste attributes that would be difficult to enumerate in a glossary.

Researchers have found that breaking the conjoint response into two parts results in more accurate predictions of market shares and expected demand (Brazell et al., 2006). The two-part, or dual, response is illustrated in Fig. 3. The first part asks the respondent to indicate their preferred choice option, and the second part asks whether the respondent would really purchase their most preferred option. The advantage of this two-part response is that it slows down the respondent so that they think through the purchase task, and it results in a higher likelihood of a respondent selecting the no-choice option, leading to more realistic predictions of market shares.

Survey respondents are asked to express their preferences across multiple choice tasks in a conjoint study.
Statistical experimental design principles are used to construct the choice tasks, rotating the attribute levels across the choice options so that the data are informative about the part-worths (Box et al., 1978).
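A fully randomized design is a crude stand-in for the formal design principles of Box et al., but it illustrates the idea of rotating levels across options and checking that each level appears often enough for its part-worth to be identified. All attribute names, levels, and sizes below are hypothetical:

```python
import random
from collections import Counter

# Hypothetical attributes and levels, for illustration only.
ATTRIBUTES = {
    "brand": ["A", "B", "C", "D"],
    "price": [1.99, 2.99, 3.99],
    "wifi": ["yes", "no"],
}

def random_design(n_tasks=12, n_alts=4, seed=0):
    """Draw a randomized conjoint design: each alternative in each
    task receives one level of every attribute."""
    rng = random.Random(seed)
    return [
        [{a: rng.choice(levels) for a, levels in ATTRIBUTES.items()}
         for _ in range(n_alts)]
        for _ in range(n_tasks)
    ]

def level_counts(design, attribute):
    """Count how often each level of an attribute appears in the design."""
    return Counter(alt[attribute] for task in design for alt in task)

design = random_design()
counts = level_counts(design, "brand")
# Every brand level should appear so that its part-worth is estimable.
assert all(counts[b] > 0 for b in ATTRIBUTES["brand"])
```

In practice, fractional factorial or Bayesian optimal designs are used instead of pure randomization, precisely to guarantee this kind of level balance with far fewer tasks.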


FIGURE 3 Dual-response choice task.

The economic foundation of conjoint analysis rests on the assumption that consumers have well-defined preferences for offerings that are recalled when responding to the choice task. That is, consumer utility is not constructed based on the choice set, but is recalled from memory. This assumption is contested in the psychological choice literature (Lichtenstein and Slovic, 2006), where various framing effects have been documented (Roe et al., 2001). However, screening respondents for inclusion in a conjoint study largely reduces the effects of behavioral artifacts of choice. Screened respondents, who are familiar with the product category and understand the product features, are more likely to have well-developed preferences and are less likely to construct their preferences at the time of decision. We return to this issue below when discussing the robustness of conjoint results.

As discussed earlier, conjoint analysis can be conducted for decisions that involve volumetric purchases using utility functions that allow for the purchase of multiple goods. The responses can be non-zero for multiple choice alternatives, corresponding to an interior solution to a constrained maximization problem. Fig. 4 provides an illustration of a volumetric choice task for the brand-price conjoint study discussed by Howell et al. (2015).

4.6 Timing data

While samples based on internet panels are not necessarily representative, the internet format of survey research provides measurement capabilities that can be used to assess the validity of the data. In addition to timing the entire survey, researchers can measure the time spent reading and absorbing the glossary and the time spent on each of the choice tasks. This information can and should be gathered in the pre-test stage as well as in the fielding of the final survey. The questionnaire can be reformulated if there are pervasive problems with attention or "speeding." Sensitivity analyses with respect




FIGURE 4 Volumetric choice task.

to the inclusion of respondents who appear to be giving the survey little attention are vital to establishing the credibility of results, since many practitioners have observed that at least some respondents view conjoint surveys as tedious. Clearly, there is a limit to the amount of effort any one respondent is willing to devote to even the most well-crafted conjoint survey, and conjoint researchers should bear this in mind before designing unnecessarily complex or difficult conjoint tasks.
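A possible screening sketch based on such timing data follows; the record layout, field names, and thresholds are all illustrative, and in practice thresholds should come from the pre-test:

```python
# Hypothetical per-respondent timing records (seconds).
responses = [
    {"id": 1, "glossary_s": 95, "median_task_s": 14.0},
    {"id": 2, "glossary_s": 4,  "median_task_s": 2.1},   # likely "speeding"
    {"id": 3, "glossary_s": 60, "median_task_s": 9.5},
]

def flag_speeders(responses, min_glossary_s=10, min_task_s=5.0):
    """Flag respondents whose glossary or choice-task times suggest
    they are not paying attention to the survey."""
    return [r["id"] for r in responses
            if r["glossary_s"] < min_glossary_s
            or r["median_task_s"] < min_task_s]

assert flag_speeders(responses) == [2]
```

Sensitivity analysis then amounts to re-estimating the model with and without the flagged respondents and comparing the results.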

4.7 Sample size

For a simple estimator such as a sample mean or sample proportion, it is a relatively simple matter to undertake sample size computations. That is, with some idea of the variance of observations, sample sizes sufficient to reduce sampling error or posterior uncertainty below a specific margin can be determined. Typically, for estimation of sample proportions, sample sizes of 300-500 are considered adequate. However, a conjoint survey based on a valid economic formulation is designed to allow for inference regarding market demand and various other functions of demand, such as equilibrium prices. In these contexts, there are no rules of thumb that can easily be applied to establish adequate sample sizes. In a full Bayesian analysis, the posterior distribution of any summary of the conjoint data, however complicated, can easily be constructed using the draws from the posterior predictive distribution of


the respondent-level parameters (Eq. (19)). This posterior distribution can be used to assess the reliability or statistical information in a given conjoint sample. However, analytical expressions for these quantities are typically not available. All we know is that the posterior distribution tightens (at least asymptotically) at the rate of the square root of the number of respondents. This does not help us plan sample sizes, in advance, for any specific function of demand parameters. The only solution to this problem is to perform a pilot study and scale the sample off of this study in such a way as to assure a given margin of error or posterior interval. This can be done by assuming that the posterior standard error will tighten at rate √N.

Our experience with equilibrium pricing applications is that sample sizes considerably larger than what is often viewed by conjoint practitioners as adequate are required. Many conjoint practitioners assume that a sample size of 500-1000 with 10-12 conjoint tasks will be adequate. This may be on the low side for more demanding computations. We hasten to add that many conjoint practitioners do not report any measures of statistical reliability for the quantities that they estimate. Given the ease of constructing the posterior predictive distribution of any quantity of interest, there is no excuse for failing to report a measure of uncertainty.
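The pilot-based scaling can be sketched as follows, assuming the posterior standard error shrinks at rate √N (the numbers are purely illustrative):

```python
import math

def required_n(pilot_n, pilot_se, target_se):
    """Scale a pilot sample to hit a target posterior standard error,
    assuming the SE of the quantity of interest tightens at rate sqrt(N)."""
    return math.ceil(pilot_n * (pilot_se / target_se) ** 2)

# e.g. a pilot of 200 respondents yields SE 0.08 for some equilibrium-price
# summary; reaching SE 0.02 (a 4x reduction) requires a 16x larger sample.
assert required_n(200, 0.08, 0.02) == 3200
```

Note that because halving the standard error quadruples the required sample, demanding summaries such as equilibrium prices can easily push the sample well beyond conventional rules of thumb.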

5 Practices that compromise statistical and economic validity

While conjoint originated as a method to estimate customer preferences or utility, many practitioners of conjoint have created alternative methodologies which either invalidate statistical inference or compromise the economic interpretation of conjoint results. As long as there are well-formulated and practical alternatives, it is our view that there is no excuse for using a method that is not statistically valid. Whether conjoint methods should also be selected so that the results can be interpreted as measuring a valid economic model of demand is more controversial. A pure predictivist point of view is that any procedure is valid for predicting demand as long as it predicts well. This point of view opens the range of possible conjoint specifications to any specification that can predict well. Ultimately, however, conjoint is most useful in predicting or simulating demand in market configurations which differ from those observed in the real world. Here we are expressing our faith that specifications derived from valid utility structures will ultimately prevail in a prediction context where the state of the world is very different from that which currently prevails.

5.1 Statistical validity

There are two threats to the statistical validity of conjoint analyses: 1) the use of estimation methods which have not been shown to provide consistent estimators, and 2) improper methods for imposing constraints on conjoint part-worths.




5.1.1 Consistency

A consistent estimator is an estimator whose sampling distribution collapses on the true parameter values as the sample size grows infinitely large. Clearly, this is a very minimal property; there are consistent but very inefficient estimators. The important point is that, when choosing an estimation procedure, we should only select from the set of procedures that are consistent. The only reason one might resort to using a procedure whose consistency has not been established is if there are no practical consistent alternatives. Even then, our view is that any analysis based on a procedure for which consistency cannot be verified must be termed tentative at best. In conjoint, however, we do not have to resort to unverified procedures, since all Bayes procedures are consistent unless the prior is dogmatic in the sense of putting zero probability on a non-trivial portion of the parameter space. Moreover, Bayes procedures are admissible, which means that it will be very difficult to find a procedure which dominates Bayes methods in either estimation or prediction (Bernardo and Smith, 2000).

In spite of these arguments in favor of only using methods for which consistency can be demonstrated, there has been work in marketing on conjoint estimators (see Toubia et al., 2004) that proposes estimators based on minimization of some criterion function. It is clear that, for a given criterion function, there is a minimum which defines an estimator. However, what is required to establish consistency is a proof that the criterion function used to derive the estimator converges (as the sample size grows to infinity) to a function with a minimum at the true parameter value. This cannot be established by sampling experiments alone, as this convergence is required over the entire parameter space.
The fact that an estimator works "well" in a few examples in finite samples does not mean that the estimator has good finite sample or asymptotic properties. We should note that there are two dimensions of conjoint panel data (N, the number of respondents, and T, the number of choice tasks). We are using large-N, fixed-T asymptotics in the definition of consistency. Estimates of respondent-level part-worths are likely to be unreliable, as they are based only on a handful of observations and an informative prior, and they cannot be shown to be consistent if T is fixed.

A common practice in conjoint analysis is to use respondent-level coefficients to predict the effects of changes to product attributes and price. An advantage of Bayesian methods of estimating conjoint models is the ability to provide individual-level estimates of part-worths (θh) in Eq. (16) in addition to the hyper-parameters that describe the distribution of part-worths (τ). Using the individual-level estimates for predicting the effects of product changes on sales, however, is problematic because of the shallow nature of the data used in conjoint analysis. Respondents provide at most about 16 responses to the choice tasks before becoming fatigued, and while these estimates may be consistent in theory, they are not consistent in practice because of data limitations.

The effect of using individual-level estimates is to under-state the confidence associated with predicted effects. That is, the confidence intervals are too large when


using the individual-level estimates. This is because uncertainty in the individual-level estimates is due to two factors: uncertainty in the hyper-parameters and uncertainty arising from the individual-level data. As the sample size increases in a conjoint study, it is only possible to increase the number of respondents N, not the number of observations per respondent T. As a result, the individual-level estimates will always reflect a large degree of uncertainty, even when the hyper-parameters are accurately estimated. The accurate measurement of uncertainty in predictions from conjoint analysis must therefore be based on the model hyper-parameters, as shown in Eq. (21).

5.1.2 Using improper procedures to impose constraints on part-worths

The data used in conjoint analysis can be characterized as shallow in the sense that there are few observations per respondent. Individual-level parameters (θh) can therefore be imprecisely estimated and can violate standard economic assumptions. Price coefficients, for example, are expected to be negative, in that people should prefer to pay less for a good rather than more, and attribute-levels may correspond to an ordering in which consumers should want more of a feature or attribute, holding all else constant.

There are two approaches for introducing this prior information into the analysis. The first is to reparameterize the likelihood so that algebraic restrictions on coefficients are enforced. For example, the price coefficient in Eq. (5) can be forced to be negative through the transformation ψz = −exp(βp), estimating βp unrestricted. Alternatively, constraints can be introduced through the prior distribution, as discussed by Allenby et al. (1995). Sign constraints as well as monotonicity can be imposed automatically using our R package, bayesm.

It is a common practice in conjoint analysis to impose sign restrictions by simply zeroing out the offending estimates, or to use various "tying" schemes for ordinal constraints in which estimates are arbitrarily set equal to other estimates. This is an incoherent practice which violates Bayes theorem and, therefore, removes the desirable properties of Bayes procedures.
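The reparameterization approach can be sketched with a hypothetical two-alternative logit (this is an illustration of the transformation, not the chapter's Eq. (5) implementation): the unrestricted parameter is the log of the price coefficient's magnitude, so the coefficient itself is always negative.

```python
import math

def choice_prob(x_attrs, prices, beta, beta_p_tilde):
    """Logit choice probabilities with the price coefficient forced
    negative via the transformation beta_p = -exp(beta_p_tilde)."""
    beta_p = -math.exp(beta_p_tilde)  # always < 0, for any real beta_p_tilde
    v = [sum(b * x for b, x in zip(beta, xa)) + beta_p * p
         for xa, p in zip(x_attrs, prices)]
    m = max(v)                         # subtract max for numerical stability
    e = [math.exp(vi - m) for vi in v]
    s = sum(e)
    return [ei / s for ei in e]

# Raising an alternative's price must lower its choice probability.
p_low  = choice_prob([[1, 0], [0, 1]], [2.0, 3.0], [0.5, 0.2], 0.0)
p_high = choice_prob([[1, 0], [0, 1]], [4.0, 3.0], [0.5, 0.2], 0.0)
assert p_high[0] < p_low[0]
```

Because the transformation holds for every value of the unrestricted parameter, a sampler or optimizer can explore the whole real line while the implied price coefficient never violates the sign constraint.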

5.2 Economic validity

There are four threats to the economic validity of conjoint analyses:
1. Using conjoint specifications contrary to valid utility or indirect utility functions.
2. Various sorts of self-explicated conjoint which violate utility theory.
3. Comparison of part-worths across respondents.
4. Attempts to combine conjoint and rankings data.

5.2.1 Non-economic conjoint specifications

We have emphasized that conjoint specifications should be derived from a valid direct utility function (see (1) or (7)). For discrete-choice conjoint, the respondent choice probabilities are a standard logit function of a linear index where prices enter in




levels, not in logs. It is common for practitioners to enter prices as a sequence of dummy variables, one for each level of price used in the conjoint design. The common explanation is that the dummy variable specification is "non-parametric." This approach is not only difficult to reconcile with economic theory but also opens the investigator to a host of violations of the reasonable economic assumption that indirect utility is monotone in price (not to mention convex). In general, utility theory imposes a certain discipline on investigators to derive the empirical specification from a valid utility function. In the volumetric context, the problems are even worse, as the conjoint specifications are often completely ad hoc. This ad-hocery arises, in part, from the lack of software to implement an economically valid volumetric conjoint – a state of affairs we are trying to remedy with our new R package, echoice.5

5.2.2 Self-explicated conjoint

Some researchers advocate using self-explicated methods of estimating part-worths (see, for example, Srinivasan and Park, 1997; Netzer and Srinivasan, 2011). In some forms of self-explicated conjoint, respondents are asked to rate the relative importance of a product feature on some sort of integer-valued scale, typically 5 or 7 points. In other versions, separate measures of the relative importance and desirability of product attributes are combined in an arbitrary way to form an estimate of a part-worth.

These procedures violate both economic and statistical principles in many ways. Outside of the demand context (as in choice-based or volumetric conjoint), there is no meaning to the "importance" or "desirability" of features. The whole point of a demand experiment is to infer a valid utility function from demand responses. No one knows what either "importance" or "desirability" means, including the respondents. The scales used are only ordinal and therefore cannot be converted to a utility scale, which is an interval scale. In short, there is no way to transform or convert relative importance or desirability to a valid part-worth. Finally, as there is no likelihood function (or error term) in these models, it is impossible to analyze the statistical properties of self-explicated procedures.

5.2.3 Comparing raw part-worths across respondents

Conjoint part-worths provide a measure of the marginal utility associated with changes in the levels of product attributes. The utility measure obtained from a conjoint analysis allows for the relative assessment of changes in product attribute-levels, including price. However, utility is not a measure that is comparable across respondents, because it is only intended to reflect the preference ordering of a respondent, and a preference ordering can be reflected by any monotonic transformation of the utility scale. That is, the increase or decrease in utility associated with changes in the attribute-levels is person-specific, and cannot be used to make statements that one respondent values changes in the levels of an attribute more than another.

5 Development version available at https://bitbucket.org/ninohardt/echoice/.


Making utility comparisons across respondents requires the monetization of utility to express the value of changes in the levels of an attribute on a scale common to all respondents. While the pain or gain of changes to a product attribute is not comparable across people, the amount they would be willing to pay is comparable across respondents. The WTP, WTB, and EPP measures discussed above provide a coherent metric for summarizing the results of conjoint analysis across respondents.
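A small sketch of this monetization follows, with purely hypothetical numbers: WTP is computed as the part-worth divided by the magnitude of the (negative) price coefficient, putting each respondent's value for the feature on a common dollar scale.

```python
# Hypothetical respondent-level estimates, for illustration only.
respondents = [
    {"theta_feature": 0.8, "beta_price": -0.4},
    {"theta_feature": 1.2, "beta_price": -2.0},
]

def wtp(theta, beta_price):
    """Monetize a part-worth: dollars the respondent would pay for the feature."""
    return theta / -beta_price

wtps = [wtp(r["theta_feature"], r["beta_price"]) for r in respondents]
# The second respondent has the LARGER raw part-worth (1.2 > 0.8) but the
# SMALLER willingness to pay ($0.60 < $2.00): raw utilities are not
# comparable across people, while dollar amounts are.
assert wtps[1] < wtps[0]
```

This is why raw part-worth comparisons across respondents can be misleading even when the monetized comparison is perfectly well defined.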

5.2.4 Combining conjoint with other data

One could argue that there are covariates predictive of preferences that can be measured outside the conjoint portion of the survey. The proper way to combine these data with conjoint choice or demand data is to employ a hierarchical model specification in which individual-level parameters are related to each other in the upper level of the model hierarchy. The upper-level model in conjoint analysis is typically used to describe cross-sectional variation in part-worth estimates, using these covariates to model observed heterogeneity. Individual-level coefficients from other models, calibrated on other datasets, can be used as variables to describe the cross-sectional variation of part-worths (Dotson et al., 2018). Combining data in this way automatically weights the datasets and can improve the precision of the part-worth estimates.

A disturbing practice is the combination of conjoint choice data with Max-Diff rankings data. Practitioners are well aware that conjoint surveys are not popular with respondents, whereas a standard Max-Diff exercise is found to be very easy for respondents. Max-Diff is a procedure to rank (by relative importance) any set of attributes or products. The Max-Diff procedure breaks the task of ranking the entire set down into a sequence of smaller and more manageable tasks, each of which consists of picking the most and least "important" from a small set of alternatives. A logit-style model can be used to analyze Max-Diff data to provide a ranking for each respondent.

It is vital to understand that rankings are not utility weights; rankings have only ordinal properties. The exact coefficients used to implement the ranking are irrelevant and have no value. That is to say, the ranking of three things represented by (1, 3, 2) is the same as (10, 21, 11). There is no meaning to the intervals separating values or to ratios of values. This is true even setting aside the thorny question of what "importance" means.
There are no trade-offs in Max-Diff analysis, so there is no way to predict choice or demand behavior on the basis of the Max-Diff-derived rankings. Unfortunately, some practitioners have taken to scaling Max-Diff ranking coefficients and interpreting these scaled coefficients as part-worths. They are not. There is no coherent way of combining Max-Diff and conjoint data; the only route would be to regard "importance" as the same as utility and to use the Max-Diff results as the basis of an informative prior in the analysis of the conjoint data.
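The ordinal-only nature of ranking coefficients is easy to demonstrate in a few lines (the `ranking` helper is illustrative): two coefficient vectors with entirely different intervals and ratios carry exactly the same information.

```python
def ranking(scores):
    """Rank items by score (1 = lowest). Only the ordering of the
    scores matters; their magnitudes are discarded."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

# The intervals and ratios differ completely between the two vectors,
# yet the ordinal content is identical.
assert ranking([1, 3, 2]) == ranking([10, 21, 11]) == [1, 3, 2]
```

Any monotone transformation of the coefficients leaves the ranking unchanged, which is precisely why scaled Max-Diff coefficients cannot be treated as interval-scaled part-worths.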

6 Comparing conjoint and transaction data

The purpose of conducting a conjoint analysis is to predict changes in demand arising from changes in a product's configuration and its price. Conjoint analysis provides an experimental setting for exploring these changes when revealed preference data from the marketplace lack sufficient variation for making inferences and predictions. A natural question to ask is the degree to which conjoint analysis provides reliable estimates of the changes that would actually occur. In this chapter we have argued that there are many requirements for conducting a valid conjoint study, beginning with the use of a valid economic model for conducting inference and the use of rigorous methods in data collection to make sure the respondent is qualified to provide answers and understands the choice task.

We investigate the consistency of preferences and implications between conjoint and transaction data using a dataset containing marketplace transactions and conjoint responses from the same panelists. Frequent buyers in the frozen pizza product category were recruited to provide responses to a conjoint study in which frozen pizza attribute-levels changed. The fact that participants were frequent purchasers in the product category made them ideal respondents for the conjoint survey, in the sense that they are known to be prospects who were well acquainted with the category.

Most attempts to reconcile results from stated and revealed preference data have tended to focus on aggregate predictors of demand and market shares. Examples range from frequently purchased items such as vegetables (Dickie et al., 1987) and grocery items (Burke et al., 1992; Ranjan et al., 2017) to infrequently purchased goods such as automobiles (Brownstone et al., 2000). Lancsar and Swait (2014) provide an overview of studies across different disciplines. We start by investigating the consistency of estimated parameters, including estimates of marginal utility. We then assess the extent to which market demand predictions can be made using conjoint-based estimates. Finally, we show estimates of measures of economic value.

6.1 Preference estimates

The conjoint study choice task involved six frozen pizza offerings and included a 'no-choice' option. The transaction data comprised 103 unique product offerings and included attributes such as brand name (e.g., DiGiorno, Red Baron, and Tombstone), crust (i.e., thin, traditional, stuffed, rising), and toppings (e.g., pepperoni, cheese, supreme). The volumetric demand model arising from non-linear utility (Eq. (7)) was used to estimate the model parameters, because pizza purchases typically involve the purchase of multiple units and varieties. Table 1 provides a list of the product attributes. The attributes and attribute-levels describing the 103 UPCs were used to design the conjoint study.

Among the 297 households in an initial transaction dataset, 181 households responded to a volumetric conjoint experiment and had more than 5 transactions in the 2-year period. Qualifying respondents in this way ensures that they are knowledgeable about the category and typical attribute levels. In each of the 12 choice tasks, respondents chose how many units of each of the 6 product alternatives they would purchase the next time they are looking to buy frozen pizza. A sample choice


Table 1 Attributes.

Brand: DiGiorno, Frescetta (Fr), Red Baron (RB), Private Label (Pr), Tombstone (Tm), Tony's (Tn)
Size: Serves one, Serves two (FT)
Crust: Thin, Traditional (TC), Stuffed (SC), Rising (RC)
Topping type: Pepperoni, Cheese (C), Vegetarian (V), Supreme (Sr), PepSauHam (PS), Hawaii (HI)
Topping spread: Moderate, Dense (DT)
Cheese: No claim, Real (RC)

FIGURE 5 Choice task.

task is shown in Fig. 5. Price levels were chosen in collaboration with the sponsoring company to mimic the actual price range.




Table 2 Estimated parameters (volumetric model). Mean of random-effects distribution θ̄.

                        Conjoint          Transaction
β0                      −2.91 (0.11)      −4.66 (0.18)
Brand
  Frescetta             −0.35 (0.09)      −0.19 (0.11)
  Red Baron             −0.66 (0.10)      −0.38 (0.12)
  Private Label         −0.72 (0.10)      −0.85 (0.15)
  Tombstone             −0.80 (0.11)      −0.63 (0.16)
  Tony's                −1.29 (0.13)      −2.05 (0.17)
Size
  Serves two             0.64 (0.07)       1.22 (0.08)
Crust
  Traditional            0.11 (0.07)       0.31 (0.07)
  Stuffed               −0.04 (0.08)      −0.04 (0.15)
  Rising                 0.07 (0.08)       0.26 (0.07)
Topping type
  Cheese                −0.40 (0.12)      −0.57 (0.09)
  Vegetarian            −1.12 (0.16)      −0.96 (0.14)
  Supreme               −0.30 (0.11)      −0.39 (0.09)
  PepSauHam             −0.13 (0.09)      −0.22 (0.08)
  Hawaii                −1.14 (0.15)      −0.99 (0.21)
Topping spread
  Dense                  0.06 (0.05)      −0.02 (0.06)
Cheese
  Real                   0.10 (0.05)      −0.01 (0.10)
ln γ                    −0.50 (0.08)      −2.07 (0.09)
ln E                     3.57 (0.07)       2.89 (0.06)
ln σ                    −0.57 (0.05)      −0.87 (0.05)

Boldfaced parameters signify that the 95% posterior credible interval of the estimate does not include zero. Standard deviations printed in parentheses.

We use dummy coding in our model estimation, where the first level of each attribute in Table 1 is the reference level. The vector of 'part-worths' β includes a baseline coefficient β0, which represents the value of an inside good relative to the outside good. The remaining elements of β refer to the dummy coefficients. We use the volumetric demand model described in Section 2.2. Individual-level parameters are given by the vector of 'part-worths' β, the rate of satiation of the inside goods γ, the allotted budget E, and the scale of the error term σ. The latter three parameters are log-transformed to ensure positivity. A multivariate normal distribution of heterogeneity is assumed with default (diffuse) priors.

Parameter estimates for the conjoint and transaction data are provided in Table 2. The left side of the table reports estimates from the conjoint data, and estimates from the transaction data are displayed on the right side. All estimates are based on models with Type 1 extreme value error terms.
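This dummy coding can be sketched as follows, using two of the attributes from Table 1; the `dummy_code` helper and the profile dictionaries are illustrative, not the estimation code:

```python
# First level of each attribute is the reference level, as in Table 1.
LEVELS = {
    "brand": ["DiGiorno", "Frescetta", "Red Baron", "Private Label",
              "Tombstone", "Tony's"],
    "crust": ["Thin", "Traditional", "Stuffed", "Rising"],
}

def dummy_code(alt):
    """One 0/1 column per non-reference level; the reference level
    (first in each list) maps to all zeros for that attribute."""
    x = []
    for attr, levels in LEVELS.items():
        x += [1 if alt[attr] == lev else 0 for lev in levels[1:]]
    return x

# DiGiorno / Thin is the reference profile: all 5 + 3 = 8 dummies are zero.
assert dummy_code({"brand": "DiGiorno", "crust": "Thin"}) == [0] * 8
assert dummy_code({"brand": "Red Baron", "crust": "Stuffed"}) == \
       [0, 1, 0, 0, 0, 0, 1, 0]
```

Each dummy coefficient in β is then read as the utility of that level relative to the reference level of the same attribute.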


FIGURE 6 Comparison of part-worths (volumetric model).

We find general agreement among the estimates, except for the inside good intercept (β0), the estimated rate of satiation (γ), the budgetary allotment (E), and the scale of the error terms (σ). That is, the relative values of the brand names and product attribute-levels are approximately the same when estimated with either conjoint or transaction data. Fig. 6 provides a plot of the means of the random-effects distribution for the two datasets. Average part-worth estimates are plotted close to the 45 degree line, indicating that the estimates are similar.

There are a number of potential reasons for the differences in estimates of the brand intercept (β0), satiation parameter (γ), and budgetary allotment (E). Respondents in the conjoint task were told to consider their choices the next time they were buying frozen pizza, while the transaction data were conditioned on some purchase in the frozen pizza category. Most respondents in the dataset purchased frozen pizza less than 10 times over a two-year period, and including data from shopping occasions in which they did not make a frozen pizza purchase would implicitly assume that they were in the market for pizza on each and every occasion. We therefore excluded shopping occasions from the transaction data in which pizza wasn't purchased, so that the data would mimic the conjoint setting in which respondents were told to consider their next purchase occasion. There is no way of knowing for certain when shoppers considered making a frozen pizza purchase, but did not, in the revealed preference (transaction) data, and this discrepancy is partly responsible for the difference in the brand intercept (β0) estimates.

Conjoint data cannot account for variation in the context of purchase and consumption, and this limitation may lead to differences in estimates of satiation (γ) and budgetary allotment (E).
For example, frozen pizza may occasionally be purchased for social gatherings, which may not be taken into account when providing conjoint responses, resulting in an estimate of the budgetary allotment that is too high for the




FIGURE 7 Comparison of monetized part-worths (discrete choice model).

typical conjoint transaction. The over-estimation of E may also affect the estimate of the rate of satiation. Another consequence of the hypothetical nature of conjoint tasks is that respondents may apply a larger budget when making allocations because they are not actually spending their own money. We find that the conjoint data reflect less satiation and a greater budgetary allotment than found in the revealed preference data.

We also investigate a discrete choice approximation to the demand data by 'exploding' the volumetric data to represent a series of discrete choices. When q units of a good are purchased, this is interpreted as q transactions with a discrete choice for that good, which allows a discrete choice model to be estimated on both the volumetric conjoint and transaction data. However, this practice results in no consumption of an 'outside good' in the transaction data, because only nonzero quantities are observed. We estimate a hierarchical logit model with a multivariate normal distribution of heterogeneity and relatively diffuse priors. The price coefficient βp is re-parameterized to ensure its negativity.

Estimated coefficients are shown in Table 3. Monetized estimates of the part-worths, obtained by dividing each part-worth by the price coefficient, are compared in Fig. 7, where we use the corresponding means of the random-effects distribution. We find close agreement of the part-worth estimates from the transaction and conjoint data.
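The 'exploding' step can be sketched in a few lines (the `explode` helper and the record format are illustrative): each purchase of q units becomes q discrete-choice records, and occasions with zero quantity contribute nothing, which is exactly why the outside good disappears from the exploded transaction data.

```python
def explode(purchases):
    """'Explode' volumetric data: a purchase of q units of good j becomes
    q discrete-choice records for j. Zero-quantity goods vanish, so the
    no-purchase outside good is lost in the exploded data."""
    return [good for good, q in purchases for _ in range(q)]

# Two units of good "A" and one unit of good "C" become three discrete
# choices; good "B", with quantity zero, leaves no record at all.
assert explode([("A", 2), ("B", 0), ("C", 1)]) == ["A", "A", "C"]
```

The exploded records can then be stacked into a standard logit likelihood, at the cost of treating multiple units bought on one occasion as independent choices.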

6.2 Marketplace predictions

We next compare marketplace predictions from conjoint and transaction data by aggregating across the 103 UPCs to obtain brand-level demand estimates. Parameter estimates from the volumetric conjoint and transaction data are used to produce two


Table 3 Estimated parameters (discrete choice model): mean of the random-effects distribution θ̄.

                     Conjoint         Transaction
β0                    3.67 (0.31)         –
Brand
  Frescetta          −0.25 (0.11)    −0.58 (0.23)
  Red Baron          −0.60 (0.11)    −1.61 (0.27)
  Private Label      −0.73 (0.11)    −2.90 (0.38)
  Tombstone          −0.60 (0.11)    −2.22 (0.43)
  Tony’s             −0.80 (0.13)    −5.45 (0.47)
Serves two            0.28 (0.07)     3.63 (0.30)
Crust
  Traditional         0.04 (0.08)     0.59 (0.15)
  Stuffed            −0.01 (0.08)     0.45 (0.25)
  Rising             −0.03 (0.08)     0.71 (0.16)
Topping type
  Cheese             −0.22 (0.12)    −1.33 (0.21)
  Vegetarian         −0.52 (0.13)    −2.09 (0.35)
  Surpreme           −0.12 (0.11)    −0.96 (0.24)
  PepSauHam          −0.06 (0.09)    −0.46 (0.19)
  Hawaii             −0.38 (0.13)    −1.60 (0.24)
(unlabeled)           0.02 (0.06)    −0.03 (0.11)
(unlabeled)           0.09 (0.06)    −0.16 (0.25)
ln βp                −1.93 (0.20)    −0.33 (0.10)

Boldfaced parameters signify that the 95% posterior credible interval of the estimate does not include zero. Standard deviations printed in parentheses.
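The monetization of part-worths compared in Fig. 7 divides each part-worth by the price coefficient, which Table 3 reports on the log scale. A minimal sketch using the Frescetta brand row and the ln βp row of Table 3 (the calculation itself is a straightforward illustration, not code from the chapter):

```python
import math

# Sketch: monetizing a part-worth divides it by the price coefficient.
# Table 3 reports ln(beta_p), so beta_p = exp(ln beta_p), which enforces
# a negative price effect in utility. Values are the Frescetta brand row
# and ln beta_p row of Table 3.
ln_beta_p = {"conjoint": -1.93, "transaction": -0.33}
frescetta = {"conjoint": -0.25, "transaction": -0.58}

monetized = {d: frescetta[d] / math.exp(ln_beta_p[d]) for d in frescetta}
print({d: round(v, 2) for d, v in monetized.items()})
```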

forecasts for each brand, displayed in Fig. 8 (predicted volume) and in Fig. 9 (the same predictions converted to shares). We find that the demand curves in Fig. 8 are roughly parallel, with predictions from the conjoint data consistently higher than those based on the transaction data. The shift is due to differences in the estimate of the brand coefficient (β0), which we attribute to differences in the treatment of the ‘no-choice’ option: the smaller estimated brand intercept in the transaction data results in lower estimates of demand. While the level of demand differs in Fig. 8, the change in demand for changes in price and other product attributes is approximately the same, both because the demand curves are roughly parallel to each other and because the part-worth coefficients enter the Kuhn-Tucker conditions in Eq. (12) linearly. The consistency of the estimates from conjoint and transaction data is more readily observed when we convert the volumetric predictions to shares in Fig. 9.

It is useful to remember that the purpose of conjoint analysis is to predict changes in marketplace demand as a result of changes in the formulation of marketplace offerings.

CHAPTER 3 Economic foundations of conjoint analysis

FIGURE 8 Comparison of predictions (volumetric model).

FIGURE 9 Comparison of predictions converted to shares (volumetric model).

Even though the aggregate demand curves displayed in Figs. 8 and 9 are vertically translated, predictions of the change in volume and of the change in share are close to each other.

6.3 Comparison of willingness-to-pay (WTP)

We compute the consumer willingness-to-pay (WTP) for the family-sized (‘for two’) attribute of a DiGiorno pepperoni pizza with rising crust and a real-cheese claim


Table 4 Willingness-to-pay estimates (volumetric and logit), ‘for-two’ attribute.

                          WTP                                     p-WTP
             Conjoint   Transaction   With β0trans (a)   Conjoint   Transaction
Logit
  mean         0.034       0.027           –               0.782       3.315
  median       0.029       0.026           –               0.776       3.278
  perc25      −0.009       0.021           –               0.704       3.185
  perc75       0.074       0.031           –               0.859       3.430
Volumetric
  mean         0.113       0.013         0.023               –           –
  median       0.070       0.002         0.009               –           –
  perc25       0.020       0.000         0.001               –           –
  perc75       0.144       0.015         0.026               –           –

(a) Volumetric WTP estimates based on conjoint data except for brand intercepts, which are substituted from transaction data estimates.

attribute. As discussed above, a true WTP measure includes the effects of alternative products that consumers could consider if a particular attribute is unavailable, and is not simply a monetary rescaling of the ‘for two’ part-worth. Ignoring the effect of competitive offerings over-states the value of product attributes because it ignores the other options consumers have available to them as they make their purchases.

We compare estimates of pseudo willingness-to-pay (p-WTP) based on part-worth monetization (i.e., βh/βhp) to estimates of willingness-to-pay (WTP) based on compensating valuation calculations (see Eqs. (25) and (26)). Table 4 shows estimates for the logit and volumetric demand models, using parameter estimates based on conjoint and transaction data. The top portion of Table 4 reports results for the logit model with ‘exploded’ data, and the bottom portion pertains to the volumetric demand model. WTP estimates are reported on the left side of the table, and pseudo-WTP estimates are reported on the right side.

We find that the WTP estimates for the logit model are much smaller than the p-WTP estimates because of the large number of choice alternatives present in the marketplace. The absence of a ‘for-two’ DiGiorno pepperoni pizza creates an economic loss that is worth, on average, about three cents using the WTP statistic, as opposed to either 78 cents or $3.31 based on the p-WTP estimate. The loss in utility is much smaller in the WTP calculation because consumers recognize that they can purchase a different option to generate utility and are not constrained to purchase the same good. Moreover, we find that estimates of WTP for the logit model are about three cents using conjoint estimates of the part-worths, and about two cents using the transaction data estimates. These estimates are not statistically different from each other.
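The gap between the two statistics can be illustrated with a logit log-sum calculation. The sketch below is not from the chapter: Eqs. (25) and (26) are not reproduced here, so it follows the compensating-valuation logic only in outline, and every utility and coefficient value in it is invented.

```python
import math

# Illustrative calculation, not from the chapter: contrast pseudo-WTP (a pure
# monetary rescaling of a part-worth) with WTP computed as a compensating
# valuation in a logit model, where expected maximum utility is the log-sum.
# All utilities and coefficients below are invented.

beta_feature = 0.28        # hypothetical part-worth of the feature
beta_p = math.exp(-1.93)   # price coefficient, re-parameterized as exp(ln beta_p)

# pseudo-WTP ignores competing alternatives entirely
p_wtp = beta_feature / beta_p

# WTP compares the log-sum with the featured product present vs. absent;
# nearby substitutes cushion the loss of the feature
def logsum(vs):
    return math.log(sum(math.exp(v) for v in vs))

v_other = [1.1, 0.7, 0.4, 0.0]   # deterministic utilities of competing options
v_focal = 1.2                    # focal product without the feature

wtp = (logsum(v_other + [v_focal + beta_feature])
       - logsum(v_other + [v_focal])) / beta_p

print(round(p_wtp, 2), round(wtp, 2))  # WTP is much smaller than p-WTP
```

The closer the substitutes, the smaller the log-sum difference, which is exactly why WTP shrinks as the number of marketplace alternatives grows.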
WTP estimates based on the volumetric demand model are less consistent, with estimates based on the conjoint data equal to eleven cents versus one cent for the transaction data. This difference is due to differences in the estimated baseline intercept coefficient (β0). When the transaction-data intercept is substituted for the conjoint intercept, the estimated WTP for the conjoint data falls from eleven cents to two cents. Thus, overall, we find that estimates based on the conjoint data are slightly, but not significantly, higher than those based on the transaction data for both models once the difference in the baseline intercept is aligned.

7 Concluding remarks

Conjoint analysis is an indispensable tool for predicting the effects of changes to marketplace offerings when observational data do not exist to inform an analysis. This occurs when predicting sales of new products and of product configurations with new attribute levels and combinations, and when existing products exhibit little price variation. Conjoint data reflect simulated choices among a set of competitive products for which respondents are assumed to have well-defined preferences. Analysis is based on a combination of economic and statistical principles that afford inferences and predictions about the expected sales of new offerings.

We argue in this chapter that a valid conjoint analysis requires a valid model and constructs for inference, and valid data. We discuss two economic models for analysis based on random-utility theory, for discrete choice and for volumetric demand, and discuss alternative measures (WTP, WTB, and EPP) of economic value. We demonstrate that these measures are consistently estimated using either conjoint or transaction data, using a conjoint analysis conducted on scanner panelists in the frozen pizza category. Part-worth estimates of product features are shown to be approximately the same, and forecasts of the change in demand for changes in attributes such as price are found to be similar.

Some model parameters, however, are not consistently estimated across stated and revealed preference data. The largest discrepancy involves the baseline brand intercept (β0) for the no-choice option, which is difficult to align because conjoint studies ask respondents to think about the next time they will be making a purchase in the category, while revealed preference data cannot exactly identify these shopping trips.
Shopping trips limited to purchases in the category ignore occasions when shoppers may be in the market but decide not to purchase because prices are not sufficiently low, while the collection of all shopping trips to the store contains instances where consumers are not in the market for items in the category. The difference in the estimated baseline intercept results in a vertical translation of the demand curves and in heightened measures of economic willingness-to-pay.

This chapter demonstrates that a properly designed conjoint study, using valid economic models to analyze the data, can produce accurate estimates of economic demand in the marketplace. We identify practices that should be avoided, and demonstrate that ad-hoc estimates of value, such as the pseudo willingness-to-pay (p-WTP), provide poor estimates of the economic value of product features. Additional research is needed to better design conjoint studies so as to obtain valid estimates of budgetary allotments and of the rate of satiation of purchase quantities.


Technical appendix: Computing expected demand for volumetric conjoint

The algorithm described here first determines the optimal amount of the outside good zht, which then allows the corresponding inside-good quantities xht to be computed. From Eqs. (8) and (9) we have:

pj = uj z   if xj > 0   (A.1)
pj ≥ uj z   if xj = 0   (A.2)

At the optimum, ui/pi = uj/pj for the R goods with non-zero demand. Solving for x yields an equation for the optimal demand quantities of the inside goods:

xk = (ψk z − pk) / (γ pk)   (A.3)

Substituting Eq. (A.3) into the budget constraint (2) yields:

z = (γE + Σ_{k=1}^{R} pk) / (γ + Σ_{k=1}^{R} ψk)   if R > 0   (A.4)
z = E   if R = 0   (A.5)

Re-arranging (A.1) yields the following for z:

z s = pj/ψj   if xj > 0   (A.6)
z s ≤ pj/ψj   if xj = 0   (A.7)

where

s = 1 / (γ xj + 1).

The algorithm needs R iterations to complete. At each step k, we compute the corresponding quantity xk and z as if R = k; checking Eqs. (A.6) and (A.7) then determines whether the breakpoint has been reached. To implement this, let:

ρi = pi/ψi   for 1 ≤ i ≤ K   (A.8)
ρ0 = 0   (A.9)
ρ_{K+1} = ∞   (A.10)

and order the values ρi in ascending order so that ρi ≤ ρ_{i+1} for 1 ≤ i ≤ K. Then z > ρk implies z > ρi for i ≤ k. At the optimum, xi > 0 for 1 ≤ i ≤ k, xi = 0 for k < i ≤ K, and ρk < z ≤ ρ_{k+1}. The algorithm is guaranteed to stop at the optimal z with 0 ≤ k ≤ K. The steps are as follows:

1. a ← γE, b ← γ, k ← 0
2. z ← a/b
3. while z ≤ ρk or z > ρ_{k+1}:
   (a) k ← k + 1
   (b) a ← a + pk
   (c) b ← b + ψk
   (d) z ← a/b

Once the algorithm terminates, we insert the optimal z into Eq. (A.3) to compute the optimal inside-good quantities x.
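A minimal, self-contained sketch of this breakpoint search follows. The ψ, p, γ, and E values are hypothetical, and note the accumulation of prices pk in the numerator, matching Eq. (A.4):

```python
import math

# Sketch of the appendix algorithm with hypothetical psi, p, gamma, and E.
# Goods are sorted by rho_i = p_i / psi_i; step (b) accumulates the price p_k
# of the k-th good, matching the numerator of Eq. (A.4).

def optimal_demand(psi, p, gamma, E):
    K = len(psi)
    order = sorted(range(K), key=lambda i: p[i] / psi[i])
    rho = [0.0] + [p[i] / psi[i] for i in order] + [math.inf]

    a, b, k = gamma * E, gamma, 0
    z = a / b
    # add goods until z falls in the bracket (rho_k, rho_{k+1}]
    while z <= rho[k] or z > rho[k + 1]:
        k += 1
        a += p[order[k - 1]]
        b += psi[order[k - 1]]
        z = a / b

    x = [0.0] * K
    for i in order[:k]:                               # goods with x_i > 0
        x[i] = (psi[i] * z - p[i]) / (gamma * p[i])   # Eq. (A.3)
    return z, x

z, x = optimal_demand([2.0, 1.0], [1.0, 1.0], gamma=1.0, E=10.0)
print(z, x)  # -> 3.0 [5.0, 2.0]; budget check: 1*5 + 1*2 + 3 = 10 = E
```

With these toy values both goods are purchased (k = 2), and the returned quantities satisfy both the first-order conditions and the budget constraint exactly.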



Empirical search and consideration sets✩


Elisabeth Honka a, Ali Hortaçsu b,c,∗, Matthijs Wildenbeest d

a UCLA Anderson School of Management, Los Angeles, CA, United States
b University of Chicago, Chicago, IL, United States
c NBER, Cambridge, MA, United States
d Kelley School of Business, Indiana University, Bloomington, IN, United States
∗ Corresponding author: e-mail address: [email protected]

Contents

1 Introduction
2 Theoretical framework
  2.1 Set-up
  2.2 Search method
    2.2.1 Simultaneous search
    2.2.2 Sequential search
    2.2.3 Discussion
3 Early empirical literature
  3.1 Consideration set literature
    3.1.1 Early 1990s
    3.1.2 Late 1990s and 2000s
    3.1.3 2010s – present
    3.1.4 Identification of unobserved consideration sets
  3.2 Consumer search literature
    3.2.1 Estimation of search costs for homogeneous products
    3.2.2 Estimation of search costs for vertically differentiated products
4 Recent advances: Search and consideration sets
  4.1 Searching for prices
    4.1.1 Mehta et al. (2003)
    4.1.2 Honka (2014)
    4.1.3 Discussion
    4.1.4 De los Santos et al. (2012)
    4.1.5 Discussion
    4.1.6 Honka and Chintagunta (2017)
  4.2 Searching for match values
    4.2.1 Kim et al. (2010) and Kim et al. (2017)
    4.2.2 Moraga-González et al. (2018)
    4.2.3 Other papers

✩ We thank Stephan Seiler and Raluca Ursu for their useful comments and suggestions.

Handbook of the Economics of Marketing, Volume 1, ISSN 2452-2619, https://doi.org/10.1016/bs.hem.2019.05.002
Copyright © 2019 Elsevier B.V. All rights reserved.




5 Testing between search methods
  5.1 De los Santos et al. (2012)
  5.2 Honka and Chintagunta (2017)
6 Current directions
  6.1 Search and learning
  6.2 Search for multiple attributes
  6.3 Advertising and search
  6.4 Search and rankings
  6.5 Information provision
  6.6 Granular search data
  6.7 Search duration
  6.8 Dynamic search
7 Conclusions
References

1 Introduction

“Prices change with varying frequency in all markets, and, unless a market is completely centralized, no one will know all the prices which various sellers (or buyers) quote at any given time. A buyer (or seller) who wishes to ascertain the most favorable price must canvass various sellers (or buyers)—a phenomenon I shall term ‘search’.” (Stigler, 1961, p. 213)

Dating back to the classic work of Stigler (1961), a large literature in economics and marketing documents the presence of substantial price dispersion for similar, even identical, goods. For example, looking across 50,000 consumer products, Hitsch et al. (2017) find that, within a 3-digit zip code, the ratio of the 95th to the 5th percentile of prices for the median UPC (brand) is 1.29 (1.43). Substantial price dispersion has been reported in many different product categories, including automobiles (Zettelmeyer et al., 2006), medical devices (Grennan and Swanson, 2018), financial products (Duffie et al., 2017; Hortaçsu and Syverson, 2004; Ausubel, 1991; Allen et al., 2013), and insurance products (Brown and Goolsbee, 2002; Honka, 2014). Again dating back to Stigler (1961), the presence and persistence of price dispersion for homogeneous goods has often been attributed to search/information costs.

Understanding the nature of search and/or information costs is a crucial step towards quantifying potential losses to consumer and social surplus induced by such frictions, and towards assessing the impact of potential policy interventions to improve market efficiency and welfare. Quantitative analyses of consumer and social welfare rely on empirical estimates of demand and supply parameters and on comparing observed market outcomes to counterfactual efficient benchmarks. However, departures from the assumption that consumers have full information pose important methodological challenges to the demand (and supply) estimation methods that have been the mainstay of quantitative marketing and economics. Consider, for example, a consumer who is observed to purchase a


product for $5 when the identical product is available for purchase at $4 somewhere nearby. A naive analysis may conclude that demand curves are upward sloping or, at the very least, that demand is very inelastic. Similarly, consider the observation that lower-income but high-achieving high school seniors do not apply to selective four-year colleges despite being admitted at high rates (see Hoxby and Turner, 2015). This may be because these seniors do not value a college education, or because they are not aware of the financial aid opportunities or of their chances of admission.

Indeed, as in Stigler (1961), consumers are likely not perfectly informed about both the prices and the non-price attributes of all products available for purchase in a given market. It is therefore important, from the perspective of achieving an accurate understanding of preferences, to gain a deeper understanding of the choice process, especially which subsets of products actually enter a consumer’s “consideration” set1 and how much a consumer knows about price/non-price attributes.

Understanding how consumers search for products and eventually settle on the product they are observed to purchase is the subject of a large and burgeoning literature in economics and marketing. Our focus in this essay is on the econometric literature that allows for the specification of a search process, leading to the formation of consideration sets, along with a model of preferences. In much of this literature, the specification of the search process is motivated by economic models of consumer search. While this constrains the specification of the search process, economic theory provides a useful guidepost as to how consumers may search under counterfactual scenarios that may be very far out of sample.
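The 95th-to-5th percentile price-dispersion ratio cited earlier is simple to compute. The sketch below is purely illustrative (it is not the Hitsch et al. (2017) code, and the prices are invented):

```python
# Illustrative computation, not from Hitsch et al. (2017): the dispersion
# statistic cited above is the ratio of the 95th to the 5th percentile of
# prices posted for the same UPC within a market. Prices below are invented.

def percentile(sorted_xs, q):
    """Linear-interpolation percentile, q in [0, 1]; input must be sorted."""
    pos = q * (len(sorted_xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_xs) - 1)
    return sorted_xs[lo] + (sorted_xs[hi] - sorted_xs[lo]) * (pos - lo)

prices = sorted([3.99, 4.19, 4.29, 4.49, 4.79, 4.99, 5.29, 5.49, 5.79, 6.49])
ratio = percentile(prices, 0.95) / percentile(prices, 0.05)
print(round(ratio, 2))  # dispersion well above 1 even for a single product
```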
We will thus start, in Section 2, with a brief survey of popular theoretical models of consumer search that motivate econometric specifications.2

While this chapter is centered around econometric methods and models, many of these methods and models are motivated by substantive empirical findings. For example, in addition to the price dispersion/information cost argument of Stigler (1961), empirical marketing research in the 1980s found that many consumers effectively choose from (or “consider”) a surprisingly small number of alternatives – usually 2 to 5 – before making a purchase decision (see e.g. Hauser and Wernerfelt, 1990; Roberts and Lattin, 1991; Shocker et al., 1991). This empirical observation sparked a rich stream of literature in marketing that developed and estimated models taking consumers’ consideration into account. We discuss this stream of literature, pioneered by Hauser and Wernerfelt (1990) and Roberts and Lattin (1991), in Section 3.1. One of the main findings from this line of research, which has been validated in more recent work, is that advertising and other promotional activities create very little true consumption utility, but first and foremost affect awareness and consideration.

1 Throughout this chapter, we use the terms “consideration set,” “search set,” “evoked set,” and “(endogenous) choice set” interchangeably unless stated otherwise.
2 Readers interested in much more exhaustive surveys of the theoretical research can refer to Baye et al. (2006) and Anderson and Renault (2018).

We then turn to the early work in economics in Section 3.2, which was primarily motivated by Stigler’s (1961) price dispersion observation and search/information




cost argument. With the proliferation of the Internet around the turn of the century and the increased availability of data, researchers worked on quantifying the amount of price dispersion in online markets as well as quantifying consumer search costs. One of the rather surprising results of these efforts was that the amount of price dispersion remained substantial even in online markets, i.e. prices did not seem to follow the Law of One Price. Starting with Sorensen (2000), Hortaçsu and Syverson (2004), and Hong and Shum (2006), researchers have utilized economic theories of search to rationalize observed price dispersion patterns and to infer search costs and preference parameters. The search cost estimates recovered in these papers appeared relatively large at first sight. However, subsequent work has confirmed that the costs consumers incur while gathering information remain quite high in a variety of markets.

One of the main shortcomings of the consideration set literature discussed in Section 3.1 was that it mostly used reduced-form modeling approaches. One of the main shortcomings of the early search literature in economics discussed in Section 3.2 was that it modeled consumers as randomly picking which alternatives to search. In Section 4 we turn to the more recent literature that aims to overcome both shortcomings by covering a more general setting in which, following the discrete choice additive random utility models popular in demand estimation, products are both vertically and horizontally differentiated. Here an important distinction is made regarding what consumers are searching for: we discuss models in which consumers search for prices in Section 4.1 and models in which consumers search for a good product match or fit in Section 4.2. We also discuss approaches that utilize both individual-level data and aggregate (market share) data.
The papers discussed in this section have in common that they think carefully about what data are needed to identify search costs. They also advance estimation methodology by developing approaches to handle the curse of dimensionality that arises in the simultaneous search model and the search-path dimensionality problem of the sequential search model. However, many of these models are not straightforward to estimate, and more work is needed to obtain models that are both realistic and tractable in terms of estimation.

Since the beginning of the search literature, the question of how consumers search, i.e. whether consumers search in a simultaneous or sequential fashion, has been heavily debated. Because researchers did not think that the search method was identified using observational data, it was common to make an assumption on the type of search protocol consumers were using (frequently driven by computational considerations). Starting with De los Santos et al. (2012) and Honka and Chintagunta (2017), researchers have begun empirically testing observable implications of sequential versus simultaneous search and the broader question of the identifiability of the search method utilized by consumers. This work also highlighted the importance of expectations: which search method is supported by data patterns can change depending on whether researchers assume that consumers have rational expectations. How consumers search also has implications for the estimated search costs (and thus for any subsequent analyses such as welfare calculations): if consumers search simultaneously (sequentially), but the researcher falsely assumes that they search sequentially (simultaneously), search costs will be overestimated (underestimated).


In Section 6, we discuss various extensions and applications of the econometric frameworks discussed in the prior sections. Section 6.1 explores generalizations of the modeling framework when consumers are not perfectly informed regarding the distribution of prices and/or match utilities and learn about these distributions as they search. Section 6.3 discusses how advertising interacts with search and choice. Section 6.4 discusses the very related setting where the ranking and/or spatial placement of different choices on for instance a webpage affect search and eventual choice. Section 6.5 considers an interesting emerging literature on the issue of information provision made available to consumers at different stages of search, e.g. at different layers of a website. Sections 6.6 and 6.7 discuss how the availability of more granular information on consumer behavior such as search duration can improve inference/testing regarding the search method and preferences. This is clearly an important area of growth for the literature as consumer actions online and, increasingly, offline are being monitored closely by firms. Finally, Section 6.8 discusses the important case in which dynamic demand considerations (such as consumer stock-piling) interact with consumer search. The econometric literature on consumer search and consideration sets is likely to grow much further beyond what is covered here as more and more data on the processes leading to eventual purchase become available for study. We therefore hope our readers will find what is to follow a useful roadmap into what has been done so far, but that they will ultimately agree with us that there are many more interesting questions to answer in this area than has been attempted so far.

2 Theoretical framework 2.1 Set-up We start by presenting the general framework of search models.3 In these models, consumers are utility maximizers. Consumers know the values of all product characteristics but one (usually price or match value) prior to searching and have to engage in search to resolve uncertainty about the value of that one product characteristic. Search is costly so consumers only search a subset of all available products. Formally, we denote consumers by i = 1, . . . , N , firms/products by j = 1, . . . , J , and time periods by t = 1, . . . , T . Consumer i’s search cost (per search) for product j is denoted by cij and the number of searches consumer i makes is denoted by ki = 1, . . . , K with K = |J |. Firm j ’s marginal cost is denoted by rj . Consumer i’s

3 The two most common types of search models are price search models and match value search models. In the former, consumer search to resolve uncertainty about prices, while in the latter consumers search to resolve uncertainty about the match value or fit. The set-up of both price and match value search models fits under the general framework presented in this section. The set-up of both types of search models is identical with one exception denoted in footnote 4. However, the estimation approaches differ as discussed in Section 4.



CHAPTER 4 Empirical search and consideration sets

indirect utility for product j is an independent draw uij from a distribution Fj with density fj where uij is given by uij = αj + Xij β + γpij + ij .


The parameters α_j are brand intercepts, X_ij represents observable product and/or consumer characteristics, p_ij is the price, and ε_ij is the part of utility not observed by the researcher.4 The parameters α_j, β, and γ are the utility parameters. Although the framework is constructed to analyze differentiated goods, it can also capture special cases such as a price search model for homogeneous goods with identical consumers and firms. In this case, Eq. (1) simplifies to u_ij = −p_ij with F_j = F.

We start by going through the set of assumptions that most search models share.

Assumption 1. Demand is unit-inelastic.

In other words, each consumer buys at most one unit of the good. Assumption 1 holds for all papers discussed in this chapter.

Assumption 2. Prior to searching, consumers know the (true) utility distribution, but not the specific utility a firm is going to offer on a purchase occasion.

To put it differently, while consumer i does not know the specific utility u_ij he would get from buying from firm j (potentially on a specific purchase occasion), he knows the shape and parameters of the utility distribution, i.e. the consumer knows F_j. This assumption is often referred to as the "rational expectations assumption" because it is assumed that consumers are rational and know the true distribution from which utilities are being drawn. We relax Assumption 2 in Section 6.1 and discuss papers which have studied consumer search in an environment in which consumers are uncertain about the utility distribution and learn about it while searching.

Assumption 3. All uncertainty regarding the utility from consuming a specific product is resolved during a single search.

In other words, consumers learn a product's utility in a single search. In Section 6.7, we present recent work that relaxes Assumption 3.

Assumption 4. The first search is free.

This assumption is made for technical reasons. Two alternative assumptions are sometimes made in the literature: (search costs are sufficiently low so that) all consumers search at least once (e.g. Reinganum, 1979), or all consumers use the same search method, but it is optimal for some consumers to search/not to search depending on the level of their search costs (e.g. Janssen et al., 2005).

4 In price search models, ε_ij is observed by the consumer both prior to and after search. In match value search models, consumers do not know ε_ij prior to search, but know its value after searching. In both price and match value search models, the researcher observes ε_ij neither prior to nor after search.


2.2 Search method

The consumer search literature has predominantly focused on two search methods: simultaneous and sequential search. In this subsection, we introduce these two search methods.

2.2.1 Simultaneous search

Simultaneous search – also referred to as non-sequential, fixed sample, or parallel search – is a search method in which consumers commit to searching a fixed set of products (or stores or firms) before they begin searching. Consumers using this method will not stop searching until they have searched all firms in their predetermined search set. Note that – despite its name – simultaneous search does not mean that all firms have to be sampled simultaneously. Firms can be sampled one after another. What characterizes simultaneous search is the consumer's commitment (prior to the beginning of the search process) to search a fixed set of firms. In the most general case, consumer i's search problem consists of picking the subset of firms S_i that maximizes the expected maximum utility to consumer i from searching that subset of firms net of search costs, i.e.,

S_i = arg max_S [ E( max_{j∈S} u_ij ) − Σ_{j∈S} c_ij ],   (2)

where E denotes the expectation operator.

Unfortunately, a simple solution for how a consumer should optimally pick the firms to be included in his search set S_i does not exist for the simultaneous search model. In general, it will not be optimal to search the firms randomly, so the question is: which firms should the consumer search? This is referred to as ordered or directed search in the literature.5 When searching simultaneously, to pick the optimal set of products to be searched, the consumer has to enumerate all possible search sets (varying in size and composition) and calculate the corresponding expected gains from search while taking the cost of sampling all products in the search set into account, i.e., calculate the expected maximum utility minus the cost of searching for every search set (as in Eq. (2)). The following example illustrates the problem. Suppose there are four companies A, B, C, and D in the market. Then the consumer has to choose among the following search sets: A, B, C, D, AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, and ABCD. The difficulty with this approach is that the number of possible search sets grows exponentially with the number of firms |J|, i.e. if there are |J| firms in the market, the consumer chooses among 2^|J| − 1 search sets.6 This exponential growth in the number of search sets is referred to as the curse of dimensionality of the simultaneous search model.

One avenue to deal with the curse of dimensionality is to estimate the simultaneous search model only for markets with relatively few products (see e.g. Mehta et al., 2003). Another avenue is to make an additional assumption which allows one to derive a simple strategy for how the consumer should optimally choose his search set. The following two assumptions have been used in the literature:

1. Assumption of first-order stochastic dominance: Vishwanath (1992) showed that, for a simultaneous search model with first-order stochastic dominance among the utility distributions, the rules derived by Weitzman (1979) constitute optimal consumer search and purchase behavior.7
2. Assumption of second-order stochastic dominance:8 Chade and Smith (2005) showed that, for a simultaneous search model with second-order stochastic dominance among the utility distributions, it is optimal for the consumer to
a. rank firms in decreasing order of their expected utilities,
b. pick the optimal number of searches conditional on the ranking according to the expected utilities, and
c. purchase from the firm with the highest utility among those searched.

The assumptions of first- or second-order stochastic dominance are typically implemented by assuming that the means or variances, respectively, of the price distributions are identical (see e.g. Honka, 2014). While adding an assumption restricts the flexibility of a model, it allows researchers to apply the simultaneous search model to markets with a large number of products. Furthermore, the appropriateness of these assumptions is empirically testable, i.e. using price data, researchers can test the hypothesis of identical means or variances across products.

5 See Armstrong (2017) for an overview of the theoretical literature on ordered search. 6 The researcher can reduce the number of search sets the consumer is choosing from by dropping all search sets that do not include the consumer's chosen (purchased) option (see Mehta et al., 2003).
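To make the enumeration concrete, the following sketch (our own illustration, not from the chapter) evaluates Eq. (2) by brute force for a hypothetical four-firm market. It assumes independent Gumbel-distributed utilities with known location parameters `means`, for which the expected maximum over a set S has the closed form ln Σ_{j∈S} exp(μ_j) plus the Euler–Mascheroni constant:

```python
import itertools
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def best_search_set(means, cost):
    """Brute-force Eq. (2): enumerate all 2^J - 1 non-empty subsets,
    score each by E[max utility] minus total search cost, keep the best.
    Assumes independent Gumbel utilities with locations `means` and a
    common per-search cost `cost` (both hypothetical inputs)."""
    firms = range(len(means))
    best, best_val, n_sets = None, -math.inf, 0
    for size in range(1, len(means) + 1):
        for subset in itertools.combinations(firms, size):
            n_sets += 1
            # closed-form expected maximum for independent Gumbel draws
            emax = math.log(sum(math.exp(means[j]) for j in subset)) + EULER_GAMMA
            val = emax - cost * len(subset)
            if val > best_val:
                best, best_val = set(subset), val
    return best, best_val, n_sets

best, val, n = best_search_set([1.0, 0.8, 0.5, 0.2], cost=0.3)
print(n)  # -> 15 candidate search sets for J = 4
```

Even in this tiny example the consumer compares 15 sets; doubling the number of firms to 8 already requires 255 comparisons, which is the curse of dimensionality in miniature.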
In the special case of homogeneous goods with identical firms, the search problem reduces to choosing the optimal number of products to search and the dimensionality problem disappears.9 The simultaneous search model for homogeneous goods was initially proposed by Stigler (1961). In Stigler's model, the consumer has to decide which set of firms to search. Since firms are identical, i.e., F_j = F, in the setting he analyzes, it is optimal for the consumer to randomly pick the firms to be searched. Therefore, the only objective of the consumer is to determine the optimal number of firms to search. Since goods are homogeneous in Stigler's model, the utility function in Eq. (1) simplifies to u_ij = −p_ij. The consumer's objective is to minimize his cost of acquiring the good, i.e. to minimize the sum of the expected price paid and his search costs. Formally, a consumer's objective function can be written as

min_k ∫_{p_min}^{p_max} k p (1 − F(p))^{k−1} f(p) dp + (k − 1) c,   (3)

where the integral is the expected minimum price for k searches, (k − 1)c is the search cost, and F(p) is the price distribution with minimum price p_min and maximum price p_max. The intuition behind the expression for the expected minimum price in Eq. (3) is as follows: the probability that all k price draws are greater than p is given by Pr(p_1 > p, . . . , p_k > p) = (1 − F(p))^k. This implies that the cdf of the minimum draw is 1 − (1 − F(p))^k and the pdf of the minimum draw is k(1 − F(p))^{k−1} f(p). It can be shown that there is a unique optimal number of searches k* that minimizes Eq. (3) (see e.g. Hong and Shum, 2006). This optimal number of searches k* is the size of the consumer's search set.

7 See Section 2.2.2 for a detailed discussion of the rules derived by Weitzman (1979). 8 Additionally, to apply the theory developed by Chade and Smith (2005), the researcher also has to assume that search costs are not company-specific. 9 This discussion and these results hold both when search costs are identical across consumers and when search costs are heterogeneous across consumers. They do not hold when there is heterogeneity in search costs across firms.
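As a worked illustration of Eq. (3) (our own sketch, assuming a Uniform(0, 1) price distribution), the expected minimum of k price draws is 1/(k + 1), so k* can be found by direct minimization:

```python
def expected_total_cost(k, c):
    """Objective in Eq. (3) for prices ~ Uniform(0, 1): the expected
    minimum of k draws is 1 / (k + 1), and the first search is free
    (Assumption 4), so total search costs are (k - 1) * c."""
    return 1.0 / (k + 1) + (k - 1) * c

def optimal_num_searches(c, k_max=1000):
    """Search set size k* that minimizes the Eq. (3) objective."""
    return min(range(1, k_max + 1), key=lambda k: expected_total_cost(k, c))

print(optimal_num_searches(0.01))  # -> 9
print(optimal_num_searches(0.04))  # -> 4
```

As expected, lower search costs lead to a larger optimal search set: halving c roughly increases k* by a factor of √2 in this uniform-price case.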

2.2.2 Sequential search

A main drawback of the simultaneous search method is that it assumes that the consumer will continue searching even after getting a high utility realization early in the search process. For example, consider a simultaneous search in which a consumer commits to searching three firms and in the first search he gets quoted the maximum possible utility. Because of Assumption 2, the consumer knows that this is the highest possible utility, so it would not be optimal to continue searching. To address this drawback, the sequential search method has been developed. When searching sequentially, consumers determine, after each utility realization, whether to continue searching or to stop.

Before we discuss the sequential search model in detail, we have to add another technical assumption to the list of assumptions laid out in Section 2.1:

Assumption 5. Consumers have perfect recall.

In other words, once a consumer has searched a firm, he remembers the utility offered by this firm going forward. This assumption is equivalent to assuming that a consumer can costlessly revisit stores already searched. In Section 6.7, we present recent work that relaxes Assumption 5.

The sequential search problem in its most general form has been analyzed by Weitzman (1979). The problem of searching for the best outcome from a set of options that are independently distributed can be stated as the following dynamic programming problem:

W(ũ_i, S̄_i) = max{ ũ_i, max_{j∈S̄_i} [ −c_ij + F_j(ũ_i) W(ũ_i, S̄_i − {j}) + ∫_{ũ_i}^{∞} W(u, S̄_i − {j}) f_j(u) du ] }

where ũ_i is consumer i's highest utility sampled so far and S̄_i is the set of firms consumer i has not searched yet. Weitzman (1979) shows that the solution to this problem can be stated in terms of J static optimization problems. Specifically, for each product j, consumer i derives a reservation utility z_ij. This reservation utility z_ij equates the benefit and cost of searching product j, i.e.,

c_ij = ∫_{z_ij}^{∞} (u − z_ij) f_j(u) du.

This consumer- and product-specific reservation utility z_ij can then be used to determine the order in which products should be searched as well as when to stop searching. Specifically, Weitzman (1979) shows that it is optimal for a consumer to follow three rules:
1. search companies in decreasing order of their reservation utilities ("selection rule"),
2. stop searching when the maximum utility among the searched firms is higher than the largest reservation utility among the not-yet-searched firms ("stopping rule"), and
3. purchase from the firm with the highest utility among those searched ("choice rule").

It is important to note that the consumer will not always purchase from the firm searched last. What follows is an example that shows when this can happen. In Table 1, we show the reservation utilities and the utilities (which the consumer only knows after searching) for three firms.

Table 1 Example.
Option   Reservation utility (z_ij)   Utility (u_ij)
A        14                           11
B        12                           7
C        10                           9

Given Weitzman's (1979) selection rule, the consumer searches the firms in decreasing order of their reservation utilities. The consumer first searches firm A and learns that the utility is 11. Using the stopping rule, the consumer determines that the maximum utility among the searched firms (11) is smaller than the largest reservation utility among the not-yet-searched firms (12) and thus decides to continue searching. In the second search, the consumer searches firm B and learns that the utility is 7. Using the stopping rule, the consumer determines that the maximum utility among the searched firms (11, from firm A) is higher than the largest reservation utility among the not-yet-searched firms (10, for firm C) and thus decides to stop searching. The consumer then purchases from the firm with the highest utility among those searched – firm A. Note that firm A is the firm the consumer searched in his first and not in his second (and last) search.
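The three rules can be coded directly. The sketch below (our own illustration) replays the Table 1 example, taking the reservation utilities and realized utilities as given inputs:

```python
def weitzman_search(reservations, utilities):
    """Apply Weitzman's (1979) selection, stopping, and choice rules.
    `reservations` maps option -> z_ij, `utilities` maps option -> u_ij
    (the latter is revealed to the consumer only upon search)."""
    # selection rule: visit options in decreasing order of z_ij
    order = sorted(reservations, key=reservations.get, reverse=True)
    searched, best = [], None
    for idx, option in enumerate(order):
        searched.append(option)
        if best is None or utilities[option] > utilities[best]:
            best = option
        remaining = order[idx + 1:]
        # stopping rule: stop once the best sampled utility beats
        # every remaining reservation utility
        if not remaining or utilities[best] >= max(reservations[o] for o in remaining):
            break
    # choice rule: purchase the best sampled option
    return searched, best

z = {"A": 14, "B": 12, "C": 10}  # Table 1 reservation utilities
u = {"A": 11, "B": 7, "C": 9}    # Table 1 realized utilities
print(weitzman_search(z, u))     # -> (['A', 'B'], 'A')
```

The run reproduces the narrative: the consumer searches A then B, never searches C, and purchases from A even though B was searched last.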


In the special case of homogeneous products and identical firms (i.e., u_ij = −p_ij and F_j = F), just like the simultaneous search model, the sequential search model greatly simplifies.10 Because firms are identical, the consumer randomly picks a firm to search. As in the more general case, the consumer needs to solve an optimal stopping problem, i.e. the problem of balancing the benefit of further search with the cost of searching. Following McCall (1970), the first-order condition for the optimal stopping problem is given by

c = ∫_{p_min}^{z} (z − p) f(p) dp,   (4)

where z is the lowest price found in the search so far, the left-hand side c is the marginal cost of an additional search, and the right-hand side is its marginal benefit. According to Eq. (4), a consumer is indifferent between continuing to search and stopping when the marginal cost of an additional search equals the marginal benefit of performing an additional search given the lowest price found so far. A consumer thus searches as long as the marginal benefit from searching is greater than the marginal cost of searching. The marginal benefit in Eq. (4) is the expected savings from an additional search given the lowest price found so far. Eq. (4) implies that there is a unique price z* for which the marginal cost of searching equals the marginal benefit of searching. This unique price z* is the reservation price. Note that z* is a function of the consumer's search cost c.

We can now describe the consumer's decision rule: if the consumer gets a price draw above his reservation price, i.e. p > z*, he continues to search. If he gets a price draw below his reservation price, i.e. p ≤ z*, he stops searching and purchases. Note that in the case that all firms are identical (homogeneous good) and consumer search costs are identical across all firms, a consumer has a single (constant) reservation price z* (for all firms). The consumer stops searching after receiving the first price below his reservation price and makes a purchase. Thus the consumer always purchases from the firm searched last in such a setting.
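For a concrete case (our own sketch, assuming prices are Uniform(0, 1)), the marginal-benefit integral in Eq. (4) evaluates to z²/2, so the reservation price has the closed form z* = √(2c):

```python
import math

def reservation_price(c):
    """Solve Eq. (4), c = integral_0^z (z - p) f(p) dp, for prices
    ~ Uniform(0, 1). The integral equals z**2 / 2, so z* = sqrt(2 c)."""
    return math.sqrt(2 * c)

def should_stop(price_draw, c):
    """Decision rule: stop and purchase iff the draw is at or below z*."""
    return price_draw <= reservation_price(c)

print(round(reservation_price(0.02), 3))  # -> 0.2
print(should_stop(0.15, 0.02))            # -> True: draw is below z*, so buy
```

The comparative static is immediate: z* increases in c, i.e. a consumer with higher search costs settles for worse prices.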

2.2.3 Discussion

A question that often comes up is whether one search method is "better" than the other, i.e. whether consumers should always search simultaneously or always search sequentially. While in many settings searching sequentially is better for consumers because they can stop searching at any point in time if they get a good draw early on, Morgan and Manning (1985) showed that simultaneous search can be better than sequential search when a consumer is close to a deadline, i.e. needs to gather information quickly. Chade and Smith (2006) and Kircher (2009) further found that simultaneous search is better when the other side of the market might reject the individual (e.g. students searching for higher education by submitting college applications). It is important to note that neither simultaneous nor sequential search by itself is the best search method for consumers. Morgan and Manning (1985) show that a combination of both simultaneous and sequential search (at the various stages of search) dominates either pure simultaneous or pure sequential search. In an experimental study, Harrison and Morgan (1990) make a direct comparison between such a hybrid search strategy and either simultaneous or sequential search strategies and find that their experimental subjects use the least restrictive search strategy if they are allowed to do so.

10 This discussion and these results hold both when search costs are identical across consumers and when search costs are heterogeneous across consumers. They do not hold when there is heterogeneity in search costs across firms.

3 Early empirical literature

3.1 Consideration set literature

The marketing literature has long recognized that consumers may not consider all products in a purchase situation. Consideration has been viewed as one of (commonly) three stages (awareness, consideration, and choice/purchase) in a consumer's purchase process.11 However, the marketing literature has varied in the approaches used to view and model consideration. We structure our discussion of this stream of literature in chronological order and review three groups of papers: from the early 1990s, from the late 1990s and 2000s, and more recent work from the 2010s.12 This chronological structure is not driven by time itself, but rather by papers written around the same time sharing common themes. For example, the papers from the early 1990s are rooted in the economic search literature, while the papers from the late 1990s and 2000s employ more descriptive, statistical models. The last group of papers from the 2010s contains a more diverse set of papers that, for example, use experimentation in support of statistical modeling or study under which circumstances unobserved consideration sets can be identified.

Before we dive into the details, it is important to note why consideration matters. When consumers have limited information, i.e. only consider a subset of all available products for purchase, and this limited information is not accounted for in the model and estimation, preference estimates will be biased. Since preference estimates are used to calculate (price) elasticities, make recommendations on the employment of marketing mix elements, draw conclusions about the competitiveness of a market, etc., biased preference estimates might result in the wrong conclusions. This is a point that has been made consistently throughout this stream of literature.

11 These three stages are also sometimes referred to as the "purchase funnel." 12 Roberts and Lattin (1997) provide an overview of the marketing literature on consideration sets between 1990 and 1997.


3.1.1 Early 1990s

This group of papers contains work by Hauser and Wernerfelt (1990) and Roberts and Lattin (1991). Both papers base their approaches on the consumer search literature.13 Hauser and Wernerfelt (1990) propose that consumers search sequentially to add (differentiated) products to their consideration sets.14 Through search, consumers resolve uncertainty about net product utility, i.e. utility minus price. The authors then discuss aggregate, market-level implications of their model such as order-of-entry penalties and competitive promotion intensity. Hauser and Wernerfelt (1990) also provide an overview table with mean or median consideration set sizes from previously published studies and the Assessor database for a variety of product categories. They find most consideration sets to include 3 to 5 brands.

Roberts and Lattin (1991) develop a simultaneous search model in which consumers consider a brand as long as the utility of that brand is above an individual-specific utility threshold. To make the model estimable, the authors include a misspecification error. They calibrate their model using survey data from the ready-to-eat cereal market containing information on consumers' considerations and purchases. Roberts and Lattin (1991) find a median consideration set size of 14 brands.15 Lastly, the authors compare the predictive ability of their two-stage model to several benchmark models.

3.1.2 Late 1990s and 2000s

This group of papers contains a body of work that focuses on more descriptive, statistical models. The main characteristics can be summarized as follows: first, consideration is neither viewed nor modeled as driven by uncertainty about a specific product attribute, e.g. price or match value. And second, there is limited empirical consensus on the drivers of consideration. Commonly, consideration is modeled as a function of marketing mix variables (advertising, display, feature, and, to a lesser extent, price). For example, Allenby and Ginter (1995) and Zhang (2006) estimate the effects of feature and display on consideration; Andrews and Srinivasan (1995) model consideration as a function of loyalty, advertising, feature, display, and price; Bronnenberg and Vanhonacker (1996) model consideration (saliency) as a function of promotion, shelf space, and recency (among others); Ching et al. (2009) model consideration as a function of price; and Goeree (2008) and Terui et al. (2011) model consideration as a function of advertising.

13 Ratchford (1980) develops a simultaneous search model for differentiated goods in which consumers have uncertainty about prices and other product attributes and estimates the gains to searching using data on four household appliances. However, search is not explicitly connected to consideration in this paper. 14 Hauser and Wernerfelt (1990) also propose that, conditional on consideration, consumers pick a smaller subset of products that they evaluate for purchase. This evaluation is costly to consumers and consumers form the smaller subset for purchase evaluation using a simultaneous search step. 15 Roberts and Lattin (1991) explain the larger consideration set sizes by clarifying that they equate consideration with awareness and that aided awareness was used to elicit the considered brands.




In the following, we discuss three aspects of this group of papers: modeling, decision rules, and data together with empirical results. Two approaches to modeling consideration emerge: one in which the probability of a consideration set is modeled (e.g. Andrews and Srinivasan, 1995; Chiang et al., 1999) and a second one in which the probability of a specific product being considered is modeled (e.g. Siddarth et al., 1995; Bronnenberg and Vanhonacker, 1996; Goeree, 2008). The papers also vary in terms of the specific model being estimated, ranging from a heteroscedastic logit model (Allenby and Ginter, 1995), through dogit models (e.g. Siddarth et al., 1995) and (utility) threshold models (e.g. Siddarth et al., 1995; Andrews and Srinivasan, 1995), to aggregate random coefficients logit demand models based on Berry et al. (1995) (e.g. Goeree, 2008).16,17

Most of the consideration set papers published during this time period assume that consumers use compensatory decision rules, i.e. a product can "compensate" for a very poor value in one attribute with a very good value in another attribute. However, a smaller set of papers models consideration using non-compensatory rules, i.e. a product attribute has to meet a certain criterion for the product to be considered and/or chosen (see also Aribarg et al., 2018). Non-compensatory rules are often proposed based on bounded rationality arguments, often in the form of decision heuristics. For example, Fader and McAlister (1990) develop an elimination-by-aspects model in which consumers screen brands depending on whether these brands are on promotion. Gilbride and Allenby (2004) propose a model that can accommodate several screening rules for consideration: conjunctive, disjunctive, and compensatory.
While Fader and McAlister (1990) find that their elimination-by-aspects model and a compensatory model fit the data similarly (but result in different preference estimates), Gilbride and Allenby (2004) report that their conjunctive model fits the data best. In general, identification of compensatory and non-compensatory decision rules with non-experimental data is very difficult (see also Aribarg et al., 2018). Due to data restrictions, all the models suggested in the before-mentioned papers are estimated using choice data alone (usually supermarket scanner panel data). Therefore, identification of consideration comes from functional form, i.e. nonlinearities of the model, and modeling approaches are assessed based on model fit criteria.

Few of the before-mentioned papers report predicted consideration set sizes. Two exceptions are Siddarth et al. (1995) and Bronnenberg and Vanhonacker (1996): the former paper reports that the average predicted consideration set includes 4.2 brands, while the latter predicts that the average consideration set for loyal and non-loyal customers includes 1.5 and 2.8 brands, respectively. And lastly, most papers find that marketing mix variables affect consideration rather than purchase. For example, Terui et al. (2011) find advertising to affect consideration, but not purchase.

16 In a dogit model, a consumer probabilistically chooses from either the full set of alternatives or from a considered set of alternatives. 17 Threshold models such as in Siddarth et al. (1995) and Andrews and Srinivasan (1995) assume that a consumer's utility for a product has to be above a certain value for the product to be in the consideration set. In contrast, models using non-compensatory rules assume that one or more attributes of a product have to have a value above or below a certain threshold for the product to be in the consideration set.

3.1.3 2010s – present

This group contains a more diverse set of papers that, for example, use experimentation in support of statistical modeling, study under which circumstances unobserved consideration sets can be identified, or investigate preference heterogeneity estimates. Van Nierop et al. (2010) estimate a consideration and purchase model in which advertising affects the formation of consideration sets, but does not affect preferences. The authors combine scanner panel data with experimental data to show that consideration sets can be reliably recovered using choice data only and that feature and display affect consideration in their empirical context. And lastly, as discussed at the beginning of this section, not accounting for consumers' limited information results in biased preference estimates (e.g. Bronnenberg and Vanhonacker, 1996; Chiang et al., 1999; Goeree, 2008; De los Santos et al., 2012; Honka, 2014). Two papers, Chiang et al. (1999) and Dong et al. (2018), put a special focus on preference heterogeneity estimates under full and limited information. Both papers find that the amount of preference heterogeneity is overestimated if consumers' limited information is not accounted for.

3.1.4 Identification of unobserved consideration sets

An important identification question in the consideration set literature as well as the search literature is whether changes in demand originate from shifts in consideration or shifts in utility. In many empirical settings the researcher does not have access to data on consideration, or may not have access to an instrument that can be excluded from utility or consideration, which makes it important to know to what extent a consideration set model can still be separately identified from a full information model. Abaluck and Adams (2018) show that consideration set probabilities are identified from asymmetries of cross-derivatives of choice probabilities with respect to attributes of competing products. This means that, for identification, it is not necessary to use data on consideration sets or to assume that there are characteristics that affect consideration set probabilities but do not appear in the utility function. In a model in which consumers have full information, consumers will consider all available options. The full consideration assumption implies that there is a symmetry in cross-derivatives with respect to one or more characteristics of the products: a consumer will be equally responsive to a change in the price of product A as to a similar change in the price of product B. However, under certain assumptions it can be shown that this symmetry breaks down when a change in a characteristic of a product changes the consideration set probability of that product.

Abaluck and Adams (2018) provide formal identification results for two classes of consideration set models: the "Default-Specific Consideration" (DSC) model and the "Alternative-Specific Consideration" (ASC) model. The DSC model fits into a rational inattention framework and assumes that the probability of considering options other than a default option only depends on the characteristics of the default option. The ASC model assumes that the probability of considering an option only depends on the characteristics of that good. In most theoretical search models, the probability of considering an option depends on the characteristics of all goods, which means that conventional search models do not fit in either the DSC or ASC framework. Even though this implies that their formal identification results do not apply directly to search models, Abaluck and Adams (2018) do suggest that cross-derivative asymmetries remain a source of identifying power for consideration probabilities in more complicated consideration set models in which the consideration probability for one good depends on the consideration probability for another good. Whether this indeed implies that conventional search models can be identified using asymmetric demand responses only is not formally shown, however, and remains an important area for future research.

In a related paper, Crawford et al. (2018) show how to estimate preferences that are consistent with unobserved, heterogeneous consumer choice sets using the idea of sufficient sets. These sufficient sets are subsets of the unobserved choice sets and can be operationalized as products purchased by the consumer in the past or products contemporaneously purchased by other similar consumers. Kawaguchi et al. (2018) focus on advertising effectiveness and show how to use variation in product availability as a source of identification in consideration set models.
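The cross-derivative symmetry that full consideration implies can be checked numerically. The sketch below (our own illustration, using a standard multinomial logit as the full-information benchmark, with assumed utility indices `v`) confirms that the response of option j's choice probability to option k's utility index equals the response of option k's probability to option j's index:

```python
import math

def logit_probs(v):
    """Multinomial logit choice probabilities for utility indices v."""
    denom = sum(math.exp(x) for x in v)
    return [math.exp(x) / denom for x in v]

def cross_derivative(v, j, k, eps=1e-6):
    """Central finite difference of dP_j / dv_k under full consideration.
    For the logit this equals -P_j * P_k when j != k, which is
    symmetric in (j, k)."""
    up = list(v); up[k] += eps
    dn = list(v); dn[k] -= eps
    return (logit_probs(up)[j] - logit_probs(dn)[j]) / (2 * eps)

v = [1.0, 0.5, -0.2]              # hypothetical utility indices for A, B, C
d_ab = cross_derivative(v, 0, 1)  # response of P_A to B's index
d_ba = cross_derivative(v, 1, 0)  # response of P_B to A's index
print(abs(d_ab - d_ba) < 1e-8)    # -> True: symmetric under full consideration
```

In a consideration set model, a change in B's characteristics can also shift the probability that B is considered at all, which generally breaks this equality; that asymmetry is the source of identification exploited by Abaluck and Adams (2018).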

3.2 Consumer search literature

The early empirical literature in economics initially focused on documenting price dispersion and on testing comparative statics results derived from theoretical search models. For instance, Sorensen (2000) examines retail prices for prescription drugs and finds substantial price variation, even after controlling for differences among pharmacies. In addition, he finds evidence that prices and price dispersion are lower for prescriptions that are purchased more frequently. This finding is consistent with search theory, since the gains from search are higher for frequently purchased products. Driven by the rise of e-commerce around the turn of the millennium, subsequent work focused on price dispersion for goods sold on the Internet and on how online prices compared to prices in traditional brick-and-mortar stores. Most of these studies found substantial price dispersion for products sold online, despite the popular belief at the time that online comparison shopping would lead to Bertrand pricing. For instance, Clay et al. (2001) find considerable heterogeneity in pricing strategies for online bookstores, and Clemons et al. (2002) report that online travel agents charge substantially different prices, even when given the same customer request.

Starting with Hortaçsu and Syverson (2004) and Hong and Shum (2006), the literature began to move away from a reduced-form approach towards more structural modeling. The idea was to use the structure of a theoretical search model to back out the search cost distribution from observed market data such as prices and quantities sold. In this section, we discuss both papers as well as several other studies that build


on these papers. Hong and Shum (2006) focus on the estimation of homogeneous goods search models, whereas Hortaçsu and Syverson (2004) allow for vertical product differentiation. In both papers, however, consumers search randomly across firms. This feature distinguishes these two papers from more recent contributions in which consumers search among a combination of horizontally and vertically differentiated firms, which gives consumers an incentive to search in an ordered way.

3.2.1 Estimation of search costs for homogeneous products

Hong and Shum (2006) develop methods to estimate search cost distributions for both simultaneous and sequential search models using only price data. An attractive feature of their simultaneous search model, which is based on Burdett and Judd (1983), is that search costs can be non-parametrically identified using price data only. To identify search costs in their sequential search model, parametric assumptions are needed. Since most of the follow-up literature has focused on simultaneous search, we briefly describe the essence of their estimation method for that model. The main idea is to use the equilibrium restrictions of the theoretical search model together with observed prices to back out the search cost distribution that is consistent with the model. As in Burdett and Judd (1983), firms are assumed to be homogeneous. Price dispersion emerges as a symmetric mixed-strategy Nash equilibrium: firms have an incentive to set lower prices to attract consumers who are searching, but at the same time face an incentive to set higher prices to extract surplus from consumers who are not searching. By playing a mixed strategy in prices according to a distribution F(p), firms balance these two forces. Given such a price distribution F(p), a firm's profit when setting a price p is given by

Π(p) = (p − r) Σ_{k=1}^{∞} q_k (1 − F(p))^{k−1},

where q_k is the share of consumers who search k times and r is the firm's unit cost. The mixed-strategy equilibrium requires firms to be indifferent between setting any price in the support of the price distribution, which results in the following equilibrium profit condition:

(p̄ − r) q_1 = (p − r) Σ_{k=1}^{∞} q_k (1 − F(p))^{k−1},   (5)

where the expression on the left-hand side of this equation is the profit when setting a price equal to the upper bound p̄ of the price distribution F(p). Eq. (5) has to hold for any observed price that is consistent with this equilibrium condition, i.e.,

(p̄ − r) q_1 = (p_i − r) Σ_{k=1}^{K} q_k (1 − F(p_i))^{k−1},   i = 1, …, n − 1,   (6)




where K is the maximum number of firms from which a consumer obtains price quotes and n is the number of price observations. Since Eq. (6) implies n − 1 equations and K unknowns (the q_k sum to one), this system can be solved for the unknowns {r, q_1, …, q_K} as long as K < n − 1. Hong and Shum (2006) develop a Maximum Empirical Likelihood (MEL) approach to do so. To obtain a non-parametric estimate of the search cost distribution, the estimates of the q_k's can then be combined with estimates of the critical search cost values Δ_i, which are given by

Δ_i = E[p_{1:i}] − E[p_{1:i+1}],   i = 1, 2, …, n − 1,

where p_{1:i} is the lowest price out of i draws from the price distribution F(p). To illustrate their empirical approach, Hong and Shum (2006) use online prices for four economics and statistics textbooks for model estimation. The estimates of their nonsequential search model indicate that roughly half of consumers do not search beyond the initial free price quote.

Moraga-González and Wildenbeest (2008) extend Hong and Shum's (2006) approach to the case of oligopoly. Besides allowing for a finite number of firms instead of infinitely many, the model is similar to the simultaneous search model in Hong and Shum (2006). However, instead of using Eq. (6) and a MEL approach, they use a maximum likelihood (MLE) procedure. Specifically, Moraga-González and Wildenbeest (2008) solve the first-order condition for the equilibrium price density, which is then used to construct the likelihood function. The density function is given by

f(p) = [Σ_{k=1}^{N} k q_k (1 − F(p))^{k−1}] / [(p − r) Σ_{k=2}^{N} k(k − 1) q_k (1 − F(p))^{k−2}],

where N is the number of firms and F(p) solves Eq. (6) for K = N. The log-likelihood function is then

LL = Σ_{i=1}^{n} log f(p_i; q_1, …, q_N),

where the parameters to be estimated are the shares q_k of consumers searching k times. Moraga-González and Wildenbeest (2008) estimate the model using online price data for computer memory chips and find that, even though a small share of consumers of around ten percent searches quite intensively, the vast majority of consumers does not obtain more than three price quotes. Moreover, estimates of average price-cost margins indicate that market power is substantial, despite more than twenty stores operating in this market. Although MEL has some desirable properties, such as requiring fewer assumptions regarding the underlying distribution, estimating the model using MEL requires solving a computationally demanding high-dimensional constrained optimization problem, which may fail to converge when the number of search cost parameters is large. Indeed, Moraga-González and Wildenbeest (2008) compare the two approaches in a


Monte Carlo study and find that the MLE approach works better in practice, especially with respect to pinning down the consumers who search intensively. Moreover, they find that the MLE procedure outperforms the MEL procedure in terms of fit.

Several papers have extended the Hong-Shum approach, most of them using an MLE approach as in Moraga-González and Wildenbeest (2008). A general finding is that in most of the markets studied, consumers either search very little (at most two times) or search a lot (close to all firms). This finding has been interpreted as evidence that some consumers use price comparison tools, which allow a consumer to get a complete picture of prices without having to visit each retailer individually.

Wildenbeest (2011) adds vertical product differentiation to the framework and derives conditions under which the model can still be estimated using price data only. Specifically, by assuming that consumers have identical preferences towards quality, that input markets are perfectly competitive, and that the quality production function has constant returns to scale, he maps a vertical product differentiation model into a standard homogeneous goods model in which firms play mixed strategies in utilities. The model is estimated using price data for a basket of staple grocery items sold across four major supermarket chains in the United Kingdom. The estimates indicate that approximately 39 percent of price variation is explained by search frictions, while the rest is due to quality differences among stores. About 91 percent of consumers search at most two stores, suggesting that little search is going on in this market. Moreover, ignoring vertical product differentiation when estimating the model leads to higher search cost estimates.

Moraga-González et al. (2013) focus on the non-parametric identification of search costs in the simultaneous search model and show that the precision of the estimates can be improved by pooling price data from different markets. They propose a semi-nonparametric (SNP) density estimator that uses a flexible polynomial-type parametric function, which makes it possible to combine data from different markets with the same underlying distribution of search costs but with different valuations, unit costs, and numbers of firms. The estimator is designed to maximize the joint likelihood from all markets, and as such the SNP procedure exploits the data more efficiently than the spline methods used in earlier papers (e.g., Hong and Shum, 2006; Moraga-González and Wildenbeest, 2008). To illustrate the estimation approach, Moraga-González et al. (2013) use a dataset of online prices for ten memory chips. Median search costs are estimated to be around $5. Search costs are dispersed, with most consumers having search costs high enough that it is optimal to search at most three stores, while a small fraction of consumers searches more than four times.

Blevins and Senney (2019) add dynamics to the model by allowing consumers to be forward looking. In addition to deciding how many times to search in each period, consumers have the option to continue searching in the next period. Per-period search costs can be estimated using the approach in Moraga-González and Wildenbeest (2008) or Wildenbeest (2011), but to estimate the bounds of the population search cost distribution, a specific policy function must be estimated. Blevins and Senney (2019) apply the estimation procedure to the online market for two popular




econometrics textbooks and find that median search costs for the dynamic model are much lower than for a static model, which suggests that search cost estimates are likely to be biased upwards when consumers' forward-looking behavior is ignored.

Sanches et al. (2018) develop a minimum distance approach to estimate search costs, which is easier to implement than previous methods. In addition, they propose a two-step sieve estimator for settings in which data from multiple markets are available. The sieve estimator only involves ordinary least squares estimation and is therefore easier to compute than other approaches that combine data from multiple markets, such as the SNP estimator of Moraga-González et al. (2013). As an illustration of their approach, Sanches et al. (2018) estimate search costs using online odds for English soccer matches as prices and find that search costs fell after bookmakers were allowed to advertise more freely as a result of a change in the law.

Nishida and Remer (2018) provide an approach to combine search cost estimates from different geographic markets and show how to incorporate wholesale prices and demographics into the Hong-Shum framework. Specifically, they first nonparametrically estimate market-specific search cost distributions for a large number of geographically isolated gasoline markets using a vertical product differentiation model similar to Wildenbeest (2011). They then use these estimates to parametrically estimate a search cost distribution that incorporates demographic information. Nishida and Remer (2018) find significant variation in search costs across the different geographic markets. Moreover, they find a positive relation between the estimated distribution of search costs and the income distribution.

Zhang et al. (2018) use a MEL approach, as in Hong and Shum (2006), and show how to incorporate sales data into the estimation of both the simultaneous and the sequential search model. They show that including sales data results in estimates that are less sensitive to assumptions about the maximum number of searches consumers can conduct. Moreover, the sequential search model can be estimated nonparametrically when both price and sales data are used. The model is estimated using price and transaction data for a chemical product in a business-to-business environment, and the sequential search model is found to provide a better fit.
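Many of the papers above share the same computational core: given candidate search shares q_k and a unit cost r, invert the equilibrium indifference condition to obtain F(p) at each observed price and evaluate the implied density. The following minimal sketch is our own illustration of that step for the oligopoly version in Moraga-González and Wildenbeest (2008); the function names, price data, and parameter values are all hypothetical.

```python
import numpy as np

# Illustrative sketch (ours, not the authors' code): evaluate the MLE
# objective given a candidate unit cost r, candidate search shares
# q = (q_1, ..., q_N), and a known upper bound p_bar of the price
# distribution.  For each observed price, solve the indifference
# condition (Eq. (6) with K = N) for z = 1 - F(p) by bisection, then
# evaluate the equilibrium density f(p).

def _solve_z(fun, lo=0.0, hi=1.0, iters=200):
    """Bisection for an increasing function with fun(lo) <= 0 <= fun(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if fun(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def loglik(prices, r, q, p_bar):
    N = len(q)
    const = (p_bar - r) * q[0]        # equilibrium profit at the upper bound
    ll = 0.0
    for p in prices:
        # (p - r) * sum_k q_k z^(k-1) = (p_bar - r) * q_1, with z = 1 - F(p)
        g = lambda z: (p - r) * sum(q[k] * z**k for k in range(N)) - const
        z = _solve_z(g)
        num = sum((k + 1) * q[k] * z**k for k in range(N))
        den = (p - r) * sum((k + 1) * k * q[k] * z**(k - 1) for k in range(1, N))
        ll += np.log(num / den)
    return ll

prices = np.array([0.6, 0.8])         # observed prices (illustrative)
ll = loglik(prices, r=0.0, q=[0.5, 0.5], p_bar=1.0)
```

In practice the shares q would be chosen to maximize this log-likelihood subject to q_k ≥ 0 and Σ q_k = 1.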

3.2.2 Estimation of search costs for vertically differentiated products

Hortaçsu and Syverson (2004) extend the methodology of Hong and Shum (2006) to the case in which products are allowed to be vertically differentiated and a sequential search protocol is followed. Vertical differentiation takes the form of an index based on observable product attributes and an unobservable attribute, where the index weightings, along with search cost parameters, can be estimated. Unlike in Hong and Shum (2006), price data alone are not sufficient to identify the model parameters; quantity and market share information, along with data on product characteristics, are also necessary. As in Hong and Shum (2006), nonparametric identification results are obtained for the underlying search cost distribution and the quality index for each product (which, in a manner similar to Berry (1994) and Berry et al. (1995), can be projected onto observable product characteristics with a suitable instrumental variable for the unobserved product attribute). While the model allows for specification


of preference heterogeneity across different consumer segments, horizontal differentiation in the form of additive random utility shocks is not considered; the model rationalizes nonzero market shares for dominated products through search costs. Empirically, Hortaçsu and Syverson (2004) study the S&P 500 index fund market, where substantial dispersion in fees is documented. While this may be surprising given that all index funds have the goal of replicating the return characteristics of the S&P 500 index, some return deviations across funds may exist, along with nonfinancial drivers of differentiation. Thus, the model allows for vertical differentiation between funds, with non-trivial market shares for dominated products arising from costly sequential search. The utility from investing in fund j is a linear function of fund characteristics:

u_j = X_j β − p_j + ξ_j,   (7)


where X_j are fund characteristics other than the price p_j, and ξ_j is an unobservable component. The coefficient on the price term is normalized to −1, so utilities are expressed in terms of basis points in fees (one hundredth of a percentage point). Thus one can think of u_j as specifying fund utility per dollar of assets the investor holds in it. Search costs are heterogeneous in the investor population and follow distribution G(c). As in Carlson and McAfee (1983), investors search with replacement and are allowed to revisit previously researched funds. Defining investors' belief about the distribution of funds' indirect utilities as H(u), the optimal search rule for an investor with search cost c_i is given by the reservation utility rule

c_i ≤ ∫_{u*}^{ū} (u − u*) dH(u),

where ū is the upper bound of H(u), and u* is the indirect utility of the highest-utility fund searched up to that point. Assuming that investors observe the empirical cumulative distribution function of funds' utilities, u_1 < ⋯ < u_N, the expression for H(u) becomes

H(u) = (1/N) Σ_{j=1}^{N} I[u_j ≤ u].

The optimal search rule yields critical cut-off points in the search cost distribution, given by

c_j = Σ_{k=j}^{N} ρ_k (u_k − u_j),   (8)
where ρk is the probability that fund k is sampled on each search and cj is the lowest possible search cost of any investor who purchases fund j in equilibrium. Funds’ market shares can be written in terms of the search cost cdf by using the search-cost cutoffs from Eq. (8). Only investors with very high search costs (c > c1 )




purchase the lowest-utility fund, u_1; all others continue to search. However, not all investors with c > c_1 purchase the fund; only those who happen to draw fund 1 first, which happens with probability ρ_1. Thus the market share of the lowest-utility fund is given by

q_1 = ρ_1 (1 − G(c_1)) = ρ_1 [1 − G( Σ_{k=1}^{N} ρ_k (u_k − u_1) )].   (9)
Analogous calculations produce a generalized market share equation for funds 2 to N:

q_j = ρ_j [ 1 + ρ_1 G(c_1)/(1 − ρ_1) + Σ_{k=2}^{j−1} ρ_k G(c_k) / ((1 − ρ_1 − ⋯ − ρ_{k−1})(1 − ρ_1 − ⋯ − ρ_k)) − G(c_j)/(1 − ρ_1 − ⋯ − ρ_{j−1}) ].   (10)

These equations form a system of linear equations linking market shares to cutoffs in the search cost distribution. Eq. (10) maps observed market shares to the cdf of the search cost distribution evaluated at the critical values. Given the sampling probabilities ρ_j, all G(c_j) can be calculated directly from market shares. Solving the linear system (10) to recover G(c_1), …, G(c_{N−1}) and using the fact that G(c_N) = 0 (Eq. (8) implies c_N = 0 and search costs cannot be negative) gives all critical values of the cdf. If the sampling probabilities are unknown and must be estimated, the probabilities as well as the search cost distribution can be parameterized as ρ(ω_1) and G(c; ω_2), respectively. Given ω_1 and ω_2 of small enough dimension, observed market shares can be used to estimate these parameters.

While market share data can be mapped into the cdf of the search cost distribution, market shares do not generally identify the level of the critical search cost values c_1, …, c_N, but only their relative positions in the distribution. However, shares do identify search cost levels in the special but often-analyzed case of homogeneous (in all attributes but price) products with unit demands, i.e., when u_j = u − p_j, where u is the common indirect utility delivered by the funds. In this case, Eq. (8) implies

c_j = Σ_{k=j}^{N} ρ_k ((u − p_k) − (u − p_j)) = Σ_{k=j}^{N} ρ_k (p_j − p_k).   (11)
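The mapping between shares, cutoffs, and the search cost cdf can be sketched in a few lines. The code below is our own illustration (not the authors' code, and all numerical values are made up): `shares_from_G` applies Eqs. (9)-(10), `G_from_shares` inverts the resulting triangular system, and `cutoffs_from_prices` applies the homogeneous-goods case of Eq. (11).

```python
import numpy as np

# Illustrative sketch (ours) of the share/cdf system in Hortaçsu and
# Syverson (2004).  Funds are ordered u_1 < ... < u_N; rho[j] is the
# probability fund j+1 is sampled on each search; G[j] = G(c_{j+1}).

def shares_from_G(rho, G):
    N = len(rho)
    cum = np.concatenate(([0.0], np.cumsum(rho)))   # cum[k] = rho_1+...+rho_k
    q = np.zeros(N)
    for j in range(N):                              # Eqs. (9)-(10)
        s = 1.0 - G[j] / (1.0 - cum[j])
        for k in range(j):
            s += rho[k] * G[k] / ((1.0 - cum[k]) * (1.0 - cum[k + 1]))
        q[j] = rho[j] * s
    return q

def G_from_shares(rho, q):
    """Invert the triangular system recursively: G(c_1) first, then on."""
    N = len(rho)
    cum = np.concatenate(([0.0], np.cumsum(rho)))
    G = np.zeros(N)
    for j in range(N):
        s = 1.0
        for k in range(j):
            s += rho[k] * G[k] / ((1.0 - cum[k]) * (1.0 - cum[k + 1]))
        G[j] = (s - q[j] / rho[j]) * (1.0 - cum[j])
    return G

def cutoffs_from_prices(rho, p):
    """Homogeneous case, Eq. (11): c_j = sum_{k >= j} rho_k (p_j - p_k)."""
    N = len(p)
    return np.array([sum(rho[k] * (p[j] - p[k]) for k in range(j, N))
                     for j in range(N)])

rho = np.array([0.4, 0.35, 0.25])     # sampling probabilities (illustrative)
G = np.array([0.8, 0.5, 0.0])         # cdf at the cutoffs; G(c_N) = 0
q = shares_from_G(rho, G)             # market shares; they sum to one
```

A useful consistency check is the round trip: shares computed from cdf values invert back to the same cdf values.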
Now, given sampling probabilities (either known or parametrically estimated), c_1, …, c_{N−1} can be calculated directly from observed fund prices using Eq. (11). In the more general case where products also differ in attributes other than price, information on fund companies' optimal pricing decisions is required to identify the cutoff search cost values. To do this, a supply side model has to be specified. Hortaçsu and Syverson (2004) assume that the F funds choose prices to maximize current

3 Early empirical literature

static profits. Let S be the total size of the market, p_j and mc_j be the price and (constant) marginal cost of fund j, and q_j be fund j's market share given the prices and characteristics of all sector funds. Then the profits of fund j are given by

π_j = S q_j(p, X)(p_j − mc_j).

Profit maximization implies the standard first-order condition for p_j:

q_j(p, X) + (p_j − mc_j) ∂q_j(p, X)/∂p_j = 0.   (12)
The elasticities ∂q/∂p faced by the fund are determined in part by the derivatives of the share equations (10). These derivatives are:

∂q_j/∂p_j = − ρ_1 ρ_j² g(c_1)/(1 − ρ_1) − ρ_2 ρ_j² g(c_2)/((1 − ρ_1)(1 − ρ_1 − ρ_2))
  − Σ_{k=3}^{j−1} ρ_k ρ_j² g(c_k) / ((1 − ρ_1 − ⋯ − ρ_{k−1})(1 − ρ_1 − ⋯ − ρ_k))
  − ( Σ_{k=j+1}^{N} ρ_k ) ρ_j g(c_j) / (1 − ρ_1 − ⋯ − ρ_{j−1}).   (13)
The pdf of the search cost distribution (evaluated at the cutoff points) enters the derivatives of the market share equations with respect to price (see Eq. (13)). Under Bertrand-Nash competition, the first-order conditions for prices (Eq. (12)) imply:

∂q_j(p)/∂p_j = − q_j(p)/(p_j − mc_j).   (14)


Given knowledge of marginal costs mc_j, we can compute ∂q_j/∂p_j using the first-order condition in Eq. (14). From Eq. (13), these derivatives form a linear system of N − 1 equations that can be used to recover the values of the search cost density function g(c) at the critical values c_1, …, c_{N−1}. If marginal costs are not known, they can be parameterized along with the search cost distribution and estimated from the price and market share data. Once both the values of the search cost cdf and pdf (evaluated at the cutoff search costs) have been identified, the levels of these cutoff search costs c_j in the general case of heterogeneous products can be identified. By definition, the difference between the cdf evaluated at two points is the integral of the pdf over that span of search costs. This difference can be approximated using the trapezoid method, i.e.,

G(c_{j−1}) − G(c_j) = 0.5 [g(c_{j−1}) + g(c_j)] (c_{j−1} − c_j).




This equation is inverted to express the differences between critical search cost values in terms of the cdf and pdf evaluated at those points, i.e.,

c_{j−1} − c_j = 2 [G(c_{j−1}) − G(c_j)] / [g(c_{j−1}) + g(c_j)].
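This backward recursion, starting from c_N = 0, can be sketched as follows (our illustration; the uniform distribution in the example is chosen because the trapezoid rule is exact for a linear cdf, so the cutoffs are recovered exactly).

```python
import numpy as np

# Illustrative sketch (ours) of the trapezoid-inversion step: given cdf
# values G[j] = G(c_{j+1}) and density values g[j] = g(c_{j+1}) at the
# unknown cutoffs, recover the cutoff levels using c_N = 0.

def recover_cutoffs(G, g):
    N = len(G)
    c = np.zeros(N)                       # c[N-1] = c_N = 0 by construction
    for j in range(N - 2, -1, -1):        # work backwards from c_N
        c[j] = c[j + 1] + 2.0 * (G[j] - G[j + 1]) / (g[j] + g[j + 1])
    return c

# Example: uniform search costs on [0, 1], so G(c) = c and g(c) = 1.
c = recover_cutoffs(np.array([0.8, 0.5, 0.0]), np.array([1.0, 1.0, 1.0]))
```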
Given the critical values of G(c) and g(c) obtained from the data above, one can recover the c_j, and from these trace out the search cost distribution.18 In non-parametric specifications, a normalization is required: the demand elasticity equations do not identify g(c_N), so a value must be chosen for the density at zero search costs (recall that c_N = 0). Finally, the critical values of the search cost distribution can be used to estimate the indirect utility function (Eq. (7)). The implied indirect utilities u_j of the funds are derived from the cutoff search costs via the linear system in Eq. (8) above.19 One can then regress the sum of these values and the respective fund's price (because of the imposed unit price coefficient) on the observable characteristics of the fund to recover β, the weights of the characteristics in the indirect utility function. One must be careful, however, as the unobservable components ξ are likely to be correlated with price, which would result in biased coefficients in ordinary least squares regressions. Therefore, as in Berry (1994) and Berry et al. (1995), one can use instrumental variables for price to avoid this problem. Estimation of the model using data on S&P 500 index funds between 1995 and 2000 reveals that product differentiation indeed plays an important role in this market: investors value funds' non-financial characteristics such as fund age, total number of funds in the fund family, and tax exposure. After taking vertical product differentiation into account, fairly small but heterogeneous search costs (the difference between the 25th and 75th percentiles varies between 0.7 and 28 basis points) can rationalize the very substantial price dispersion (the 75th percentile fund charged more than three times the fee charged by the 25th percentile fund).
The estimates also suggest that search costs shifted over time, consistent with the documented influx of high-search-cost, financially inexperienced mutual fund investors into the market during a period of sustained stock market gains.

Roussanov et al. (2018) utilize the Hortaçsu and Syverson (2004) model to analyze the broader market for U.S. equity mutual funds and find that investor search, together with mutual fund managers' marketing efforts to steer investor search towards their funds, can explain a substantial portion of the empirical relationship between mutual fund performance and mutual fund flows. Using their structural estimates, the authors find that marketing is a very important determinant of fund flows, along with performance, fees, and fund size. In a counterfactual exercise that bans marketing by mutual fund managers, Roussanov et al. (2018) find that capital shifts towards cheaper funds and is allocated in a manner more closely aligned with (estimated) manager skills.

Econometrically estimated search models have found applications in several other important financial products markets, where products are complex and consumers are relatively inexperienced and/or uninformed about contract details. Allen et al. (2013) estimate search costs that rationalize the dispersion of mortgage rates in Canada. Woodward and Hall (2012) find substantial dispersion in mortgage closing/brokerage costs and, using a model of search with broker competition, estimate large gains (exceeding $1,000 for most borrowers) from obtaining one more quote. An important modeling challenge in many financial products markets is that loans and most securities are priced through a process of negotiation. This poses an interesting econometric challenge in that the prices of alternatives not chosen by the consumer are not observed in the data. Woodward and Hall (2012), Allen et al. (2018), and Salz (2017) are recent contributions that address this problem by specifying an auction process between the lenders/providers in the consumer's consideration set. Given the importance of understanding the choice frictions consumers face in these markets, which have been under much scrutiny and regulatory action since the 2008 financial crisis, future research in this area is well warranted.

18 Any monotonically increasing function between the identified cutoff points could be consistent with the true distribution; the trapezoid approximation essentially assumes it is linear. The approximated cdf converges to the true function as the number of funds increases.

19 In the current setup, Eq. (8) implies that u_1 = 0, so fund utility levels are expressed relative to the least desirable fund. This normalization results from the assumption that all investors purchase a fund; if there is an outside good that could be purchased without incurring a search cost, one could alternatively normalize the utility of this good to zero.

4 Recent advances: Search and consideration sets

In this section, we discuss recent empirical work that makes an explicit connection between search and consideration, i.e., search is viewed as the process through which consumers form their consideration sets. While this idea may appear intuitive, the two streams of literature on consideration sets (in marketing) and consumer search (in economics) existed largely separately until recently. We organize this section by consumers' source of uncertainty: in Section 4.1, we discuss papers in which consumers search to resolve uncertainty about prices, and in Section 4.2, papers in which consumers search to resolve uncertainty about the match value or product fit.

4.1 Searching for prices

We structure our discussion of papers that model consumer search for prices by search method: in Mehta et al. (2003), Honka (2014), and De los Santos et al. (2012), consumers search simultaneously, whereas consumers search sequentially in Honka and Chintagunta (2017).




4.1.1 Mehta et al. (2003)

The goal of Mehta et al. (2003) is to propose a structural model of consideration set formation. The authors view searching for prices as the process that consumers undergo to form consideration sets. Mehta et al. (2003) apply their model to scanner panel data, i.e., data that contain consumer purchases together with marketing activities but no information on consideration sets, from two categories (liquid detergent and ketchup) with four brands in each. The authors find average predicted consideration set sizes between 1.8 and 2.8 across the two product categories, pointing to consumers incurring sizable search costs to resolve uncertainty about prices. Further, they find that in-store displays and feature ads significantly reduce search costs, while income significantly increases search costs. Lastly, Mehta et al. (2003) report that consumers' price sensitivity is underestimated if consumers' limited information is not taken into account.

In the following, we provide details on the modeling and estimation approach. As mentioned before, Mehta et al. (2003) develop a structural model in which consumers search simultaneously to learn about prices, and this search process leads to the formation of consumers' consideration sets. Consumer i's utility function is given by

u_ijt = θ q_ijt − p_ijt,

where prices p_ijt are assumed to follow an Extreme Value (EV) Type I distribution with p_j ∼ EV(p̄_j, σ²_{p_j}), and q_ijt is the perceived quality of a product, which is observed by both the consumer and the researcher.20 The parameter θ is consumer i's sensitivity to quality and is estimated. Note that there is no error term in the above utility specification. If an error term were included, Mehta et al. (2003) would not be able to separately identify the baseline search cost c_0 from the true qualities q_j of the brands.
Given the distributional assumption for prices, consumer i's utility also follows an EV Type I distribution:

u_ijt ∼ EV(θ q_ijt − p̄_j, σ²_{u_j}),   with σ_{p_j} = σ_{u_j}.

Mehta et al. (2003) use the choice model approach described in Section 2.2 to model consumers' choices of consideration sets, i.e., consumers calculate the net benefit of every possible consideration set and pick the one that gives them the largest net benefit. The choice model approach is feasible despite the curse of dimensionality of the simultaneous search model because Mehta et al. (2003) apply

20 The perceived quality is assumed to be updated in a Bayesian fashion after each product purchase. We refer the reader to Mehta et al. (2003) for details on this process.


their model to scanner panel data from two categories and focus on the four main brands in each category.21 The expected net benefit of a specific consideration set Υ (determined by its size k and its composition) is given by22

EB_Υ = (√6 σ_u / π) ln [ Σ_{l=1}^{k} exp( (π / (√6 σ_u)) (θ w_ilt − p̄_il) ) ] − Σ_{l=1}^{k} c_ilt,

with c_ilt = c_0 + W_ilt δ. The consumer picks the consideration set (determined by its size and composition) that maximizes the net benefit of searching. Once consumer i has searched all products in his consideration set, he has learned their prices and all uncertainty is resolved. The consumer then picks the product that provides him with the largest utility among the considered ones.

Next, we describe how Mehta et al. (2003) estimate their model. Consumer i's unconditional purchase probability is the product of consumer i's consideration and conditional purchase probabilities, i.e., P_ij = P_Ci P_ij|Ci. The probability that consumer i considers consideration set Υ (determined by its size k and its composition) can be written as

P(C_i = Υ) = P(EB_Υ ≥ EB_Υ′ ∀ Υ′ ≠ Υ).

Lastly, given that the qualities q_ijt are truncated normal random variables (see Mehta et al., 2003), the conditional choice probabilities are given by probit probabilities.
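With four brands, the enumeration over all 2⁴ − 1 = 15 non-empty consideration sets is easily tractable. The sketch below is our own illustration of this choice step (not the authors' code): θ, the qualities, mean prices, and search costs are all made-up values, and the Euler-constant term of the expected maximum is dropped because it is common to all sets and therefore does not affect the ranking.

```python
import numpy as np
from itertools import combinations

# Illustrative sketch (ours) of the consideration-set choice in Mehta
# et al. (2003): enumerate all non-empty subsets of four brands, compute
# the expected net benefit of each (expected maximum of EV Type I
# utilities minus total search costs), and pick the best set.

def expected_benefit(subset, theta, w, p_bar, c, sigma_u):
    idx = list(subset)
    mu = np.sqrt(6.0) * sigma_u / np.pi           # EV Type I scale parameter
    v = theta * w[idx] - p_bar[idx]               # expected utilities
    # E[max] of EV Type I utilities = mu * log-sum-exp (Euler term dropped)
    return mu * np.log(np.sum(np.exp(v / mu))) - c[idx].sum()

w = np.array([3.0, 2.5, 2.0, 1.5])        # perceived qualities (illustrative)
p_bar = np.array([1.0, 0.8, 0.9, 0.6])    # mean prices (illustrative)
c = np.array([0.10, 0.10, 0.15, 0.20])    # per-brand search costs (illustrative)
theta, sigma_u = 1.0, 0.5

sets = [s for r in range(1, 5) for s in combinations(range(4), r)]
best = max(sets, key=lambda s: expected_benefit(s, theta, w, p_bar, c, sigma_u))
```

The exponential growth of this enumeration in the number of brands is exactly the curse of dimensionality that larger choice sets, such as Honka's (2014) 17 insurers, must circumvent.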

4.1.2 Honka (2014)

Honka (2014) studies the auto insurance market. Insurance markets can be inefficient for several reasons, with adverse selection and moral hazard being the two most extensively studied. Honka (2014) investigates a different source of market inefficiency: market frictions. She focuses on two types of market frictions, namely search and switching costs, and estimates demand for auto insurance in their presence. Honka (2014) uses an individual-level data set in which, in addition to information on purchases, prices, and marketing activities, she also observes consumers' consideration sets. She finds search costs ranging from $35 to $170 and switching costs of $40. Further, she reports that search costs are the main driver of the very high customer retention rate in this industry and that their elimination is the main lever to increase consumer welfare.

21 Mehta et al. (2003) can reduce the number of consideration sets by dropping those that do not include the purchased brand.

22 Since prices and thus utilities (see Eq. (4.1)) follow an EV Type I distribution, the maximum of a set of EV Type I random variables also follows an EV Type I distribution (Johnson et al., 1995).



CHAPTER 4 Empirical search and consideration sets

In the following, we provide details on the modeling and estimation approach: as in Mehta et al. (2003), Honka (2014) also estimates a structural model in which consumers search simultaneously to resolve uncertainty about prices. She models search as the process through which consumers form their consideration sets. Consumer i's indirect utility for company j is given by

uij = αij + βi Iij,t−1 + γpij + Xij ρ + Wi φ + εij

with αij being consumer-specific brand intercepts. βi captures consumer inertia and can be decomposed into βi = β̃ + Zi κ with Zi being observable consumer characteristics. Iij,t−1 is a dummy variable indicating whether company j is consumer i's previous insurer. Note that, as observed heterogeneity interacts with Iij,t−1, it plays a role in the conditional choice decisions. The parameter γ captures a consumer's price sensitivity and pij denotes the price charged by company j. Note that – in contrast to the consumer packaged goods industry – in the auto insurance industry, prices depend on consumer characteristics. Prices pij follow an EV Type I distribution with location parameter ηij and scale parameter μ.23 Given that consumers know the distributions of prices in the market, they know ηij and μ. Xij is a vector of product- and consumer-specific attributes and Wi contains regional fixed effects, demographic and psychographic factors that are common across j. Although these factors drop out of the conditional choice decision, they may play a role in the search and consideration decisions. And lastly, εij captures the component of the utility that is observed by the consumer but not the researcher. Given that Honka (2014) studies a market with 17 companies, the curse of dimensionality of the simultaneous search model has to be overcome (see also Section 2.2).
To do so, she assumes first-order stochastic dominance among the price belief distributions and uses the optimal selection strategy for consideration sets suggested by Chade and Smith (2005).24 She assumes a specific form of first-order stochastic dominance, namely, that the price belief distributions have consumer- and company-specific means but the same variance across all companies, and tests the appropriateness of this assumption using data on prices. Note that the consumer makes the decisions of which and how many companies to search at the same time. For expository purposes, we first discuss the consumer's decision of which companies to search, followed by the consumer's decision of how many companies to search. Both decisions are jointly estimated. A consumer's decision regarding which companies to search depends on the expected indirect utilities (EIU; Chade and Smith, 2005), where the expectation is taken with respect to the

23 This means the PDF is given by f(x) = μ exp(−μ(x − η)) exp(−exp(−μ(x − η))) and the CDF is given by F(x) = exp(−exp(−μ(x − η))), with location parameter η and scale parameter μ. The mean is η + ec/μ and the variance is π²/(6μ²), where ec is the Euler constant (Ben-Akiva and Lerman, 1985).
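The moment formulas quoted in footnote 23 can be checked numerically; a stdlib-only simulation sketch with arbitrary parameter values:

```python
import math
import random

def sample_gumbel(eta, mu, n, seed=0):
    """Draw n EV Type I (Gumbel) variates with location eta and scale mu,
    via the inverse CDF: x = eta - ln(-ln(U)) / mu for U ~ Uniform(0,1)."""
    rng = random.Random(seed)
    return [eta - math.log(-math.log(rng.random())) / mu for _ in range(n)]

eta, mu, n = 2.0, 1.5, 200_000
xs = sample_gumbel(eta, mu, n)
mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n

euler_c = 0.5772156649
# Sample moments should be close to eta + ec/mu and pi^2 / (6 mu^2).
print(abs(mean - (eta + euler_c / mu)))
print(abs(var - math.pi ** 2 / (6 * mu ** 2)))
```

Both printed deviations are small for a sample of this size, consistent with the footnote's mean and variance expressions.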

24 Honka (2014) also assumes that search costs are not company-specific – an assumption that also has to be made to apply the theoretical results developed by Chade and Smith (2005).

4 Recent advances: Search and consideration sets

characteristic the consumer is searching for – in this case, prices. So consumer i's EIU is given by

E[uij] = αij + βi Iij,t−1 + γE[pij] + Xij ρ + Wi φ + εij.

Consumer i observes these EIUs for every company in his market (including εij). To decide which companies to search, consumer i ranks all companies other than his previous insurance provider (because the consumer gets a free renewal offer from the previous insurer) according to their EIUs (Chade and Smith, 2005) and then picks the top k companies to search. Rik denotes the set of top k companies consumer i ranks highest according to their EIU. For example, Ri1 contains the company with the highest expected utility for consumer i, Ri2 contains the companies with the two highest expected utilities for consumer i, etc. To decide on the number of companies k a consumer searches, the consumer calculates the net benefit of all possible search sets given the ranking of EIUs, i.e. if there are N companies in the market, the consumer can choose among N − 1 search sets (one free quote comes from the previous insurer). A consumer's benefit of a searched set is then given by the expected maximum utility among the searched brands. Given the EV distribution of prices, the maximum utility also has an EV distribution

max_{j∈Rik} uij ∼ EV( (1/b) ln Σ_{j∈Rik} exp(b aij), b )   (17)

with aij = αij + βi Iij,t−1 + γηij + Xij ρ + Wi φ + εij and b = μ/γ. If we further define ãRik = (1/b) ln Σ_{j∈Rik} exp(b aij), then the benefit of a searched set is given by

E[ max_{j∈Rik} uij ] = ãRik + ec/b

where ec denotes the Euler constant. The consumer picks the search set which maximizes his net benefit of searching, denoted by Γi,k+1, i.e. the expected maximum utility among the considered companies minus the cost of search, given by

Γi,k+1 = E[ max_{j ∈ Rik ∪ {j: Iij,t−1 = 1}} uij ] − kci.   (18)


The consumer picks the number of searches k which maximizes his net benefit of search. If a consumer decides to search k companies, he pays kci in search costs and has k + 1 companies in his consideration set. Consumers can be heterogeneous in both preferences and search costs. Consumer-specific effects in both the utility function and search costs are not identified because of the linear relationship between utilities and search costs in Eq. (18). If we increase, for example, the effect of a demographic factor in the utility function and decrease its effect on search costs by an appropriate amount, the benefit of a consideration set, Γi,k+1, would remain the same. In the empirical specification, Honka (2014) therefore controls for observed and unobserved heterogeneity in the utility function and for quoting channels (e.g. agent, insurer website) in search costs. This concludes the description of how a consumer forms his consideration set.

Once a consumer has formed his consideration set and received all price quotes he requested, all price uncertainty is resolved. Both the consumer and the researcher observe prices. The consumer then picks the company with the highest utility among the considered companies, with the utilities now including the quoted prices for consumer i by company j.

Next, we describe how Honka (2014) estimates her model. The crucial differences between what the consumer observes and what the researcher observes are as follows:
1. Whereas the consumer knows each company's position in the EIU ranking, the researcher only partially observes the ranking by observing which companies are being searched and which ones are not being searched.
2. In contrast to the consumer, the researcher does not observe αij and εij.

Honka (2014) tackles the first point by pointing out that partially observing the ranking contains information that allows her to estimate the composition of consideration sets. Because the consumer ranks the companies according to their EIU and only searches the highest-ranked companies, the researcher knows from observing which companies are searched that the EIUs among all the searched companies have to be larger than the EIUs of the non-searched companies or, to put it differently, that the minimum EIU among the searched companies has to be larger than the maximum EIU among the non-searched companies, i.e.

min_{j∈Si} E[uij] ≥ max_{j′∉Si} E[uij′].
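The ranking-and-size decision described above – order options by EIU, then choose the set size whose expected maximum utility net of search costs is largest – can be sketched as follows, assuming a common EV Type I scale b and a constant per-search cost (all numbers hypothetical):

```python
import math

EULER_C = 0.5772156649

def best_search_set(eiu, cost, b):
    """Rank options by expected indirect utility (Chade and Smith, 2005)
    and return the top-k set whose expected max utility minus k*cost is
    largest.  b is the common EV Type I scale parameter."""
    ranked = sorted(range(len(eiu)), key=lambda j: eiu[j], reverse=True)
    best_k, best_net = None, -math.inf
    for k in range(1, len(eiu) + 1):
        top = ranked[:k]
        # Closed-form expected maximum of EV Type I utilities (log-sum-exp).
        a_tilde = (1.0 / b) * math.log(sum(math.exp(b * eiu[j]) for j in top))
        net = a_tilde + EULER_C / b - k * cost
        if net > best_net:
            best_k, best_net = k, net
    return ranked[:best_k]

# Hypothetical EIUs for five insurers and a common per-quote search cost.
print(best_search_set([2.0, 1.5, 1.4, 0.3, -1.0], cost=0.25, b=1.0))  # → [0, 1, 2]
```

Only a handful of net-benefit evaluations are needed (one per set size), which is the point of the Chade and Smith (2005) ranking result.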

As a consumer decides simultaneously which and how many companies to search, the following condition has to hold for any searched set

min_{j∈Si} E[uij] ≥ max_{j′∉Si} E[uij′]  ∩  Γik ≥ Γik′ ∀ k′ ≠ k


i.e. the minimum EIU among the searched brands is larger than the maximum EIU among the non-searched brands, and the net benefit of the chosen searched set of size k is larger than the net benefit of any other search set of size k′. And finally, Honka (2014) accounts for the fact that the researcher does not observe αij and εij by integrating over their distributions. She assumes that α ∼ MVN(ᾱ, Σα), where ᾱ and Σα contain parameters to be estimated, and εij ∼ EV Type I(0, 1). Then the probability that a consumer picks a consideration set ϒ is


given by

Piϒ|α,ε = Pr( min_{j∈Si} E[uij] ≥ max_{j′∉Si} E[uij′]  ∩  Γi,k+1 ≥ Γi,k′+1 ∀ k′ ≠ k ).   (20)

Note that the quote from the previous insurer directly influences the consumer's choice of the size of a consideration set: a consumer renews his insurance policy with his previous provider if the utility of doing so is larger than the expected net benefit Γi,k+1 of any number of searches.

Next, she turns to the purchase decision given consideration. The consumer's choice probability conditional on his consideration set is

Pij|ϒ,α,ε = Pr( uij ≥ uij′ ∀ j′ ≠ j; j, j′ ∈ Ci )   (21)

where uij now contains the quoted prices. Note that there is a selection issue: given a consumer's search decision, the εij do not follow an EV Type I distribution and the conditional choice probabilities do not have a closed-form expression. The consumer's unconditional choice probability is given by

Pij|α,ε = Piϒ|α,ε Pij|ϒ,α,ε.   (22)


In summary, the researcher estimates the price distributions, only partially observes the utility rankings, and does not observe αij and εij in the consumer's utility function. Accounting for these issues, Honka (2014) derives an estimable model with the consideration set probability given by (20) and the conditional and unconditional purchase probabilities given by (21) and (22). Parameters are estimated by maximizing the joint likelihood of consideration and purchase given by

L = Π_{i=1}^{N} ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} Π_{l=1}^{L} [ Piϒ|α,ε Π_{j=1}^{J} Pij|ϒ,α,ε^δij ]^ϑil f(α) f(ε) dα dε

where ϑil indicates the chosen consideration set and δij the chosen company. Neither the consideration set nor the conditional purchase probability has a closed-form solution. Honka (2014) therefore uses a simulation approach to calculate them. In particular, she simulates from the distributions of αij and εij. She uses a kernel-smoothed frequency simulator (McFadden, 1989) in the estimation and smooths the probabilities using a multivariate scaled logistic CDF (Gumbel, 1961)

F(w1, . . . , wT ; s1, . . . , sT) = 1 / ( 1 + Σ_{t=1}^{T} exp(−st wt) )


where s1, . . . , sT are scaling parameters. McFadden (1989) suggests this kernel-smoothed frequency simulator, which satisfies the summing-up condition, i.e. that probabilities sum up to 1, and is asymptotically unbiased.
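A small sketch of the smoothing idea, assuming the wt are the slacks of the inequalities being smoothed (e.g. EIU and net-benefit differences); larger scaling parameters make the scaled logistic CDF approach the sharp indicator that all inequalities hold:

```python
import math

def smoothed_indicator(w, s):
    """Multivariate scaled logistic CDF (Gumbel, 1961):
    F(w_1..w_T; s_1..s_T) = 1 / (1 + sum_t exp(-s_t * w_t)).
    Smooths the indicator that every w_t >= 0; larger s_t -> sharper."""
    return 1.0 / (1.0 + sum(math.exp(-st * wt) for wt, st in zip(w, s)))

# All inequality slacks positive -> smoothed indicator close to one;
# one violated inequality -> close to zero.  Values are illustrative.
print(smoothed_indicator([0.8, 1.2], [30.0, 30.0]))   # ~1
print(smoothed_indicator([-0.5, 1.2], [30.0, 30.0]))  # ~0
```

Because the function is smooth in the parameters, simulated probabilities built from it are differentiable, which is what makes the kernel-smoothed frequency simulator practical.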




4.1.3 Discussion

In discussing Mehta et al. (2003) and Honka (2014), we start by pointing out the similarities: both papers estimate structural demand models, view search as the process through which consumers form their consideration sets, and use simultaneous search models. However, there are also multiple differences between the two papers. First, Mehta et al. (2003) only have access to purchase data and thus search cost identification comes from functional form, i.e. the data do not contain any direct search outcome measures (e.g. number of searches) and search costs are identified based on model non-linearities and fit. In contrast, Honka (2014) observes the sizes of consumers' consideration sets, and thus search costs are identified by variation in consumers' consideration set sizes in her empirical application. Second, the utility function in Mehta et al. (2003) does not have an error term for identification reasons. The addition of such an error term would preclude Mehta et al. (2003) from separately identifying baseline search costs and the true quality of brands. Third, Mehta et al. (2003) use an exclusion restriction: the authors assume that promotional activities such as in-store display and feature affect consumers' search costs but not their utilities. In contrast, advertising enters consumers' utilities (but not their search costs) in Honka (2014). Note that – without exogenous variation – the effects of advertising and promotional activities on both utility and search costs are only identified based on model non-linearities and fit. And lastly, Mehta et al. (2003) and Honka (2014) use different approaches to deal with the curse of dimensionality of the simultaneous search model. While Mehta et al. (2003) use the choice model approach, which makes estimation only feasible in categories with a small number of options, Honka (2014) applies the theory developed by Chade and Smith (2005) and develops an estimation approach that is feasible in categories with a large number of options. However, she has to assume first-order stochastic dominance among the price distributions and that search costs do not vary across firms – two assumptions that Mehta et al. (2003) do not have to make.

4.1.4 De los Santos et al. (2012)

De los Santos et al. (2012) develop empirically testable predictions of data patterns that identify whether consumers search simultaneously or sequentially (see also Section 5). They use online browsing and purchase data for books to test these predictions and find evidence that consumers search simultaneously. Next, De los Santos et al. (2012) estimate a simultaneous search model and find average search costs of $1.35.

In the following, we provide details on the modeling and estimation approach: De los Santos et al. (2012) estimate a model in which consumers search across differentiated online bookstores using simultaneous search. As in Mehta et al. (2003) and Honka (2014), consumers search for prices. Consumer i's indirect utility of buying the product at store j is given by uij = μj + Xi βj + αi pj + εij, where μj are store fixed effects, Xi are consumer characteristics, pj is store j's price for the product, and εij is an EV Type I-distributed utility shock that is observed by the consumer. Consumer search costs ci are consumer-specific and depend on consumer characteristics. Let miS denote the expected maximum utility of visiting the stores in S net of search


costs, i.e., miS = E[max_{j∈S} uij] − k·ci. By adding an EV Type I choice-set specific error term ζiS to miS, the probability that consumer i finds it optimal to search a subset of stores S can be written as

PiS = exp[miS/σζ] / Σ_{S′} exp[miS′/σζ],   (24)


where σζ is the scale parameter for ζiS. Conditional on visiting the stores in S, the probability of purchasing from store j is then Pij|S = Pr(uij > uik ∀ k ≠ j, k ∈ S). This probability does not have a closed-form solution, because a store with a higher ε draw is more likely to be selected in the choice-set selection stage. The probability of observing consumer i visiting all stores in S and buying from store j is found by multiplying the two probabilities, i.e., PijS = PiS Pij|S.   (25)
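The choice-set stage here is an ordinary logit over subsets; a sketch with hypothetical net benefits miS for three stores (the benefit function is made up for illustration):

```python
import itertools
import math

def subset_probabilities(m, sigma_zeta):
    """Eq.-(24)-style probabilities: P(S) proportional to exp(m_S / sigma_zeta),
    where m maps each candidate store subset to its expected net benefit."""
    weights = {S: math.exp(mS / sigma_zeta) for S, mS in m.items()}
    total = sum(weights.values())
    return {S: w / total for S, w in weights.items()}

stores = (0, 1, 2)
# Hypothetical net benefits m_iS for every non-empty subset: benefits grow
# with set size but search costs grow faster, so small sets dominate.
m = {S: 1.0 * len(S) - 0.4 * len(S) ** 2
     for k in range(1, 4) for S in itertools.combinations(stores, k)}
probs = subset_probabilities(m, sigma_zeta=1.0)
print(round(sum(probs.values()), 6))  # 1.0
```

The set-specific error ζiS is what delivers this closed form; without it, the subset probabilities would require simulation.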


De los Santos et al. (2012) estimate the model using simulated maximum likelihood. They observe the stores visited by consumers in their data, and use PijS in Eq. (25) to construct the log-likelihood function, i.e., the log-likelihood function is

LL = Σ_i log P̂ijS,

where P̂ijS is the probability that individual i bought at store j from the observed choice set S. To obtain a closed-form expression for E[max_{j∈S} uij], De los Santos et al. (2012) follow Mehta et al. (2003) and Honka (2014) in their assumption that prices follow an EV Type I distribution with known parameters γj and σ, i.e.

E[max_{j∈S} uij] = αi σ log( Σ_{j∈S} exp( (μj + Xi βj + εij + αi γj) / (αi σ) ) ).

Note that the utility shock εij that appears in both probabilities is integrated out using simulation, as part of a simulated maximum likelihood procedure.

4.1.5 Discussion

There are several similarities and differences between De los Santos et al. (2012) and the two previously discussed papers, Mehta et al. (2003) and Honka (2014). First, De los Santos et al. (2012) have more detailed data on consumer search than Mehta et al. (2003) or Honka (2014): while Mehta et al. (2003) do not observe consumer search at all in their data and Honka (2014) observes consumers' consideration sets, De los Santos et al. (2012) observe consumers' consideration sets and the sequence of searches. Second, the error structures in the three papers are different: the utility




function in Mehta et al. (2003) does not contain a classic error term. While there is an error term in the utility function in Honka (2014) and De los Santos et al. (2012), in De los Santos et al. (2012) there is also a search-set specific error term (see also Moraga-González et al., 2015). As shown in Eq. (24), the latter error term gives De los Santos et al. (2012) a closed-form solution for the search set probabilities. This closed-form solution makes estimation of the model easier, but may necessitate a discussion of what this search-set specific error term, which is assumed to be independent across (possibly similar) search sets, represents. And lastly, the positioning and contributions of the three papers are different: Mehta et al. (2003) is one of the first structural search models estimated with individual-level data on purchases. The contribution of this paper lies in the model development. Honka (2014) extends the model of Mehta et al. (2003) and develops an estimation approach that is feasible even in markets with a large number of alternatives. De los Santos et al.’s (2012) primary contribution is to show that consumers’ search method can be identified when the sequence of searches (and characteristics of searched products) made by individual consumers is observed in the data (see also Section 5).

4.1.6 Honka and Chintagunta (2017)

Similar to De los Santos et al. (2012), Honka and Chintagunta (2017) are also primarily interested in the question of search method identification. They analytically show that consumers' search method is identified by patterns of prices in consumers' consideration sets (see also Section 5). They use the same data as Honka (2014), i.e. cross-sectional data in which consumers' purchases and consideration sets are observed, to empirically test whether consumers search simultaneously or sequentially in the auto insurance industry. They find the price pattern to be consistent with simultaneous search. Then Honka and Chintagunta (2017) estimate both a simultaneous and a sequential search model and find preference and search cost estimates to be severely biased when the incorrect assumption on consumers' search method is made.

In the following, we discuss the details of the modeling and estimation approach for the sequential search model: Honka and Chintagunta (2017) develop an estimation approach for situations in which the researcher has access to individual-level data on consumers' consideration sets (but not the sequence of searches) and purchases. Suppose consumer i's utility for company j is given by

uij = αij + βpij + Xij γ + εij

where the εij are iid and observed by the consumer, but not by the researcher. The αij are brand intercepts and the pij are prices, which follow a normal distribution with mean μij^p. Even though the sequence of searches is not observed, observing a consumer's consideration set allows the researcher to draw two conclusions based on Weitzman's (1979) rules: First, the minimum reservation utility among the searched companies has to be larger than the maximum reservation utility among the non-searched companies (based on the selection rule), i.e.

min_{j∈Si} zij ≥ max_{j′∉Si} zij′.   (26)


Otherwise, the consumer would have chosen to search a different set of companies. And second, the stopping and choice rules can be combined into the following condition

max_{j∈Si} uij ≥ uij″ ∀ j″ ∈ Si \ {j}  and  max_{j∈Si} uij ≥ max_{j′∉Si} zij′   (27)


i.e. that the maximum utility among the searched companies is larger than any other utility among the considered companies and larger than the maximum reservation utility among the non-considered companies. Eqs. (26) and (27) are conditions that have to hold, based on Weitzman's (1979) rules for optimal behavior under sequential search, given the search and purchase outcome that is observed in the data. At the same time, it must also have been optimal for the consumer not to stop searching and purchase earlier given Weitzman's (1979) rules. The challenge is that the order in which the consumer collected the price quotes is not observed. The critical realization is that, given the parameter estimates, the observed behavior must have a high probability of having been optimal. To illustrate, suppose a consumer searches three companies. Then the parameter estimates also have to satisfy the conditions under which it would have been optimal for the consumer to continue searching after his first and second search. Formally, in the estimation, given a set of estimates for the unknown parameters, for each consumer i, rank all searched companies j according to their reservation utilities ẑit (the "^" symbol refers to quantities computed at the current set of estimates), where t = 1, ..., k indicates the rank of a consumer's reservation utility among the searched companies. Note that t = 1 (t = k) denotes the company with the largest (smallest) reservation utility ẑit. Further rank all utilities of searched companies in the same order as the reservation utilities, i.e. ûi,t=1 denotes the utility of the company with the highest reservation utility ẑi,t=1. Then, given the current parameter estimates, the following conditions have to hold

ûi,t=1 < ẑi,t=2,   max_{t∈{1,2}} ûit < ẑi,t=3


In other words, although by definition the reservation utility of the company with t = 1 is larger than that of the company with t = 2, the utility of the company with t = 1 is smaller than the reservation utility of the company with t = 2, thereby prompting the consumer to do a second search. Similarly, the maximum utility from the (predicted) first and second search has to be smaller than the reservation utility from the (predicted) third search; otherwise the consumer would not have searched a third time. Generally, for a consumer searching t = 2, . . . , k companies, the following set of conditions has to hold

∩_{l=2}^{k} { max_{t<l} ûit < ẑi,t=l }.

26 […] zK+1, which implies a lower bound of zK+1 − δj in the integral in Eq. (33).
27 For J = K the term (1 − Φ(zj − δj))πj has to be added to Eq. (33).
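Reservation utilities such as ẑit solve c = H(z), where H(z) is the expected gain from one more search at fallback z. A sketch for normally distributed utilities (a natural case when price beliefs are normal; the bisection solver and all values are illustrative, not Honka and Chintagunta's (2017) implementation):

```python
import math

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gains_from_search(z, mu, sigma):
    """H(z) = E[max(u - z, 0)] for u ~ N(mu, sigma^2)."""
    a = (z - mu) / sigma
    return sigma * (norm_pdf(a) - a * (1.0 - norm_cdf(a)))

def reservation_utility(c, mu, sigma, lo=-50.0, hi=50.0):
    """Solve H(z) = c by bisection; H is strictly decreasing in z."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gains_from_search(mid, mu, sigma) > c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

z = reservation_utility(c=0.1, mu=0.0, sigma=1.0)
print(round(gains_from_search(z, 0.0, 1.0), 6))  # ~0.1
```

Lower search costs push the reservation utility up, which is why the inequality conditions above discipline the search cost estimates.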




Product j's predicted market share ŝj is obtained by averaging the buying probabilities Pr(j) across consumers. The relation between market shares and sales ranks for pairs of products is modeled as follows:

I^S_jl = 1 if sj > sl; 0 otherwise.

Assuming that the difference between the actual and predicted market share has an iid normally distributed measurement error, i.e., sj = ŝj + ε^S_j with ε^S_j ∼ N(0, τS²/2), we get

Pr(I^S_jl = 1|Θ, X) = Φ( (ŝj(Θ, X) − ŝl(Θ, X)) / τS ).   (34)

The data used for estimation also contain information on the top products that were purchased conditional on searching for a specific product. These choices conditional on search correspond to the probability that product j is chosen conditional on searching an option l, i.e.,

Pr(j|l) = Σ_{K=max(j,l)}^{J} Pr(j, SK) / πl,

where K = j if j has a lower reservation utility than l and K = l otherwise, and πl is the probability of searching the lth option. Assuming that the difference between the predicted and observed conditional choice share data represents an iid measurement error, i.e., sj|l = ŝj|l(Θ|X) + ε^C_jl with ε^C_jl ∼ N(0, τC²), we get

Pr(sj|l |Θ, X) = φ( (ŝj|l(Θ, X) − sj|l) / τC ).   (35)



Combining the probabilities in Eqs. (32), (34), and (35) and summing over all relevant products gives the following log-likelihood function, which is used to estimate the model by maximum likelihood:

LL(Θ|Y, X) = Σ_j Σ_{l≠j} Σ_{k≠l} log Pr(I^V_{j,lk} = 1|Θ, X) + Σ_j Σ_{l≠j} log Pr(I^S_jl = 1|Θ, X) + Σ_j Σ_l log Pr(sj|l |Θ, X).
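The rank-pair term in Eq. (34) is just a normal CDF of the scaled difference in predicted shares; a minimal sketch (the shares and τS are made-up values):

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_outranks(share_j_hat, share_l_hat, tau_s):
    """Pr(product j's sales rank beats l's) = Phi((s_hat_j - s_hat_l)/tau_S),
    following the measurement-error setup around Eq. (34)."""
    return norm_cdf((share_j_hat - share_l_hat) / tau_s)

print(round(prob_outranks(0.30, 0.10, 0.10), 4))  # Phi(2.0) ≈ 0.9772
```

As τS shrinks, the probability approaches the hard indicator that the predicted shares agree with the observed ranking, which is what lets ranked data discipline the model parameters.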


Kim et al. (2017) estimate their model using view rank data, sales rank data, and data on choices conditional on search. They find mean and median search costs of $1.30 and $0.25, respectively, and predict median and mean search set sizes conditional on choice of 17 and 10 products, respectively. Kim et al. (2017) use their results to investigate substitution patterns in the camcorder market and conduct a market structure analysis using the framework of clout and vulnerability.


4.2.2 Moraga-González et al. (2018)

Moraga-González et al. (2018) develop a structural model of demand for car purchases in the Netherlands. The starting point for their search model is an aggregate random coefficients logit demand model in the spirit of Berry et al. (1995). However, whereas Berry et al. (1995) assume consumers have full information, Moraga-González et al. (2018) assume that consumers have to search to obtain information on εij in Eq. (30). As in Kim et al. (2017), consumers are assumed to search sequentially. To deal with the aforementioned search path dimensionality problem – which arises because the number of search paths that result in a purchase increases factorially in the number of products – Moraga-González et al. (2018) use insights from Armstrong (2017) and Choi et al. (2018) that make it possible to treat the sequential search problem as a discrete choice problem in which it is not necessary to keep track of which firms are visited by the consumer. Specifically, for every alternative (i.e. dealer) f, define the random variable wif = min{zif, uif}, where zif is the reservation utility for alternative f. According to Armstrong (2017) and Choi et al. (2018), the solution to the sequential search problem is equivalent to picking the alternative with the highest wif from all firms. To see that this is indeed optimal, consider the following example. Suppose there are three products, A, B, and C. The reservation and (ex-ante unobserved) realized utilities are 5 and 2 for product A, 10 and 4 for product B, and 3 and 7 for product C, respectively. Using Weitzman's optimal search rules, the consumer first searches product B because it has the highest reservation utility, but continues searching product A because the realized utility for product B of 4 is smaller than product A's reservation utility of 5.
The consumer then buys product B because the next-best product in terms of reservation utilities, product C, has a reservation utility of 3, which means that the highest observed realized utility of 4 does not justify searching further. Now, characterizing the search problem in terms of w avoids having to go through the specific ordering of firms in terms of reservation utilities and immediately reveals which product will be bought: since wB = min{10, 4} = 4 exceeds wA = min{5, 2} = 2 as well as wC = min{3, 7} = 3, product B will be purchased. Note that no additional assumptions have been made to resolve the search path dimensionality problem – all that is used is a re-characterization of Weitzman's optimal search rules.

To use this alternative characterization of Weitzman's optimal search rules in practice, the distribution of wif has to be derived. It can be obtained from the CDF of the minimum of two independent random variables:

F^w_if(x) = 1 − (1 − F^z_if(x))(1 − Fif(x)),

where Fif is the utility distribution and F^z_if is the distribution of reservation utilities. Since F^z_if(x) = 1 − F^c_if(H(x)), with F^c_if being the search cost CDF and H(x) being the gains from search, this can be written as

F^w_if(x) = 1 − F^c_if(H(x))(1 − Fif(x)).

The probability that consumer i buys from dealer f is then given by

Pif = Pr(wig < wif ∀ g ≠ f) = ∫ Π_{g≠f} F^w_ig(wif) f^w_if(wif) dwif,

where F^w_if and f^w_if are the CDF and PDF of wif, respectively. The probability that consumer i buys car j conditional on buying from dealer f is given by

Pij|f = exp(δij) / Σ_{h∈Gf} exp(δih).

The probability that buyer i buys product j is thus sij = Pij|f Pif. Note that these expressions are not necessarily closed-form. Although one can use numerical methods to directly estimate these expressions, this may slow down model estimation enough to make using it impractical. To speed up the estimation, Moraga-González et al. (2018) provide an alternative approach by working backwards and deriving a search cost distribution that gives a tractable closed-form expression for the buying probabilities. Specifically, a Gumbel distribution for w with location parameter δif − μif, where μif contains search cost shifters and parameters, can be obtained using the following search cost CDF:

F^c_if(c) = [1 − exp(−exp(−H0^{−1}(c) − μif))] / [1 − exp(−exp(−H0^{−1}(c)))],

where H0(z) = ∫_z^{+∞} (u − z) dF(u) represents the (normalized) gains from search at z. Product j's purchase probability then simplifies to

sij = exp(δij − μif) / ( 1 + Σ_{k=1}^{J} exp(δik − μig) ).

The closed-form expression for the purchase probabilities makes the model estimation of similar difficulty as most full information discrete choice models of demand. The estimation of the model closely resembles the estimation in Berry et al. (1995) – the most basic version of the model can be estimated using market shares, prices, and product characteristics, as well as a search cost shifter (e.g. distance from the consumer to the car dealer is used in the application). The similarity with the framework in Berry et al. (1995) also allows for dealing with price endogeneity: as in Berry et al. (1995), Moraga-González et al. (2018) allow for an unobserved quality component in the utility function, i.e., δij = αi pj + Xj βij + ξj, and allow ξj to be correlated with prices. When the model is estimated using aggregate data, the essence of the estimation method is to match predicted market shares to observed market shares, i.e., sj(ξj, θ) − ŝj = 0 for all products j, which gives a nonlinear system of equations in ξ. As in Berry et al. (1995), this system can be solved for ξ through a contraction mapping. The identification assumption is that the demand unobservables are mean independent of a set of exogenous instruments W, i.e., E[ξj|W] = 0, so that the model can be estimated using GMM while allowing for price endogeneity, as in Berry et al. (1995).

An important limitation of estimating the model using data on market shares and product characteristics is that variables that enter the search cost specification have to be excluded from the utility function. For instance, distance cannot both affect utility and search costs when only purchase data (either aggregate or individual-specific) is used to estimate the model. However, Moraga-González et al. (2018) show that when similar covariates appear in both the utility specification and the search cost specification, it is possible in their model to separate the effects of these common shifters using search data. Search behavior depends on reservation values, which respond differently to changes in utility than to changes in search costs; variation in observed search decisions therefore allows one to separately estimate the effects of common utility shifters and search cost shifters. To exploit this fully, Moraga-González et al. (2018) use moments based on individual purchase and search data for their main specification, which are constructed by matching predicted search probabilities to consumer-specific information on store visits from a survey.
The aggregate sales data is then used to obtain the linear part of utility, following the two-step procedure in Berry et al. (2004). Search costs are found to be sizable, which is consistent with the limited search activity observed in this market. Moreover, demand is estimated to be more inelastic in the search model than in a full information setting. The counterfactual results suggest that the price of the average car would be €341 lower in the absence of search frictions.
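The equivalence between Weitzman's rules and picking the argmax of wif = min{zif, uif} can be verified directly on the A/B/C example from the text; both routines below are illustrative sketches, not the authors' code:

```python
def weitzman_purchase(z, u):
    """Run Weitzman's rules: search in decreasing reservation-utility order,
    stop once the best realized utility exceeds the next reservation utility,
    and buy the best searched option.  Returns the purchased index."""
    order = sorted(range(len(z)), key=lambda j: z[j], reverse=True)
    best = None
    for j in order:
        if best is not None and u[best] >= z[j]:
            break  # stopping rule: no unsearched option justifies another search
        if best is None or u[j] > u[best]:
            best = j  # choice rule keeps track of the best realized utility
    return best

def min_zu_purchase(z, u):
    """Armstrong (2017) / Choi et al. (2018): buy argmax_j min(z_j, u_j)."""
    w = [min(zj, uj) for zj, uj in zip(z, u)]
    return w.index(max(w))

# The A/B/C example: reservation utilities and (ex-ante unobserved) utilities.
z = [5.0, 10.0, 3.0]  # A, B, C
u = [2.0, 4.0, 7.0]
print(weitzman_purchase(z, u), min_zu_purchase(z, u))  # both pick B (index 1)
```

The shortcut never enumerates search paths, which is exactly what removes the factorial dimensionality from the likelihood.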

4.2.3 Other papers

Here, we discuss several other papers that have also modeled consumer search for a match value. We start by reviewing papers that assume that consumers search sequentially and then discuss papers that assume that consumers search simultaneously. Koulayev (2014) analyzes search decisions on a hotel comparison site using clickstream data, i.e. individual-level data with observed search sequences. The paper models the decision to click on one of the listed hotels or to move to the next page of search results. Koulayev (2014) derives inequality conditions that are implied by search and click decisions and which are used to derive the likelihood function for the joint click and search behavior. He finds that search costs are relatively large: median search costs are around $10 per result page. Note that Koulayev's approach leads to



CHAPTER 4 Empirical search and consideration sets

multi-dimensional integrals for choice and search probabilities, which is manageable in settings with a small number of search options, but is potentially problematic in settings with larger choice sets. Jolivet and Turon (2018) derive a set of inequalities from Weitzman’s (1979) characterization of the optimal sequential search procedure and use these inequalities to set identify distributions of demand side parameters. The model is estimated using transaction data for CDs sold at a French e-commerce platform. Findings suggest that positive search costs are needed to explain 22 percent of the transactions and that there is heterogeneity in search costs. Dong et al. (2018) point out that search costs may lead to persistence in purchase decisions by amplifying preference heterogeneity and show that ignoring search frictions leads to an overestimation of preference heterogeneity. The authors use search and purchase decisions of individual consumers shopping for cosmetics at a large Chinese online store to separately identify preference heterogeneity from search cost heterogeneity. Two drawbacks are that the authors only observe a small proportion of consumers making repeat purchases in their data and that they have to add an error term to search costs to be able to estimate the model. Dong et al. (2018) find that the standard deviations of product intercepts are overestimated by 30 percent if search frictions are ignored, which has implications for targeted marketing strategies such as personalized pricing. Several studies have used a simultaneous instead of a sequential search framework when modeling consumer search for the match value. An advantage of simultaneous search is that search decisions are made before search outcomes are realized, which typically makes it easier to obtain closed-form expressions for purchase probabilities. 
However, as discussed in Section 2, the number of choice sets increases exponentially in the number of alternatives that can be searched (curse of dimensionality of the simultaneous search model). Moraga-González et al. (2015) show that tractability can be achieved by adding an EV Type I distributed choice-set specific error term to the search costs that correspond to a specific subset of firms. Murry and Zhou (2018) use this framework in combination with individual-level transaction data for new cars to quantify how geographical concentration among car dealers affects competition and search behavior. Donna et al. (2018) estimate the welfare effects of intermediation in the Portuguese outdoor advertising industry using a demand model that extends this framework to allow for nested logit preferences. Finally, Ershov (2018) develops a structural model of supply and demand to estimate the effects of search frictions in the mobile app market and uses the search model in Moraga-González et al. (2015) as a demand side model.
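To see how the choice-set-specific EV Type I shock restores tractability, the following sketch (hypothetical utilities and a linear search cost, not the paper's empirical model) enumerates all non-empty subsets and computes multinomial logit probabilities over them; the enumeration itself exhibits the 2^J growth noted above. The expected benefit of a subset is illustrated by the expected maximum of independent Gumbel match values, log Σ exp(δj), purely as an assumption for this example.

```python
import numpy as np
from itertools import combinations

def subset_choice_probs(delta, search_cost):
    """Multinomial logit over all non-empty subsets S of J alternatives:
    each subset's net search value gets an i.i.d. EV Type I shock.
    Benefit of S is illustrated by log(sum_j exp(delta_j)) for j in S."""
    delta = np.asarray(delta, dtype=float)
    J = len(delta)
    sets, values = [], []
    for k in range(1, J + 1):
        for S in combinations(range(J), k):
            benefit = np.log(np.exp(delta[list(S)]).sum())
            values.append(benefit - search_cost * k)  # linear search costs
            sets.append(S)
    values = np.asarray(values)
    expv = np.exp(values - values.max())  # stabilized logit over subsets
    return sets, expv / expv.sum()

sets, probs = subset_choice_probs(delta=[1.0, 0.5, 0.0], search_cost=0.3)
print(len(sets))   # 7 = 2^3 - 1 candidate search sets
```

With J = 20 alternatives the enumeration already has over a million subsets, which is the curse of dimensionality the closed form only partially mitigates.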

5 Testing between search methods

In this section, we discuss the identification of the search method consumers are using. Beyond intellectual curiosity, the search conduct, i.e. the search method, matters for consumers' decision-making: it influences which and how many products a consumer searches. More specifically, holding a consumer's preferences and search costs constant, a consumer might end up with a different consideration set depending on whether he searches simultaneously or sequentially.28 In fact, it can be shown that, again holding a consumer's preferences and search costs constant, a consumer who searches sequentially has, on average, a smaller search set than the same consumer searching simultaneously.29 From a researcher's perspective, this implies that estimates of consumer preferences and search costs will be biased under an incorrect assumption about the search method. This bias of consumer preference and search cost estimates can, in turn, lead to, for example, biased (price) elasticities and different results in counterfactual simulations. For example, Honka and Chintagunta (2017) show that consideration set sizes and purchase market shares of the largest companies are overpredicted when it is wrongly assumed that consumers search sequentially. For a long time, it was believed that the search method is not identified in observational data. In a first attempt to test between search methods, Chen et al. (2007) develop nonparametric likelihood ratio model selection tests which allow them to test between simultaneous and sequential search models. Using the Hong and Shum (2006) framework and data on prices, they do not find significant differences between the simultaneous and sequential search models at the usual significance levels in their empirical application. Next, we discuss two papers that developed tests to distinguish between simultaneous and sequential search using individual-specific data on search behavior. In Section 5.1, we discuss De los Santos et al.
(2012), who use data on the sequence of searches to discriminate between simultaneous and sequential search, whereas in Section 5.2, we discuss Honka and Chintagunta (2017), who develop a test that does not require the researcher to observe search sequences. Jindal and Aribarg (2018) apply the key identification insights from De los Santos et al. (2012) and Honka and Chintagunta (2017) to a situation of search with learning (see also Section 6.1). Using their experimental data, the authors find that, under the assumption of rational price expectations, consumers appear to search simultaneously, while the search patterns are consistent with sequential search conditional on consumers’ elicited price beliefs. This finding highlights the importance of the rational expectations assumption for search method identification.

5.1 De los Santos et al. (2012)

One of the objectives of De los Santos et al. (2012) is to provide methods to empirically test between sequential and simultaneous search. The authors use data on web browsing and online purchases of a large panel of consumers, which allows them

28 If a consumer has different consideration sets under simultaneous and sequential search, he might also end up purchasing a different product.
29 This is the case because, under sequential search, a consumer can stop searching when he gets a good draw early on.




to observe the online stores consumers visited as well as the store they ultimately bought from. Thus De los Santos et al. (2012) observe the sequence of visited stores, which is useful for distinguishing between sequential and simultaneous search. De los Santos et al. (2012) first investigate the homogeneous goods case with a market-wide price distribution. Recall that, under simultaneous search, a consumer samples all alternatives in his search sets before making a purchase decision. In contrast, under sequential search, a consumer stops searching as soon as he gets a price below his reservation price (see also Section 2.2). Since the latter implies that the consumer purchases from the last store he searched, revisits should not be observed when consumers search sequentially, while they may be observed when consumers search simultaneously (a consumer may revisit a previously searched store to make a purchase). Whether or not consumers revisit stores can be directly explored with data on search sequences. De los Santos et al. (2012) find that approximately one-third of consumers revisit stores – a finding that is inconsistent with a model of sequential search for homogeneous goods. Recall that the no-revisit property of the sequential search model for homogeneous goods does not necessarily apply to more general versions of the sequential search model, including models in which stores are differentiated. Specifically, if goods are differentiated, the optimal sequential search strategy is to search until a utility level is found that exceeds the reservation utility of the next-best alternative. As pointed out in Section 2, when products are differentiated, reservation utilities are generally declining, so a product that was not good enough early on in the search may pass the bar after a number of different products have been inspected, triggering a revisit to make a purchase. In the following, we show a simple example of how that can happen. 
Suppose a consumer wants to purchase one of five products denoted by A, B, C, D, and E. Table 2 shows (realized) utilities u and reservation utilities z for all five alternatives. Note that the alternatives are ranked in decreasing order of their reservation utilities z. In this example, the consumer first searches A – the alternative with the highest reservation utility. Since the reservation utility of the next-best alternative B is larger than the highest utility realized so far, i.e. zB > uA, he continues to search. The consumer also decides to continue searching after having sampled options B and C since zC > max{uA, uB} and zD > max{uA, uB, uC}. However, after having searched alternative D, the consumer stops because the reservation utility of option E is smaller than the maximum utility realized so far, i.e. zE < max{uA, uB, uC, uD}.30 The maximum realized utility among the searched options is offered by alternative B with uB = 9. The consumer therefore revisits and purchases B. Thus, for differentiated goods, revisits can happen when consumers search sequentially. For the researcher, this means that evaluating the revisit behavior of consumers does not help to discriminate between simultaneous and sequential search in such a setting. De los Santos et al. (2012) point out that a more robust difference between sequential and simultaneous search is that the search behavior depends on observed search

30 Here the assumption of perfect recall made in Section 2.2 comes into play.


Table 2 Example.

Option   Utility (u)   Reservation utility (z)
A        7             15
B        9             13
C        8             11
D        6             10
E        11            7
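The search sequence just described can be reproduced mechanically. The sketch below (an illustration of Weitzman's rule written for this chapter's example, not code from any of the cited papers) runs the optimal sequential search policy on the Table 2 numbers and exhibits the revisit to option B:

```python
def weitzman_search(options):
    """Weitzman's (1979) optimal sequential search: inspect options in
    decreasing order of reservation utility z; stop as soon as the best
    realized utility so far weakly exceeds the next reservation utility;
    then purchase the best searched option (perfect recall)."""
    ranked = sorted(options.items(), key=lambda kv: -kv[1]["z"])
    searched, best = [], float("-inf")
    for i, (name, vals) in enumerate(ranked):
        searched.append(name)
        best = max(best, vals["u"])
        next_z = ranked[i + 1][1]["z"] if i + 1 < len(ranked) else float("-inf")
        if best >= next_z:
            break
    purchase = max(searched, key=lambda n: options[n]["u"])
    return searched, purchase

# realized utilities u and reservation utilities z from Table 2
table2 = {
    "A": {"u": 7,  "z": 15},
    "B": {"u": 9,  "z": 13},
    "C": {"u": 8,  "z": 11},
    "D": {"u": 6,  "z": 10},
    "E": {"u": 11, "z": 7},
}
searched, purchase = weitzman_search(table2)
print(searched)   # ['A', 'B', 'C', 'D'] — E's reservation utility is too low
print(purchase)   # 'B' — a revisit: B was searched second but bought last
```

Note that option E has the highest realized utility (u = 11) but is never searched, because its reservation utility falls below the best utility already in hand.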

outcomes under the former, but not under the latter search method. This insight forms the basis of a second test, which uses the following idea: if consumers search sequentially and know the price distribution(s), they should be more likely to stop searching after getting a below-mean price draw as opposed to an above-mean price draw. The reason is as follows: due to the negative price coefficient in the utility function, a below-mean price draw results in an above-mean utility, i.e. u ≥ E[u], and an above-mean price draw results in a below-mean utility, i.e. u ≤ E[u]. Holding everything else constant, the consumer is more likely to stop the search with an above-mean utility draw than a below-mean utility draw since the stopping rule is more likely to be satisfied, i.e. the maximum utility among the searched options is more likely to be larger than the maximum reservation utility among the non-searched options. Based on this idea, the search method can be (empirically) identified as follows: if consumers search sequentially, consumers who get a below-mean price draw should be significantly more likely to stop searching after that below-mean price draw. To address store differentiation, this test can be carried out within a store, i.e. if a product has a high price relative to the store's price distribution, sequentially searching consumers are more likely to continue searching, while the high price (relative to the store's price distribution) should not affect the behavior of simultaneously searching consumers. In their empirical application, De los Santos et al. (2012) do not find any evidence for search decisions being dependent on observed prices – even within stores. They conclude that a simultaneous search model fits the data better than a sequential search model.
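A stylized simulation illustrates the logic of this test. The Normal price distribution and the fixed reservation price below are hypothetical choices for this sketch: a sequential searcher's decision to stop depends on the realized draw, so stopping is far more frequent after below-mean draws, whereas a simultaneous searcher's number of searches is fixed ex ante and independent of the draws.

```python
import numpy as np

rng = np.random.default_rng(0)

def sequential_stop_rates(n=100_000, mean=10.0, sd=2.0, reservation=9.0):
    """Sequential search for a homogeneous good: the consumer stops as
    soon as a drawn price falls below the reservation price.  Returns
    the stopping frequency after below-mean and after above-mean draws."""
    prices = rng.normal(mean, sd, n)   # one draw per simulated search
    below = prices < mean
    stop = prices < reservation
    return stop[below].mean(), stop[~below].mean()

stop_below, stop_above = sequential_stop_rates()
print(stop_below > stop_above)  # True: stopping co-moves with the price draw
```

Under simultaneous search, by contrast, the regression of the stopping decision on the current price draw would recover a zero coefficient, which is exactly the pattern De los Santos et al. (2012) find in their data.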

5.2 Honka and Chintagunta (2017)

As stated previously, Honka and Chintagunta (2017) also study search method identification. This paper provides an analytical proof of search method identification. It also shows that it is not necessary to observe the sequences of consumers' searches (which was a crucial assumption in De los Santos et al., 2012); only data on search sets, purchases, and purchase prices are needed for search method identification. Suppose prices follow some well-defined (potentially company- and/or consumer-specific) distributions and define Pr(pij < μpij) = λ, i.e. the probability that a price draw is below the expected price is λ. Further define the event X = 1 as a below-price-expectation price draw and X = 0 as an above-price-expectation price draw. Recall that, under simultaneous search, the consumer commits to searching a




set Si consisting of ki companies. Then Honka and Chintagunta (2017) calculate the expected proportion of below-price-expectation prices in a consumer's consideration set of size ki as

E[(1/ki) ∑_{m=1}^{ki} Xm] = (1/ki) ∑_{m=1}^{ki} E[Xm] = λki/ki = λ.    (36)


Thus, if consumers search simultaneously, a researcher can expect λ% of the price draws in consumers' consideration sets to be below and (1 − λ)% to be above the expected price(s). The crucial ingredients for identification are that the researcher observes the means of the price distributions μpij, the actual prices pij in consumers' consideration sets, and the probability λ of a price draw being below its mean. Under sequential search, Honka and Chintagunta (2017) show that, for both homogeneous and differentiated goods and allowing for consumer- and company-specific search costs, the expected proportion of below-price-expectation prices in consumers' consideration sets of size one, X1, is always larger than λ, i.e. X1 > λ, under the necessary condition that a positive number of consumers is observed to make more than one search. The characteristic price patterns for simultaneous and sequential search described above hold for all models that satisfy the following set of assumptions: (a) prices are the only source of uncertainty for the consumer, which he resolves through search; (b) consumers know the distribution of prices and have rational expectations for these prices; (c) price draws are independent across companies; (d) there is no learning about the price distribution from observing other variables (e.g. advertising); (e) (search costs are sufficiently low so that) all consumers search at least once; and (f) consumers have non-zero search costs. Models that satisfy this set of assumptions include (1) models for homogeneous goods, (2) models for differentiated products, (3) models that include unobserved heterogeneity in preferences and/or search costs, (4) models with correlations between preferences and search costs, and (5) models with observed heterogeneity in the price distribution means μpij. 
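The λ-pattern in Eq. (36) and the sequential-search deviation from it can be checked by simulation. The sketch below uses hypothetical Normal prices and a fixed reservation price: under simultaneous search, the average proportion of below-mean draws in a consideration set equals λ (here 0.5), while under sequential search it exceeds λ because consumers stop on good (low) draws.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_share_below(sequential, n_consumers=20_000, mean=10.0, sd=2.0,
                     k_fixed=3, reservation=9.0, max_search=5):
    """Average proportion of below-mean price draws in simulated
    consideration sets under simultaneous vs. sequential search."""
    shares = []
    for _ in range(n_consumers):
        if sequential:
            draws = []
            for _ in range(max_search):
                p = rng.normal(mean, sd)
                draws.append(p)
                if p < reservation:   # good draw: stop searching
                    break
        else:
            draws = rng.normal(mean, sd, k_fixed)  # set size fixed ex ante
        draws = np.asarray(draws)
        shares.append((draws < mean).mean())
    return float(np.mean(shares))

lam = 0.5  # Pr(price draw below its mean) for a symmetric distribution
print(abs(mean_share_below(sequential=False) - lam) < 0.01)  # True
print(mean_share_below(sequential=True) > lam)               # True
```

A researcher observing the price means, the realized prices in consideration sets, and λ can therefore compare the empirical share of below-mean prices with λ to discriminate between the two search methods.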
On the other hand, the researcher would not find the characteristic price patterns when there is unobserved heterogeneity in the price distribution means, as the researcher would no longer be able to judge whether a price draw is above or below its mean. Note also that the identification arguments in Honka and Chintagunta (2017) are based on the first moments of prices; in principle, there could be identification rules based on higher moments as well. Lastly, Honka and Chintagunta (2017) discuss the modeling assumptions stated in the previous paragraph and to what extent the search method identification results depend on them. Assumptions (a) through (e) are standard in both the theoretical and empirical literature on price search. With regard to assumption (f) that consumers have non-zero search costs, note that search costs only have to be marginally larger than zero for search method identification to hold in all model specifications. Alternatively, if the researcher believes that the assumption of non-zero search costs is not appropriate in an empirical setting, search method identification is also given under


the assumption that the search cost distribution is continuous, i.e. has support from 0 to a positive number A > 0.31

6 Current directions

In this section, we review some of the current directions in which the search literature is moving. As discussed in Section 2, this largely involves relaxing some of the rather stringent assumptions made in that section and/or developing new modeling approaches to understand more detailed data on consumer search, especially from the online environment. We start by discussing papers which study search with learning, i.e. research that relaxes the assumption that consumers know the true price or match value distribution (Assumption 2 in Section 2.1). Instead, these papers try to characterize optimal consumer search in the presence of concurrent learning of the price or match value distribution. In the following section, we discuss papers which investigate search for multiple attributes, e.g. price and match value. In Section 6.3, we review papers that study the role of advertising in markets characterized by consumer search. In the three subsections that follow, we describe research that focuses on how consumers search in an online environment. This includes papers that look at search and rankings in Section 6.4, papers that try to quantify the optimal amount of information shown to consumers in Section 6.5, and papers that work with granular, i.e. extremely detailed, data on consumer search in Section 6.6. We conclude this section by discussing papers that investigate the intensive margin of search, i.e. search duration, allowing for revisits of sellers in Section 6.7 (relaxing Assumptions 3 and 5 of Sections 2.1 and 2.2) and papers that incorporate dynamic aspects of consumer search in Section 6.8.

6.1 Search and learning

A standard assumption in the consumer search literature is that consumers know the distribution from which they sample (Assumption 2 in Section 2.1). However, starting with Rothschild (1974), several theoretical papers have studied search behavior in the case that consumers have uncertainty about the distribution from which they sample (Rosenfield and Shapiro, 1981; Bikhchandani and Sharma, 1996). Although the empirical literature has largely followed Stigler's (1961) initial assumption that the distribution is known, several recent studies have departed from it and assume that consumers learn about the distribution of prices or utilities while searching. To quickly recap, Rothschild (1974) studies optimal search rules when individuals are searching from unknown distributions and use Bayesian updating to revise their

31 Note that it is not required that the search cost distribution is continuous over its full range. It is only

required that it is continuous over the interval 0 to A > 0. The search method identification goes through when a search cost distribution has support e.g. from 0 to A and from B to C with C ≥ B > A > 0.




priors when new information arrives. An important example in his paper is the case of a Dirichlet prior distribution: if prior beliefs follow a Dirichlet distribution, then the reservation value property continues to hold, i.e. the qualitative properties of optimal search rules that apply to models in which the distribution is known carry over to the case of an unknown distribution. Koulayev (2013) uses this result to derive a model of search for homogeneous products with Dirichlet priors that can be estimated using only aggregate data such as market shares and product characteristics. An attractive feature is that Dirichlet priors imply that search decisions can be characterized by the number of searches carried out to date as well as the best offer observed up to that point. This feature simplifies integrating out search histories (which are unobserved in Koulayev's (2013) application) and makes it possible to derive a closed-form expression for ex-ante buying probabilities. The Dirichlet distribution is a discrete distribution. In settings in which consumers search for a good product match, which is typically modeled as an IID draw from a continuous distribution, a continuous prior may be more appropriate. Bikhchandani and Sharma (1996) extend the Rothschild (1974) model to allow for a continuous distribution of offerings by using a Dirichlet process – a generalization of the Dirichlet distribution. Dirichlet process priors also imply that the only parts of the search history that matter for search decisions are the identity of the best alternative observed so far and the number of searches to date, which simplifies the estimation of such a model. The search model in Häubl et al. (2010) features learning of this type and is empirically tested using data from two experiments. De los Santos et al. (2017) also use this property to develop a method to estimate search costs in differentiated products markets. 
Specifically, the paper uses observed search behavior to derive bounds on a consumer's search cost: if a consumer stops searching, this implies that she found a product with a higher realized utility than her reservation utility. If she continues searching, her search costs should have been lower than the gains from search relative to the best utility found so far. Learning enters through the equation describing the gains from search, i.e.,

G(ûit) = (W / (W + t)) ∫_{ûit}^{∞} (u − ûit) · h(u) du,    (37)

where h(u) is the density of the initial prior distribution, W is the weight put on the initial prior, and t represents the number of searches to date. The term W/(W + t) differentiates Eq. (37) from the non-learning case and reflects the updating process of consumers. Intuitively, every time a utility lower than ûit is drawn, the consumer puts less weight on offers that exceed ûit. If t = 0 and h(u) corresponds to the utility distribution, then Eq. (37) equals the gains-from-search equation in the standard sequential search model. Dzyabura and Hauser (2017) study product recommendations and point out that, in an environment in which consumers are learning about their preference weights while searching, it may not be optimal to recommend the product with the highest probability of being chosen or the product with the highest option value. Instead, the optimal recommendation system encourages consumers to learn by suggesting products


with diverse attribute levels, undervalued products, and products that are most likely to change consumers' priors. Synthetic data experiments show that recommendation systems that have these elements perform well. Jindal and Aribarg (2018) conduct a lab experiment during which consumers search and learn the price distribution for a household appliance at the same time. The authors elicit each consumer's belief about the price distribution before the first search and after every search the consumer decides to make. Jindal and Aribarg (2018) observe that consumers update their beliefs about the price distribution while searching. Using their experimental data, the authors show that not accounting for belief updating or assuming rational expectations biases search cost estimates. The direction of the bias depends on the position of prior beliefs relative to the true price distribution. Further, Jindal and Aribarg (2018) find that accounting for the means of the belief distribution mitigates the bias in search cost estimates substantially; the standard deviation of the belief distribution has a relatively small impact on the distribution of search costs, and hence, the bias. Most of the previously mentioned papers (e.g. Rothschild, 1974; De los Santos et al., 2017; Dzyabura and Hauser, 2017) assume that consumers are learning while searching and then derive implications for optimal consumer search and/or supply side reactions for such an environment. Crucial, unanswered questions remain: with observational data, is it possible to identify whether consumers know or learn the distribution of interest while searching? What data would be necessary to do so and what would be the identifying data patterns? Furthermore, how quickly do consumers learn? Can companies influence the learning process and how? These and other unanswered questions related to concurrent search and learning represent ample opportunity for future research.
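For a concrete sense of the gains-from-search expression in Eq. (37), the option-value integral can be evaluated in closed form when the prior density h(u) is Normal (a distributional assumption made here purely for illustration): the integral equals σφ(z) − (û − μ)(1 − Φ(z)) with z = (û − μ)/σ, and the learning weight W/(W + t) shrinks the gains as search experience t accumulates.

```python
import math

def gains_from_search(u_hat, t, W=5.0, mu=0.0, sigma=1.0):
    """Gains from one more search under learning, as in Eq. (37):
    G(u_hat) = W/(W + t) * integral_{u_hat}^{inf} (u - u_hat) h(u) du,
    evaluated in closed form for a Normal(mu, sigma) prior density h."""
    z = (u_hat - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # phi(z)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # Phi(z)
    option_value = sigma * pdf - (u_hat - mu) * (1.0 - cdf)
    return (W / (W + t)) * option_value

# gains from search shrink as the number of searches t grows,
# holding the best utility found so far (u_hat) fixed
print(gains_from_search(0.5, t=0) > gains_from_search(0.5, t=5) > 0.0)  # True
```

At t = 0 the weight is one and the expression reduces to the standard sequential-search gains; as t grows, continuing to search becomes less attractive even if no better offer has been found.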

6.2 Search for multiple attributes

So far, work discussed in this chapter has modeled consumers' search for the specific value of a single attribute, e.g. price or match value. However, in practice, consumers might be searching to resolve uncertainty about more than one attribute.32 Researchers have developed different approaches to address this issue. For example, De los Santos et al. (2012) derive a closed-form solution for the benefit of searching for the case that consumers search simultaneously for both prices and match values (epsilons). However, their solution requires the researcher to assume that both prices and epsilons follow EV Type I distributions and that the two distributions are independent. Chen and Yao (2017) and Yao et al. (2017) pursue a different approach: while consumers search for multiple attributes in their sequential search models, in both papers the authors assume that consumers know the joint distribution of these attributes. Consumers then search to resolve uncertainty about the (one) joint distribution. Thus Chen and Yao (2017) and Yao et al. (2017) model search for multiple attributes by 32 Match value search models sometimes describe the match value as a summary measure for multiple





imposing an additional assumption, i.e. that consumers know the joint distribution of the attributes, which allows them to apply the standard solution for a sequential search model for a single attribute. While assuming that consumers know the joint distribution of multiple attributes or developing closed-form solutions under specific distributional assumptions are important steps forward, ample research opportunities remain to develop empirical models of search for multiple attributes with less stringent assumptions. On a different note, Bronnenberg et al. (2016) describe the values of attributes consumers observe while searching. However, an unanswered question is how many and which attributes a consumer searches for, i.e. resolves uncertainty about, versus simply observes because their values are shown to the consumer by default. Field experiments might help shed light on this issue.

6.3 Advertising and search

Researchers have long been interested in how advertising affects consumers' decision-making in markets that are characterized by consumers' limited information, i.e. markets in which consumers search and form consideration sets. The consideration set literature (Section 3.1) has often modeled consideration as a function of advertising and has often found advertising to have a significant effect on consideration, often larger than its effect on choice. For example, Terui et al. (2011) report advertising to significantly affect consideration but not choice. In the economics literature, by assumption, Goeree (2008) models advertising as affecting consideration but not choice. In a recent paper, using data on individual consumers' awareness, consideration, and choices in the auto insurance industry over a time period of nine years, Tsai and Honka (2018) find advertising to significantly affect consumer awareness, but not conditional consideration or conditional choice.33 Further, the authors report that the advertising content that leads to consumers' increased awareness is of non-informational nature, i.e. fun/humorous and/or brand name focused, implying that the effect on awareness is coming from non-informational content leading to better brand recall. Honka et al. (2017) develop a structural model which describes the three stages of the purchase process: awareness, consideration, and choice. It is one of the first papers that accounts for endogeneity – here: of advertising – within a model of consumer search. Potential endogeneity of advertising may arise in any or all stages of the purchase process and is addressed using the control function approach (Petrin and Train, 2010). The model is calibrated with individual-level data from the U.S. retail banking industry in which the authors observe consumers' (aided) awareness and

33 Tsai and Honka (2018) observe unaided and aided awareness sets to, on average, contain 4.15 and 12.02 brands, respectively. Looking at shoppers and non-shoppers separately, as expected, shoppers have larger unaided and aided awareness sets than non-shoppers. Consideration sets, on average, contain 3.12 brands (which includes the previous insurance provider).


consideration sets as well as their purchase decisions. Consumers are, on average, aware of 6.8 banks and consider 2.5 banks. In modeling consumer behavior, Honka et al. (2017) view awareness as a passive occurrence, i.e. the consumer does not exert any costly effort to become aware of a bank. A consumer can become aware of a bank by, for example, seeing an ad or driving by a bank branch. Consideration is an active occurrence, i.e. the consumer exerts effort and incurs costs to learn about the interest rates offered by a bank. The consumer's consideration set is thus modeled as the outcome of a simultaneous search process given the consumer's awareness set (à la Honka, 2014). And finally, purchase is an active but effortless occurrence in which the consumer chooses the bank which gives him the highest utility. The consumer's purchase decision is modeled as a choice model given the consumer's consideration set. Consideration and choice are modeled in a consistent manner by specifying the same utility function for both stages. This assumption is supported by Bronnenberg et al. (2016), who find that consumers behave similarly during the search and purchase stages. Honka et al. (2017) find that advertising primarily serves as an awareness shifter. While the authors also report that advertising significantly affects utility, the latter effect is much smaller in terms of magnitude. Advertising makes consumers aware of more options; thus consumers search more and find better alternatives than they would otherwise. In turn, this increases the market share of smaller banks and makes the U.S. banking industry more competitive. Further study of how advertising interacts with the process through which consumers search/consider products is a potentially very fruitful area for future research. 
For example, whether advertising in which explicit comparisons with competing products' attributes and/or prices are made enlarges consumers' consideration sets is a very interesting question (though how consumers may evaluate firms' choices of which competitors they compare themselves against is itself a question for the literature on strategic information transmission). More broadly, understanding and quantifying the mechanisms through which different types of advertising affect the "purchase funnel" is likely to benefit from the availability of detailed data sets, especially on online shopping behavior.

6.4 Search and rankings

Most of the papers discussed so far assume that the order in which consumers obtain search outcomes is either random or, in the case of differentiated products, the outcome of a consumer's optimal search procedure. However, in certain markets the order in which alternatives are searched may be largely determined by an intermediary or platform. For instance, search engines, travel agents, and comparison sites all present their search results ordered in a certain way and, as such, affect the way in which consumers search. In this section, we discuss several recent papers that study how rankings affect online search behavior. A particular challenge when estimating how rankings affect search is that rankings are endogenous. More relevant products are typically ranked higher by the intermediary. This endogeneity makes it difficult to estimate the causal effect of rankings



CHAPTER 4 Empirical search and consideration sets

on search: more relevant products are both ranked higher and more likely to be clicked or purchased, so a naive estimate conflates the effect of position with that of relevance or quality. Ursu (2018) deals with this simultaneity problem by using data from a field experiment run by the online travel agent Expedia. Specifically, she compares click and purchase decisions of consumers who were shown either Expedia's standard hotel ranking or a random ranking. Her findings suggest that rankings affect consumers' search decisions in both settings, but conditional purchase decisions are only affected when hotels are ranked according to the Expedia ranking. This finding implies that the position effect of rankings is overestimated when rankings are not randomly generated. De los Santos and Koulayev (2017) focus on the intermediary's decision of how to rank products. The authors propose a ranking method that optimizes click-through rates: it exploits the fact that, even though the intermediary typically knows very little about the characteristics of its consumers, it does observe search refinement actions as well as other search actions. De los Santos and Koulayev (2017) find that their proposed ranking method almost doubles click-through rates for a hotel booking website in comparison to the website's default ranking. Using an analytical model, Ursu and Dzyabura (2018) also study how intermediaries should rank products to maximize searches or sales. The authors incorporate the fact that search costs increase for lower-ranked products. Contrary to common practice, Ursu and Dzyabura (2018) find that intermediaries should not always show the product with the highest expected utility first. Most online intermediaries give consumers the option to use search refinement tools when going through search results.
These tools allow consumers to sort and filter the initial search rankings according to certain product characteristics, and can therefore significantly alter how products are ranked in comparison to the initial search results. Chen and Yao (2017) develop a sequential search model that incorporates consumers' search refinement decisions. Their model is estimated using clickstream data from a hotel booking website. A key finding is that refinement tools result in 33% more searches, leading to a 17% higher utility of purchased products. The intersection of the research areas on rankings and consumer search offers plenty of opportunities for future work. For example, a more detailed investigation of the different types of rankings (e.g. a search engine for information, an intermediary offering products by different sellers, a retailer selling its own products) guided by different goals (e.g. maximizing click-through, sales, or profits) might reveal different optimal rankings. Further, do ranking entities want consumers to search more or less? And lastly, some ranking entities offer many filtering and refinement tools, while others offer few or none. The optimality of these decisions and the reasons behind them remain open questions.

6.5 Information provision

The Internet provides a unique environment in which companies can relatively easily and inexpensively change which and how much information is accessible to consumers and, through these website design decisions, influence consumer search behavior.

6 Current directions

Gu (2016) develops a structural model of consumer search describing consumer behavior on the outer and the inner layer of a website. For example, on a hotel booking website, the outer layer refers to the hotel search results page and the inner layer refers to the individual hotel pages. Gu (2016) studies how the amount of information (measured by entropy) displayed on each layer affects consumer search behavior. The amount of information on the outer layer influences the likelihood of searching on the inner layer, i.e. clicking on the hotel page. More information on the outer layer reduces the need to visit the inner layer. At the same time, more information on the outer layer makes that layer more complex to process, which could decrease the likelihood of consumers searching. Thus there is a cost and a benefit to information on each layer. Gardete and Antill (2019) propose a dynamic model in which consumers search over alternatives with multiple characteristics. They apply their model to click-stream data from the website of a used car seller. In counterfactual exercises, Gardete and Antill (2019) predict the effects of different website designs and find that the amount of information and the characteristics of the information shown to consumers upfront affect search and conversion rates. More research is needed to understand how different information provision strategies affect consumer search and equilibrium outcomes. This is a domain where the burgeoning recent theoretical literature on information design (e.g. Kamenica and Gentzkow, 2011) may provide guideposts for empirical model building.
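Gu (2016) measures the amount of displayed information by its entropy. As a minimal illustration (the star-rating data below are hypothetical, not from the paper), the Shannon entropy of the empirical distribution of values shown on a page captures how dispersed, and hence how informative, the display is:

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy + 0.0  # normalize -0.0 to 0.0 for the degenerate case

# Hypothetical outer-layer display: hotel star ratings on a results page.
dispersed = [1, 2, 3, 4, 5]   # maximally dispersed ratings
identical = [4, 4, 4, 4, 4]   # identical ratings carry no information

print(shannon_entropy(dispersed))  # log2(5) ≈ 2.32 bits
print(shannon_entropy(identical))  # 0.0 bits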

6.6 Granular search data

In most studies that use online search data, the search data are either aggregated to the domain level or restricted to a single retailer. For instance, the comScore data used in De los Santos et al. (2012) and De los Santos (2018) only allow the researcher to observe which domains have been visited, but not the browsing activity within a specific domain. This data limitation means that search is only partially observed, which could lead to biased estimates of search costs. Bronnenberg et al. (2016) use a much more detailed data set in which browsing is available at the URL level. Their data track consumers' online searches across and within different websites and are used to provide a detailed description of how consumers search when buying digital cameras. Their main finding is that consumers focus on a very small attribute space during the search and purchase process. This pattern can be interpreted as supporting the structural demand model assumption that consumers have the same utility function during search and purchase. Moreover, consumers search more extensively than previous studies with more aggregate search data have found: consumers search, on average, 14 times prior to buying a digital camera. It is typically much easier to obtain search data for online than for brick-and-mortar environments: online browsing data can be used as a proxy for search, whereas data on how consumers move within and across brick-and-mortar stores are typically not available. Seiler and Pinna (2017) use data obtained from radio-frequency identification tags attached to shopping carts to infer how consumers search within




a grocery store.34 Specifically, the data tell them how much time consumers spend in front of a shelf, which is then used as a proxy for search duration. Using consumers' walking speed and basket size as instruments for search duration, the authors find that each additional minute spent searching lowers the price paid by $2.10. In a related paper, Seiler and Yao (2017) use similar path-tracking data to study the impact of advertising. They find that, even though advertising leads to more sales, it does not bring more consumers to the category being advertised. This finding suggests that advertising increases conversion only conditional on a category visit. Moreover, Seiler and Yao (2017) find that search duration, i.e. time spent in front of a shelf, is not affected by advertising. This emerging literature shows that access to increasingly detailed data on consumers' actions on and across websites can give researchers a close look at how consumers are actually searching for products. Further attempts to connect these exciting data sets to existing theoretical work on search, and collaboration between theorists and empiricists to build search models that better describe empirical patterns, appear to be avenues for fruitful future research.
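The instrumental-variables logic of this design can be sketched on simulated data. In the sketch below, all magnitudes and the data-generating process are hypothetical (chosen so that the true causal effect of an extra minute of search is −2.1): search duration is endogenous because an unobserved bargain-hunting taste drives both duration and the price paid, and walking speed and basket size serve as instruments in a textbook two-stage least squares estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical data-generating process: "taste" is an unobserved confounder.
taste = rng.standard_normal(n)
walking_speed = rng.standard_normal(n)   # instrument: shifts duration only
basket_size = rng.standard_normal(n)     # instrument: shifts duration only
duration = (0.5 * taste - 0.8 * walking_speed + 0.4 * basket_size
            + rng.standard_normal(n))
price = 10.0 - 2.1 * duration - 1.0 * taste + rng.standard_normal(n)

def two_sls(y, X, Z):
    """Two-stage least squares: regress y on the projection of X onto Z."""
    P = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)   # first-stage fitted values
    return np.linalg.solve(P.T @ X, P.T @ y)    # second-stage coefficients

const = np.ones(n)
X = np.column_stack([const, duration])
Z = np.column_stack([const, walking_speed, basket_size])
beta_iv = two_sls(price, X, Z)
print("IV estimate of the duration effect:", beta_iv[1])  # ≈ -2.1
```

An OLS regression of price on duration would be biased here because taste enters both equations; the instruments move duration without entering the price equation directly, which is the exclusion restriction Seiler and Pinna (2017) invoke for walking speed and basket size.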

6.7 Search duration

In many instances, consumers are observed to re-visit the same seller multiple times before making a purchase – a pattern that cannot be rationalized by the standard simultaneous and sequential search models (see Section 2). To explain this empirical pattern, Ursu et al. (2018) propose that consumers only partially resolve their uncertainty through a single search, thus necessitating multiple searches of the same product if the consumer desires more precise information before making a purchase decision. Practically, Ursu et al. (2018) combine approaches from two streams of literature: the literature on consumer learning (using Bayesian updating) and the consumer search literature (more specifically, sequential search). Suppose a consumer wants to resolve the uncertainty about his match value with a vacation package. He has some prior belief or expectation of the match value. After searching once, e.g. by reading a review, the consumer receives a signal about the match value and updates his belief about it. The more the consumer searches, i.e. the more signals he receives, the less uncertainty he has about the match value. Thus, at each point in time during the search process, intuitively speaking, the consumer decides whether to search a previously unsearched option or to spend another search on an already-searched option – allowing the model to capture the empirical observation of re-visits. A crucial question in this context relates to the characterization of optimal behavior in such a model of sequential search, i.e. how do consumers optimally decide

34 An alternative way to capture search behavior is by using eye-tracking equipment. For instance,

Stüttgen et al. (2012) use shelf images to analyze search behavior for grocery products and test whether consumers are using satisficing decision rules (see also Shi et al., 2013 for a study that analyzes information acquisition using eye-tracking data in an online environment).


which product to search next (including the same one), when to stop, and which one to purchase. For the standard sequential search model, Weitzman (1979) developed the well-known selection, stopping, and choice rules. For the model of sequential search with re-visits, Ursu et al. (2018) point to Chick and Frazier (2012), who developed analogous selection, stopping, and choice rules. The authors take the theoretical results by Chick and Frazier (2012) and apply them within their empirical context, a restaurant review site. Ursu et al. (2018) observe browsing behavior on this website and assume that a unit of search is represented by spending one minute on a restaurant page. Using this definition, the authors document that both the extensive and the intensive margins of search matter. They find that consumers search very few restaurants, but those that are searched are searched extensively. Studying search intensity is a new research subfield with plenty of opportunities for future research. Search intensity plays an important role not only in the online, but also in the offline environment. For example, many consumers take multiple test drives with the same car before making a purchase decision or look at a piece of furniture multiple times before buying it. These are big-ticket items, and a better understanding of the search process might help both consumers and sellers. From a methodological perspective, collaboration between theorists and empiricists to rationalize search intensity as well as search sequences appears to be an interesting and relatively unexplored economic modeling challenge.
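For reference, Weitzman's (1979) selection, stopping, and choice rules for the standard sequential search model (without re-visits) can be sketched as follows. The sketch assumes normally distributed rewards with known parameters; the option parameters at the bottom are hypothetical.

```python
import math
import random

def reservation_value(mu, sigma, cost):
    """Solve cost = E[max(X - z, 0)] for z, with X ~ N(mu, sigma), by bisection."""
    def expected_gain(z):
        a = (z - mu) / sigma
        pdf = math.exp(-a * a / 2) / math.sqrt(2 * math.pi)
        cdf = 0.5 * (1 + math.erf(a / math.sqrt(2)))
        return sigma * (pdf - a * (1 - cdf))  # decreasing in z
    lo, hi = mu - 10 * sigma, mu + 10 * sigma
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if expected_gain(mid) > cost else (lo, mid)
    return (lo + hi) / 2

def weitzman_search(options, rng=None):
    """options: list of (mu, sigma, cost). Returns (searched indices, best draw)."""
    rng = rng or random.Random(0)
    z = [reservation_value(m, s, c) for m, s, c in options]
    order = sorted(range(len(options)), key=lambda i: -z[i])  # selection rule
    best, searched = -math.inf, []
    for i in order:
        if best >= z[i]:   # stopping rule: best draw beats every remaining z
            break
        m, s, _ = options[i]
        searched.append(i)
        best = max(best, rng.gauss(m, s))
    return searched, best  # choice rule: purchase the best sampled option

opts = [(0.0, 1.0, 0.1), (0.5, 1.0, 0.5), (0.2, 2.0, 0.3)]
searched, best = weitzman_search(opts)
print("searched:", searched, "best draw:", round(best, 2))
```

Ursu et al. (2018) replace these rules with the Chick and Frazier (2012) analogues, under which a search yields only a noisy signal and an already-searched option may optimally be sampled again.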

6.8 Dynamic search

All papers discussed so far assume static consumer behavior. This assumption is largely driven by data constraints: most papers have access to cross-sectional individual-level data or aggregate data. However, in many instances, consumers are observed to conduct multiple search spells over the course of a month or a year. For example, consumers might search for yoghurt once a week or for laundry detergent once a month. Seiler (2013) develops a structural model for storable goods in which it is costly for consumers to search, i.e. consumers have limited information. The model is estimated using data on purchases of laundry detergent. Seiler (2013) models the buying process as consisting of two stages: in the first stage, the consumer decides whether to search. Search is of the all-or-nothing type, i.e. the consumer can either search and obtain information on all products or not search at all. Only consumers who have searched can purchase. In the second stage, which is similar to a standard dynamic demand model for storable goods, the consumer decides which product to buy. Dynamic models of demand are typically computationally burdensome, and adding an extra layer in which consumers make search decisions makes the model even more complex. To obtain closed-form solutions for the value functions in the search and purchase stages, Seiler (2013) adds a separate set of errors to each stage, which are unknown to the consumer before entering that stage. Moreover, he only allows for limited preference heterogeneity, and the model does not allow price expectations to be affected by past price realizations, which can be a limitation for some markets.




Seiler (2013) finds that search frictions play an important role in the market for laundry detergent: consumers are unaware of prices in 70% of shopping trips. Further, lowering search costs by 50% would increase the price elasticity of demand from −2.21 to −6.56. Pires (2016) builds on Seiler (2013), but instead of all-or-nothing search behavior, Pires (2016) allows consumers to determine which set of products to inspect using a simultaneous search strategy. The author only has access to data on purchases and adds a choice-set-specific error term to the one-period flow utility from searching to deal with the curse of dimensionality of the simultaneous search model. The error term follows an EV Type I distribution and can be integrated out to obtain closed-form expressions. Pires (2016) finds that search costs are substantial. Further, the effects of ignoring search on price elasticities depend on how often a product appears in consumers' consideration sets. Since many products are purchased multiple times, understanding search behavior across purchase instances is clearly an important avenue for further research. This is an area where models of search come into contact with models of switching costs or customer inertia; we thus anticipate further work at this important interface.
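The computational convenience of the EV Type I assumption can be illustrated with a stylized all-or-nothing search decision in the spirit of Seiler (2013). The utilities and search cost below are hypothetical, and the purchase stage is simplified to picking the highest deterministic utility: with i.i.d. EV Type I shocks, the expected maximum utility from becoming fully informed has the closed-form log-sum-exp ("inclusive value") expression.

```python
import math

def inclusive_value(utilities):
    """Closed form for E[max_j (u_j + eps_j)] with i.i.d. EV Type I shocks:
    the log-sum-exp of the deterministic utilities (up to Euler's constant)."""
    m = max(utilities)
    return m + math.log(sum(math.exp(u - m) for u in utilities))

def search_and_buy(utilities, search_cost, no_search_value=0.0):
    """All-or-nothing search: pay `search_cost` to learn all products (and then
    buy the best one) only if the expected gain exceeds not searching at all."""
    if inclusive_value(utilities) - search_cost <= no_search_value:
        return None  # stay uninformed; no purchase this trip
    # Simplification for this sketch: in the full model, realized EV shocks
    # determine the purchase; here we take the highest deterministic utility.
    return max(range(len(utilities)), key=lambda j: utilities[j])

# Hypothetical utilities for three detergents on a given shopping trip:
print(search_and_buy([0.2, 0.8, -0.1], search_cost=0.5))  # 1 (searches, buys)
print(search_and_buy([0.2, 0.8, -0.1], search_cost=5.0))  # None (too costly)
```

This is the same device Pires (2016) uses at the consideration-set level: the EV error attached to each choice set integrates out to a log-sum-exp, keeping an otherwise high-dimensional problem tractable.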

7 Conclusions

Although the theoretical literature on search is almost sixty years old, empirical and econometric work in the area continues to develop at a very fast pace thanks to the ever-increasing availability of data on actual search behavior. Abundant data on consumers' web searching/browsing/shopping behavior have become available through multiple channels on the Internet, along with the ability to utilize field experiments and randomized controlled trials. The availability of data is not restricted to the online/e-commerce domain; there are increasingly rich data on consumer shopping patterns across and within physical retail stores. Data on consumer search and choice patterns are also becoming readily available in finance, insurance, and energy markets. On the public side of the economy, too, search and choice data are becoming increasingly available (e.g. in educational choices – Kapor, 2016, health insurance – Ketcham et al., 2015, and public housing – van Dijk, 2019), creating more applications of econometric models that incorporate search and consideration, and also bringing methodological challenges of their own. While our focus was on econometric methods to test between existing theories of search and to estimate the structural parameters of these models, the abundance and complexity of search data also push search theory to new frontiers. We also note that, while models of optimizing agents with rational expectations have been used fruitfully to make sense of search data, “behavioral” theories of search that combine insights from both economics and psychology may prove very fruitful in explaining observed patterns in the data. The mechanisms through which consumers search for and find the products they eventually purchase are important determinants of equilibrium outcomes in product


markets. While most empirical search models have focused on the demand side, how firms optimize in the presence of consumer search needs further investigation. As we noted in Section 6.5, how firms choose to design their websites or advertising strategies to aid/direct search is a very important and relatively unexplored area of research, at least from the perspective of economics. Firms spend a lot of time and resources to design and optimize their websites or mobile interfaces. Quantifying how these design decisions affect market outcomes requires careful modeling of the (strategic) interaction between consumer and firm behavior. Another interesting area of research, highlighted in Section 3.1, is to investigate whether and how one can identify the presence of information/search frictions using conventional data sets (on prices, quantities, and product attributes) that do not contain information on search or credible variation in consumer information. Efforts to supplement such conventional data sets with data on actual search, or with credible shifters of search and consumer information, will help enrich the analyses that can be done with them.

References

Abaluck, Jason, Adams, Abi, 2018. What Do Consumers Consider Before They Choose? Identification from Asymmetric Demand Responses. Working Paper. Yale University.
Allen, Jason, Clark, Robert, Houde, Jean-François, 2013. The effect of mergers in search markets: evidence from the Canadian mortgage industry. The American Economic Review 104 (10), 3365–3396.
Allen, Jason, Clark, Robert, Houde, Jean-François, 2018. Search frictions and market power in negotiated price markets. Journal of Political Economy. Forthcoming.
Allenby, Greg, Ginter, James, 1995. The effects of in-store displays and feature advertising on consideration sets. International Journal of Research in Marketing 12 (1), 67–80.
Anderson, Simon, Renault, Régis, 2018. Firm pricing with consumer search. In: Corchón, L., Marini, M. (Eds.), Handbook of Game Theory and Industrial Organization, vol. 2. Edward Elgar, pp. 177–224.
Andrews, Rick, Srinivasan, T., 1995. Studying consideration effects in empirical choice models using scanner panel data. Journal of Marketing Research 32 (1), 30–41.
Aribarg, Anocha, Otter, Thomas, Zantedeschi, Daniel, Allenby, Greg, Bentley, Taylor, Curry, David, Dotson, Marc, Henderson, Ty, Honka, Elisabeth, Kohli, Rajeev, Jedidi, Kamel, Seiler, Stephan, Wang, Xin, 2018. Advancing non-compensatory choice models in marketing. Customer Needs and Solutions 5 (1), 82–91.
Armstrong, Mark, 2017. Ordered consumer search. Journal of the European Economic Association 15 (5), 989–1024.
Ausubel, Lawrence, 1991. The failure of competition in the credit card market. The American Economic Review 81 (1), 50–81.
Baye, Michael, Morgan, John, Scholten, Patrick, 2006. Information, search, and price dispersion. In: Handbook on Economics and Information Systems, vol. 1, pp. 323–377.
Ben-Akiva, Moshe, Lerman, Steven, 1985. Discrete Choice Analysis. MIT Press, Cambridge, MA.
Berry, Steven, 1994. Estimating discrete-choice models of product differentiation. The Rand Journal of Economics 25 (2), 242–262.
Berry, Steven, Levinsohn, James, Pakes, Ariel, 1995. Automobile prices in market equilibrium. Econometrica 63 (4), 841–890.
Berry, Steven, Levinsohn, James, Pakes, Ariel, 2004. Differentiated products demand systems from a combination of micro and macro data: the new car market. Journal of Political Economy 112 (1), 68–105.




Bikhchandani, Sushil, Sharma, Sunil, 1996. Optimal search with learning. Journal of Economic Dynamics and Control 20 (1–3), 333–359.
Blevins, Jason, Senney, Garrett, 2019. Dynamic selection and distributional bounds on search costs in dynamic unit-demand models. Quantitative Economics. Forthcoming.
Bronnenberg, Bart, Kim, Jun, Mela, Carl, 2016. Zooming in on choice: how do consumers search for cameras online? Marketing Science 35 (5), 693–712.
Bronnenberg, Bart, Vanhonacker, Wilfried, 1996. Limited choice sets, local price response, and implied measures of price competition. Journal of Marketing Research 33 (2), 163–173.
Brown, Jeffrey, Goolsbee, Austan, 2002. Does the Internet make markets more competitive? Evidence from the life insurance industry. Journal of Political Economy 110 (3), 481–507.
Burdett, Kenneth, Judd, Kenneth, 1983. Equilibrium price dispersion. Econometrica 51 (4), 955–969.
Carlson, John, Preston McAfee, R., 1983. Discrete equilibrium price dispersion. Journal of Political Economy 91 (3), 480–493.
Chade, Hector, Smith, Lones, 2005. Simultaneous Search. Working Paper No. 1556. Department of Economics, Yale University.
Chade, Hector, Smith, Lones, 2006. Simultaneous search. Econometrica 74 (5), 1293–1307.
Chen, X., Hong, Han, Shum, Matt, 2007. Nonparametric likelihood ratio model selection tests between parametric likelihood and moment condition models. Journal of Econometrics 141 (1), 109–140.
Chen, Yuxin, Yao, Song, 2017. Sequential search with refinement: model and application with click-stream data. Management Science 63 (12), 4345–4365.
Chiang, Jeongwen, Chib, Siddhartha, Narasimhan, Chakravarthi, 1999. Markov chain Monte Carlo and models of consideration set and parameter heterogeneity. Journal of Econometrics 89 (1–2), 223–248.
Chick, Stephen, Frazier, Peter, 2012. Sequential sampling with economics of selection procedures. Management Science 58 (3), 550–569.
Ching, Andrew, Erdem, Tulin, Keane, Michael, 2009. The price consideration model of brand choice. Journal of Applied Econometrics 24 (3), 393–420.
Choi, Michael, Dai, Anovia Yifan, Kim, Kyungmin, 2018. Consumer search and price competition. Econometrica 86 (4), 1257–1281.
Clay, Karen, Krishnan, Ramayya, Wolff, Eric, 2001. Prices and price dispersion on the web: evidence from the online book industry. Journal of Industrial Economics 49 (4), 521–539.
Clemons, Eric K., Hann, Il-Horn, Hitt, Lorin M., 2002. Price dispersion and differentiation in online travel: an empirical investigation. Management Science 48 (4), 534–549.
Crawford, Gregory, Griffith, Rachel, Iaria, Alessandro, 2018. Preference Estimation with Unobserved Choice Set Heterogeneity Using Sufficient Sets. Working Paper. University of Zurich.
De los Santos, Babur, 2018. Consumer search on the Internet. International Journal of Industrial Organization 58, 66–105.
De los Santos, Babur, Hortaçsu, Ali, Wildenbeest, Matthijs R., 2012. Testing models of consumer search using data on web browsing and purchasing behavior. The American Economic Review 102 (6), 2955–2980.
De los Santos, Babur, Hortaçsu, Ali, Wildenbeest, Matthijs R., 2017. Search with learning for differentiated products: evidence from E-commerce. Journal of Business & Economic Statistics 35 (4), 626–641.
De los Santos, Babur, Koulayev, Sergei, 2017. Optimizing click-through in online rankings with endogenous search refinement. Marketing Science 36 (4), 542–564.
Dong, Xiaojing, Morozov, Ilya, Seiler, Stephan, Hou, Liwen, 2018. Estimation of Preference Heterogeneity in Markets with Costly Search. Working Paper. Stanford University.
Donna, Javier D., Pereira, Pedro, Pires, Pedro, Trindade, Andre, 2018. Measuring the Welfare of Intermediaries in Vertical Markets. Working Paper. The Ohio State University.
Duffie, Darrell, Dworczak, Piotr, Zhu, Haoxiang, 2017. Benchmarks in search markets. The Journal of Finance 72 (5), 1983–2044.
Dzyabura, Daria, Hauser, John R., 2017. Recommending Products When Consumers Learn Their Preference Weights. Working Paper. New York University.
Elberg, Andres, Gardete, Pedro, Macera, Rosario, Noton, Carlos, 2019. Dynamic effects of price promotions: field evidence, consumer search, and supply-side implications. Quantitative Marketing and Economics 17 (1), 1–58.


Ershov, Daniel, 2018. The Effects of Consumer Search Costs on Entry and Quality in the Mobile App Market. Working Paper. Toulouse School of Economics.
Fader, Peter, McAlister, Leigh, 1990. An elimination by aspects model of consumer response to promotion calibrated on UPC scanner data. Journal of Marketing Research 27 (3), 322–332.
Gardete, Pedro, Antill, Megan, 2019. Guiding Consumers Through Lemons and Peaches: A Dynamic Model of Search over Multiple Attributes. Working Paper. Stanford University.
Gilbride, Timothy, Allenby, Greg, 2004. A choice model with conjunctive, disjunctive, and compensatory screening rules. Marketing Science 23 (3), 391–406.
Goeree, Michelle Sovinsky, 2008. Limited information and advertising in the US personal computer industry. Econometrica 76 (5), 1017–1074.
Grennan, Matthew, Swanson, Ashley, 2018. Transparency and Negotiated Prices: The Value of Information in Hospital-Supplier Bargaining. Working Paper. University of Pennsylvania.
Gu, Chris, 2016. Consumer Online Search with Partially Revealed Information. Working Paper. Georgia Tech.
Gumbel, Emil, 1961. Bivariate logistic distributions. Journal of the American Statistical Association 56 (294), 335–349.
Harrison, Glenn, Morgan, Peter, 1990. Search intensity in experiments. The Economic Journal 100 (401), 478–486.
Häubl, Gerald, Dellaert, Benedict, Donkers, Bas, 2010. Tunnel vision: local behavioral influences on consumer decisions in product search. Marketing Science 29 (3), 438–455.
Hauser, John, Wernerfelt, Birger, 1990. An evaluation cost model of consideration sets. Journal of Consumer Research 16 (4), 393–408.
Hitsch, Guenter, Hortaçsu, Ali, Lin, Xiliang, 2017. Prices and Promotions in U.S. Retail Markets: Evidence from Big Data. Working Paper. University of Chicago.
Hong, Han, Shum, Matthew, 2006. Using price distributions to estimate search costs. The Rand Journal of Economics 37 (2), 257–275.
Honka, Elisabeth, 2014. Quantifying search and switching costs in the U.S. auto insurance industry. The Rand Journal of Economics 45 (4), 847–884.
Honka, Elisabeth, Chintagunta, Pradeep, 2017. Simultaneous or sequential? Search strategies in the U.S. auto insurance industry. Marketing Science 36 (1), 21–40.
Honka, Elisabeth, Hortaçsu, Ali, Vitorino, Maria Ana, 2017. Advertising, consumer awareness, and choice: evidence from the U.S. banking industry. The Rand Journal of Economics 48 (3), 611–646.
Hortaçsu, Ali, Syverson, Chad, 2004. Product differentiation, search costs, and competition in the mutual fund industry: a case study of S&P 500 index funds. The Quarterly Journal of Economics 119 (2), 403–456.
Hoxby, Caroline, Turner, Sarah, 2015. What high-achieving low-income students know about college. The American Economic Review 105 (5), 514–517.
Janssen, Maarten, Moraga-Gonzalez, Jose Luis, Wildenbeest, Matthijs, 2005. Truly costly sequential search and oligopolistic pricing. International Journal of Industrial Organization 23 (5–6), 451–466.
Jindal, Pranav, Aribarg, Anocha, 2018. The Importance of Price Beliefs in Consumer Search. Working Paper. University of North Carolina.
Johnson, Norman, Kotz, Samuel, Balakrishnan, N., 1995. Continuous Univariate Distributions, 2nd ed. John Wiley and Sons Inc.
Jolivet, Grégory, Turon, Hélène, 2018. Consumer search costs and preferences on the Internet. The Review of Economic Studies. Forthcoming.
Kamenica, Emir, Gentzkow, Matthew, 2011. Bayesian persuasion. The American Economic Review 101 (6), 2590–2615.
Kapor, Adam, 2016. Distributional Effects of Race-Blind Affirmative Action. Working Paper. Princeton University.
Kawaguchi, Kohei, Uetake, Kosuke, Watanabe, Yasutora, 2018. Designing Context-Based Marketing: Product Recommendations Under Time Pressure. Working Paper. Hong Kong University of Science and Technology.




Ketcham, Jonathan, Lucarelli, Claudio, Powers, Christopher, 2015. Paying attention or paying too much in Medicare part D. The American Economic Review 105 (1), 204–233.
Kim, Jun, Albuquerque, Paulo, Bronnenberg, Bart, 2010. Online demand under limited consumer search. Marketing Science 29 (6), 1001–1023.
Kim, Jun, Albuquerque, Paulo, Bronnenberg, Bart, 2017. The probit choice model under sequential search with an application to online retailing. Management Science 63 (11), 3911–3929.
Kircher, Philipp, 2009. Efficiency of simultaneous search. Journal of Political Economy 117 (5), 861–913.
Koulayev, Sergei, 2013. Search with Dirichlet priors: estimation and implications for consumer demand. Journal of Business & Economic Statistics 31 (2), 226–239.
Koulayev, Sergei, 2014. Search for differentiated products: identification and estimation. The Rand Journal of Economics 45 (3), 553–575.
McCall, John, 1970. Economics of information and job search. The Quarterly Journal of Economics 84 (1), 113–126.
McFadden, Daniel, 1989. A method of simulated moments for estimation of discrete response models without numerical integration. Econometrica 57 (5), 995–1026.
Mehta, Nitin, Rajiv, Surendra, Srinivasan, Kannan, 2003. Price uncertainty and consumer search: a structural model of consideration set formation. Marketing Science 22 (1), 58–84.
Moraga-González, José Luis, Wildenbeest, Matthijs, 2008. Maximum likelihood estimation of search costs. European Economic Review 52 (5), 820–848.
Moraga-González, José Luis, Sándor, Zsolt, Wildenbeest, Matthijs R., 2013. Semi-nonparametric estimation of consumer search costs. Journal of Applied Econometrics 28, 1205–1223.
Moraga-González, José Luis, Sándor, Zsolt, Wildenbeest, Matthijs, 2015. Consumer Search and Prices in the Automobile Market. Working Paper. Indiana University.
Moraga-González, José Luis, Sándor, Zsolt, Wildenbeest, Matthijs, 2018. Consumer Search and Prices in the Automobile Market. Working Paper. Indiana University.
Morgan, Peter, Manning, Richard, 1985. Optimal search. Econometrica 53 (4), 923–944.
Murry, Charles, Zhou, Yiyi, 2018. Consumer search and automobile dealer co-location. Management Science. Forthcoming.
Nishida, Mitsukuni, Remer, Marc, 2018. The determinants and consequences of search cost heterogeneity: evidence from local gasoline markets. Journal of Marketing Research 55 (3), 305–320.
Petrin, Amil, Train, Kenneth, 2010. A control function approach to endogeneity in consumer choice models. Journal of Marketing Research 47 (1), 3–13.
Pires, Tiago, 2016. Costly search and consideration sets in storable goods markets. Quantitative Marketing and Economics 14 (3), 157–193.
Ratchford, Brian, 1980. The value of information for selected appliances. Journal of Marketing Research 17 (1), 14–25.
Reinganum, Jennifer, 1979. A simple model of equilibrium price dispersion. Journal of Political Economy 87 (4), 851–858.
Roberts, John, Lattin, James, 1991. Development and testing of a model of consideration set composition. Journal of Marketing Research 28 (4), 429–440.
Roberts, John, Lattin, James, 1997. Consideration: review of research and prospects for future insights. Journal of Marketing Research 34 (3), 406–410.
Rosenfield, Donald, Shapiro, Roy, 1981. Optimal adaptive price search. Journal of Economic Theory 25 (1), 1–20.
Rothschild, Michael, 1974. Searching for the lowest price when the distribution of prices is unknown. Journal of Political Economy 82 (4), 689–711.
Roussanov, Nikolai, Ryan, Hongxun, Wei, Yanhao, 2018. Marketing Mutual Funds. Working Paper. University of Pennsylvania.
Salz, Tobias, 2017. Intermediation and Competition in Search Markets: An Empirical Case Study. Working Paper. Columbia University.
Sanches, Fabio, Junior, Daniel Silva, Srisuma, Sorawoot, 2018. Minimum distance estimation of search costs using price distribution. Journal of Business & Economic Statistics 36 (4).


Seiler, Stephan, 2013. The impact of search costs on consumer behavior: a dynamic approach. Quantitative Marketing and Economics 11 (2), 155–203. Seiler, Stephan, Pinna, Fabio, 2017. Estimating search benefits from path-tracking data: measurement and determinants. Marketing Science 36 (4), 565–589. Seiler, Stephan, Yao, Song, 2017. The impact of advertising along the conversion funnel. Quantitative Marketing and Economics 15, 241–278. Shi, Savannah, Wedel, Michel, Pieters, Rik, 2013. Information acquisition during online decision making: a model-based exploration using eye-tracking data. Management Science 59 (5), 1009–1026. Shocker, Allen, Ben-Akiva, Moshe, Boccara, Bruno, Nedungadi, Prakash, 1991. Consideration set influences on consumer decision-making and choice: issues, models, and suggestions. Marketing Letters 2 (3), 181–197. Siddarth, S., Bucklin, Randolph, Morrison, Donald, 1995. Making the cut: modeling and analyzing choice set restriction in scanner panel data. Journal of Marketing Research 32 (3), 255–266. Sorensen, Alan, 2000. Equilibrium price dispersion in retail markets for prescription drugs. Journal of Political Economy 108 (4), 833–850. Stigler, George, 1961. The economics of information. Journal of Political Economy 69 (3), 213–225. Stüttgen, Peter, Boatwright, Peter, Monroe, Robert, 2012. A satisficing choice model. Marketing Science 31 (6), 878–899. Terui, Nobuhiko, Ban, Masataka, Allenby, Greg, 2011. The effect of media advertising on brand consideration and choice. Marketing Science 30 (1), 74–91. Tsai, Yi-Lin, Honka, Elisabeth, 2018. Non-Informational Advertising Informing Consumers: How Advertising Affects Consumers’ Decision-Making in the U.S. Auto Insurance Industry. Working Paper. UCLA. Ursu, Raluca, 2018. The power of rankings: quantifying the effect of rankings on online consumer search and purchase decisions. Marketing Science 37 (4), 530–552. Ursu, Raluca, Dzyabura, Daria, 2018. Product Rankings with Consumer Search. Working Paper. 
New York University. Ursu, Raluca, Wang, Qingliang, Chintagunta, Pradeep, 2018. Search Duration. Working Paper. New York University. van Dijk, Winnie, 2019. The Socio-Economic Consequences of Housing Assistance. Working Paper. University of Chicago. Van Nierop, Erjen, Bronnenberg, Bart, Paap, Richard, Wedel, Michel, Franses, Philip Hans, 2010. Retrieving unobserved consideration sets from household panel data. Journal of Marketing Research 47 (1), 63–74. Vishwanath, Tara, 1992. Parallel search for the best alternative. Economic Theory 2 (4), 495–507. Weitzman, Martin, 1979. Optimal search for the best alternative. Econometrica 47 (3), 641–654. Wildenbeest, Matthijs, 2011. An empirical model of search with vertically differentiated products. The Rand Journal of Economics 42 (4), 729–757. Wolinsky, Asher, 1986. True monopolistic competition as a result of imperfect information. The Quarterly Journal of Economics 101 (3), 493–512. Woodward, Susan, Hall, Robert, 2012. Diagnosing consumer confusion and sub-optimal shopping effort: theory and mortgage-market evidence. The American Economic Review 102 (7), 3249–3276. Yao, Song, Wang, Wenbo, Chen, Yuxin, 2017. TV channel search and commercial breaks. Journal of Marketing Research 54 (5), 671–686. Zettelmeyer, Florian, Morton, Fiona Scott, Silva-Risso, Jorge, 2006. How the Internet lowers prices: evidence from matched survey and automobile transaction data. Journal of Marketing Research 43 (2), 168–181. Zhang, Jie, 2006. An integrated choice model incorporating alternative mechanisms for consumers’ reactions to in-store display and feature advertising. Marketing Science 25 (3), 278–290. Zhang, Xing, Chan, Tat, Xie, Ying, 2018. Price search and periodic price discounts. Management Science 64 (2), 495–510.



CHAPTER 5

Digital marketing✩

Avi Goldfarb (Rotman School of Management, University of Toronto, Toronto, ON, Canada; NBER, Cambridge, MA, United States) and Catherine Tucker (MIT Sloan School of Management, Cambridge, MA, United States; NBER, Cambridge, MA, United States)

Corresponding author: Avi Goldfarb, e-mail address: [email protected]

Contents

1 Reduction in consumer search costs and marketing
  1.1 Pricing: Are prices and price dispersion lower online?
  1.2 Placement: How do low search costs affect channel relationships?
  1.3 Product: How do low search costs affect product assortment?
  1.4 Promotion: How do low search costs affect advertising?
2 The replication cost of digital goods is zero
  2.1 Pricing: How can non-rival digital goods be priced profitably?
  2.2 Placement: How do digital channels – some of which are illegal – affect the ability of information good producers to distribute profitably?
  2.3 Product: What are the motivations for providing digital products given their non-excludability?
  2.4 Promotion: What is the role of aggregators in promoting digital goods?
3 Lower transportation costs
  3.1 Placement: Does channel structure still matter if transportation costs are near zero?
  3.2 Product: How do low transportation costs affect product variety?
  3.3 Pricing: Does pricing flexibility increase because transportation costs are near zero?
  3.4 Promotion: What is the role of location in online promotion?
4 Lower tracking costs
  4.1 Promotion: How do low tracking costs affect advertising?
  4.2 Pricing: Do lower tracking costs enable novel forms of price discrimination?
  4.3 Product: How do markets where the customer’s data is the ‘product’ lead to privacy concerns?
  4.4 Placement: How do lower tracking costs affect channel management?
5 Reduction in verification costs
  5.1 Pricing: How willingness to pay is bolstered by reputation mechanisms
  5.2 Product: Is a product’s ‘rating’ now an integral product feature?


✩ This builds heavily on our Journal of Economic Literature paper, ‘Digital Economics.’

Handbook of the Economics of Marketing, Volume 1, ISSN 2452-2619, https://doi.org/10.1016/bs.hem.2019.04.004. Copyright © 2019 Elsevier B.V. All rights reserved.




  5.3 Placement: How can channels reduce reputation system failures?
  5.4 Promotion: Can verification lead to discrimination in how goods are promoted?
6 Conclusions
References


Digital technology is the representation of information in bits, reducing the costs of collecting, storing, and parsing customer data. Such technologies span TCP/IP and other communications standards, improvements in database organization, improvements in computer memory, faster processing speeds, fiber optic cable, wireless transmission, and advances in statistical reasoning. These new digital technologies can be seen as reducing the costs of certain marketing activities. Digital marketing explores how traditional areas of marketing such as pricing, promotion, product, and placement change as certain costs fall substantially, perhaps approaching zero. Using the framework in our recent summary of the digital economics literature (Goldfarb and Tucker, 2019), we emphasize shifts in five different costs in addressing the needs of customers:

1. Lower search costs for customers.
2. Lower replication costs for certain digital goods.
3. Lower transportation costs in transporting digital goods.
4. Lower tracking costs enabling personalization and targeting.
5. Lower verification costs of customers’ wishes and firms’ reputations.

We argue that each of these cost shifts affected marketing earlier and more dramatically than many other firm functions or sectors. As a consequence, marketing has become a testing lab for understanding how these shifts in costs may affect the broader economy. This link between marketing and economics is important because each of these shifts in costs draws on familiar modeling frameworks from economics. For example, the search cost literature goes back to Stigler (1961). Search costs are lower in digital environments, enabling customers to find products and firms to find customers. Non-rivalry is another key concept, as digital goods can be replicated at zero cost. Transportation cost models, such as the Hotelling model, provide a useful framework for the literature on the low cost of transportation of digital goods. Digital technologies make it easy to track any one consumer’s behavior, a theme of advertising models at least since Grossman and Shapiro (1984). Last, information models that emphasize reputation and trust help frame research showing that digitization can make verification easier. Early work in digital economics and industrial organization emphasized the role of lower costs (Shapiro and Varian, 1998; Borenstein and Saloner, 2001; Smith et al., 2001; Ellison and Ellison, 2005). Goldfarb and Tucker (2019) analyzed how these shifts have been studied in the economics literature. We aim to focus on the extent to which quantitative marketing has led, and has been the first empirical testing ground for, many of these changes. As such, we will focus on work in quantitative marketing aiming to understand the effect of technology. In doing so, we will not emphasize work from the consumer behavior literature or methodology-focused work from the marketing statistics literature. We will also not emphasize studies in the marketing strategy literature that document correlations of managerial importance rather than measuring the causal effects of digital technologies.

1 Reduction in consumer search costs and marketing

Search costs matter in marketing because they represent the costs consumers incur looking for information regarding products and services. The most important effect of lower search costs with respect to digital marketing is that it is easier to find and compare information about potential products and services online than offline. Many of the earliest reviews of the impact of the internet on the economy emphasized low search costs in the retail context (Borenstein and Saloner, 2001; Bakos, 2001) and the resulting impact on prices, price dispersion, and inventories. These papers built on a long-established economic literature on search costs (Stigler, 1961; Diamond, 1971; Varian, 1980). Recent work in marketing has examined the search process in depth, documenting the clickstream path and underlying search strategies (Bronnenberg et al., 2016; Honka and Chintagunta, 2017).

1.1 Pricing: Are prices and price dispersion lower online?

Perhaps the dominant theme in the early literature was the impact of low search costs on prices and price dispersion. Brynjolfsson and Smith (2000) hypothesized that low internet search costs should lower both prices and price dispersion. They empirically tested these ideas, comparing the prices of books and CDs online and offline. They found that online prices were lower. Similarly, Brown and Goolsbee (2002) showed that insurance prices are lower online and Orlov (2011) found that airline prices are lower online. A series of related studies (Zettelmeyer et al., 2001; Scott Morton et al., 2003; Zettelmeyer et al., 2006) showed how digitization reduced automobile prices, though not equally for all types of consumers. While prices fell, the results of the literature on price dispersion have been more mixed. Brynjolfsson and Smith (2000) show substantial price dispersion online. Nevertheless, they find that online price dispersion is somewhat lower than offline price dispersion. Baye and Morgan (2004) emphasize persistently high levels of price dispersion online. Orlov (2011) suggests that online price dispersion is higher. The persistence of price dispersion is a puzzle. Broadly, the literature gives two main answers. First, the earlier economics literature has emphasized that retailers differ, so the service provided for the same item differs across retailers. Firms with stronger brands command higher prices, though this premium has been decreasing somewhat over time (Waldfogel and Chen, 2006). This decline in the importance of brands in the digital environment, documented by Hollenbeck (2018), is related, as we discuss below, to the reduction in verification costs.




Second, as a counterpoint to the notion that there are exogenously given differences in seller quality, which formed the basis of the early economics literature, the marketing literature has emphasized the extent to which search can be influenced by the seller. In other words, in marketing we recognize that search costs are endogenous and a reflection of a firm’s marketing strategy. Honka (2014) and De los Santos et al. (2012) provide surprisingly large estimates of the cost of each click in the online context. By forcing customers to conduct an extra click or two, sellers can increase the relative cost of search in areas where they are weak. For example, a high-quality, high-price firm might make it easy to compare product quality but difficult to find prices. Chu et al. (2008) show that price sensitivity is lower in online grocery compared to offline grocery. Fradkin (2017) has shown a similar phenomenon in the context of Airbnb. A number of scholars have shown that such endogenous increases in search costs can be sustained in equilibrium (Ellison and Ellison, 2009) and profitable (Hossain and Morgan, 2006; Dinerstein et al., 2018; Moshary et al., 2017). Ellison and Ellison (2009) showed how firms can obfuscate prices. They emphasize a setting where search costs should be very low: An online price comparison website. They show that retailers that display prices on that website emphasize their relatively low-priced products. Then, when consumers click the link and arrive at the retailer’s own website, they are shown offers for higher-priced and higher-margin goods. Thus, price dispersion is low at the price comparison website where search costs are low, but dispersion is high where comparison is more difficult. More recently, Moshary et al. (2017) demonstrate the effectiveness of similar price obfuscation in the context of a massive field experiment at StubHub.
The experiment compared purchase prices and demand estimates when the service fees for using StubHub were shown early in the search process versus immediately before purchase. The experiment showed that customers were less sensitive to the same fee when it was shown late in the process. The company deliberately made some price information more difficult to find, and this increased quantity demanded at the same price. Another area where search costs are endogenous to firm marketing strategy involves the devices consumers use. Firms recognize that tablets and mobile devices, with smaller screens, may facilitate this process of restricting the information that consumers see initially (Xu et al., 2017; Ghose et al., 2013). A developing area of marketing tries to understand how best to present price information to consumers to maximize profits in this mobile environment (Andrews et al., 2015; Fong et al., 2015). The early online price dispersion literature and the more recent literature demonstrating endogenous online search costs show where a close study of marketing contexts has been able to add nuance to a puzzle noted in the economics literature, by exploring how firms can increase search costs for consumers in a digital environment.
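The search-theoretic logic running through this subsection can be made concrete with a textbook McCall-style sequential-search calculation. This is our own stylized sketch, with assumed Uniform(0, 1) prices and illustrative search-cost values rather than numbers from any paper cited here: lowering the search cost lowers the consumer’s reservation price, which lowers both the average transaction price and the dispersion of prices actually paid, while raising the effective cost of search (an extra click, say) does the opposite.

```python
import math

def reservation_price(search_cost):
    """With prices drawn i.i.d. from Uniform(0, 1), a sequential searcher's
    reservation price r solves c = E[max(r - p, 0)] = r**2 / 2, so r = sqrt(2c)."""
    return math.sqrt(2 * search_cost)

def transacted_price_moments(search_cost):
    """The consumer accepts the first price below r, so transacted prices are
    Uniform(0, r): mean r / 2 and standard deviation r / sqrt(12)."""
    r = reservation_price(search_cost)
    return r / 2, r / math.sqrt(12)

# Illustrative search costs: clicking around online is cheaper than driving to stores.
offline_mean, offline_sd = transacted_price_moments(0.02)
online_mean, online_sd = transacted_price_moments(0.002)
print(f"offline: mean paid {offline_mean:.3f}, sd of paid prices {offline_sd:.3f}")
print(f"online:  mean paid {online_mean:.3f}, sd of paid prices {online_sd:.3f}")
```

Read in reverse, the same formula rationalizes obfuscation: anything a seller does to raise the per-click cost pushes the reservation price up and softens price competition.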


1.2 Placement: How do low search costs affect channel relationships?

Reduced search costs facilitate exchange more generally, often enabled by large digital platforms. Many major technology firms can be seen as platform-based businesses. For example, Google and Facebook are platforms for advertisers and buyers. Jullien (2012) highlighted that digital markets give rise to platforms because of low search costs that facilitate matching and enable trade. Horton and Zeckhauser (2016) emphasize that many large digital platforms are driven by low search costs that enable efficient use of unused capacity for durable goods. This emphasis on unused capacity, and the need to match supply and demand, means that much research takes a market design perspective (Einav et al., 2018). Cullen and Farronato (2016) emphasize the challenges of matching supply and demand over time and the importance of economies of scale in matching. Zervas et al. (2017) emphasize how supply changes in response to changes in demand. Bapna et al. (2016) emphasize that platform design can also be informed by consumer theory. These platforms often provide an alternative type of distribution channel, through which sellers can reach buyers. This can enable new markets and affect incumbents. For example, several papers have examined the accommodation industry (Fradkin, 2017; Farronato and Fradkin, 2018; Zervas et al., 2017). Zervas et al. (2017) examine how the introduction of Airbnb as a channel for selling accommodations reduced demand in the incumbent hotel industry in a particular way. Airbnb provided a channel for selling temporary accommodation. This enabled accommodations to go on and off the market as demand fluctuated. Consequently, the impact of Airbnb is largest in periods of peak demand (such as the SXSW festival in Austin, Texas). In these periods, hotel capacity is constrained. Airbnb ‘hosts’ play a role in providing additional capacity. This means that hotel prices do not rise as much.
Digital platforms serve as distribution channels in a wide variety of industries, including airlines (Dana and Orlov, 2014), books (Ellison and Ellison, 2017), food trucks (Anenberg and Kung, 2015), entertainment (Waldfogel, 2018), and cars (Hall et al., 2016). In many of these cases, a key role of online platforms is to provide an additional channel to overcome capacity constraints (Farronato and Fradkin, 2018). These constraints may be regulatory, as in the case of limited taxi licensing; related to fixed costs and technological limits, as in the case of YouTube as a substitute for television; or both, as in the case of accommodation, where hotel rooms have high fixed costs and short-term rentals are constrained by regulation. Given that a key role of these online platforms is to overcome capacity constraints in offline distribution channels, this provides a structure for understanding where new online platforms may arise. They will likely appear in places where existing distribution channels generate capacity constraints, particularly in the presence of large demand fluctuations. Furthermore, it provides a structure for identifying which existing channels and incumbent sellers will be most affected by online platforms: Those in which capacity constraints generate a key source of their profits. As Farronato and Fradkin (2018) show, hotels lost their ability to charge unusually high prices during periods of peak demand.
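The capacity-constraint logic can be illustrated with a stylized market-clearing calculation. This is our own sketch with made-up numbers, not the model estimated in any of the papers above: flexible peer supply enters only when the price exceeds hosts’ reservation price, so it leaves off-peak prices untouched while capping the spike that fixed hotel capacity would otherwise generate at peak.

```python
def clearing_price(demand_intercept, hotel_capacity,
                   airbnb_slope=0.0, host_reservation=float("inf")):
    """Linear demand Q = a - p meets fixed hotel capacity K, plus flexible
    peer supply m * (p - p0) that enters only when price exceeds the
    hosts' reservation price p0 (all values hypothetical)."""
    p_hotels_only = demand_intercept - hotel_capacity
    if p_hotels_only <= host_reservation:
        return p_hotels_only  # peer supply stays out of the market
    # Market clearing with peer supply: a - p = K + m * (p - p0)
    return (demand_intercept - hotel_capacity
            + airbnb_slope * host_reservation) / (1 + airbnb_slope)

offpeak_no = clearing_price(150, 100)
offpeak_ab = clearing_price(150, 100, airbnb_slope=2.0, host_reservation=60)
peak_no = clearing_price(300, 100)
peak_ab = clearing_price(300, 100, airbnb_slope=2.0, host_reservation=60)
print(offpeak_no, offpeak_ab)  # identical: peer supply never enters off-peak
print(peak_no, peak_ab)        # the peak-demand price spike is capped
```

The asymmetry is the point: the incumbent loses exactly the profits that came from binding capacity at peak demand, which matches where the empirical effects are found.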




In summary, digital platforms facilitate a reduction in search costs. This creates an opportunity for sellers working at a small scale to find buyers. By enabling an influx of sellers, online platforms overcome capacity constraints, creating new opportunities for sellers, new benefits to buyers, and new threats to the existing distribution channels and the larger incumbent sellers. Much of the initial literature on platforms or two-sided networks was led by economists inspired by antitrust litigation in credit cards (Rochet and Tirole, 2003; Armstrong, 2006). However, recently the literature has exploded in marketing because so many large platforms – such as Amazon, Facebook, and Google – are primarily marketing channels. This means that the digital marketing literature is at the core of much of the debate about the extent to which such platforms represent a challenge for antitrust regulators (Chiou and Tucker, 2017b).

1.3 Product: How do low search costs affect product assortment?

Anderson (2006) emphasized that the internet increases the purchase of niche or ‘long tail’ products relative to mainstream or superstar products. Consistent with this hypothesis, Brynjolfsson et al. (2011) find that the variety of products available and purchased online is higher than offline. Zentner et al. (2013) use a quasi-experimental estimation strategy to show that consumers are more likely to rent niche movies online and blockbusters offline. Datta et al. (2018) demonstrate that the move to streaming, rather than purchasing, music has led to a wider variety of music consumption and increased product discovery, which in turn increases the variety of music available (Aguiar and Waldfogel, 2018). Zhang (2018) links this discovery of relatively unknown products to low search costs. Empirical evidence suggests that this increase in variety increased consumer surplus (Brynjolfsson et al., 2003). However, Quan and Williams (2018) suggest that the increase in the variety of products purchased by consumers has been overestimated by the literature. In particular, they note that tastes are spatially correlated, and examine the consequences of spatially correlated tastes for the distribution of product assortment both online and offline. The key finding is that offline product assortment has been mis-measured, because products that might appear to be rarely purchased in a national sample could still have sufficient local demand in certain markets that they would be available. Drawing on this insight, they build a structural model of demand and show that the welfare effects of the internet through the long tail are much more modest than many previous estimates.
These relatively modest welfare benefits of the long tail and of increased variety are consistent with Ershov’s (2019) research in the context of online software downloads from the Google Play store, which emphasizes the general benefits of a reduction in search costs for consumers. While much of the popular discussion has emphasized the long tail, the effect of search costs on product assortment is ambiguous. If there are vertically differentiated products, low search costs mean that consumers will all be able to identify the best product. Bar-Isaac et al. (2012) provide a theoretical framework that combines superstar and long tail effects as search costs fall, demonstrating that lower search costs hurt middle-tier products while helping extremes. Elberse and Eliashberg (2003) document both effects in the entertainment industry.
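The superstar side of this mechanism can be illustrated with a small simulation. This is our own sketch, assuming a simple quality-plus-taste utility and arbitrary parameter values, not the model in Bar-Isaac et al. (2012): as search costs fall, consumers effectively compare more products, and sales concentrate on the highest-quality sellers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_products, n_consumers = 300, 5000
quality = rng.normal(size=n_products)                    # vertical quality component
top_third = set(np.argsort(quality)[-n_products // 3:])  # top-quality tercile

def top_tercile_sales_share(k):
    """Each consumer compares k randomly drawn products (larger k = lower
    search costs) and buys the one maximizing quality + idiosyncratic taste."""
    hits = 0
    for _ in range(n_consumers):
        sampled = rng.choice(n_products, size=k, replace=False)
        utilities = quality[sampled] + rng.normal(size=k)
        hits += int(sampled[np.argmax(utilities)] in top_third)
    return hits / n_consumers

share_high_cost = top_tercile_sales_share(2)   # costly search: compare 2 products
share_low_cost = top_tercile_sales_share(10)   # cheap search: compare 10 products
print(share_high_cost, share_low_cost)
```

Swapping the common quality component for purely idiosyncratic tastes would instead spread sales toward the tail, which is why the net effect on assortment is theoretically ambiguous.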


As noted above, search costs can be endogenously chosen by firms. Recommendation engines are one tool through which firms choose which attributes to emphasize, lowering search costs in some dimensions and not others. Fleder and Hosanagar (2009) show that simple changes to recommendation engine algorithms can bias purchases toward superstar or long tail effects. Superstar effects occur when the recommendation engine primarily suggests ‘people who bought this also bought’. In contrast, long tail effects occur when the engine instead suggests ‘people who bought this disproportionately bought’. Consistent with this framing, Zhang and Liu (2012) and Agrawal et al. (2015) show how recommendation engines can lead to a small number of products receiving the most attention when they focus on showing which products are most popular. In contrast, Tucker and Zhang (2011) provide an example in which a recommendation engine that highlights the popularity of a digital choice has asymmetrically large effects for niche products. This occurs for a different reason than the one highlighted in Fleder and Hosanagar (2009). In this case, the release of popularity information allowed niche sellers to appear relatively popular and consequently signal their quality or general attractiveness. Overall, reduced search costs appear to increase product assortment while also increasing sales at the top of the distribution. We have empirical evidence of both long tail and superstar effects, probably at the expense of products in the middle of the distribution. While the variety of products offered has increased, Quan and Williams (2018) highlight that the welfare consequences appear to be small in the context of a particular set of online products. This contrasts with the evidence summarized in Waldfogel (2018), who argues that digitization has led to a substantial increase in consumer welfare in the entertainment industry.
Currently, the literature does not have a systematic structure for identifying when increased product assortment will have a large welfare impact. The evidence presented by both Quan and Williams (2018) and Waldfogel (2018) is compelling. It suggests that the particular characteristics of the product category will determine welfare effects. What those characteristics might be remains an open research question, and one that marketing contexts are well suited to answer.
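The distinction between the two recommendation rules contrasted by Fleder and Hosanagar (2009) is easy to state in code. The sketch below uses hypothetical purchase baskets of our own devising, not their simulation: ranking co-purchases by raw counts favors already-popular items, while normalizing by overall popularity (a lift score) favors niche items specifically tied to the anchor product.

```python
def also_bought(anchor, baskets):
    """'People who bought this also bought': rank co-purchases by raw count."""
    co = {}
    for b in baskets:
        if anchor in b:
            for item in b - {anchor}:
                co[item] = co.get(item, 0) + 1
    return max(co, key=co.get)

def disproportionately_bought(anchor, baskets):
    """'People who bought this disproportionately bought': rank by lift,
    i.e. P(item | anchor) / P(item)."""
    n = len(baskets)
    anchored = [b for b in baskets if anchor in b]
    items = set().union(*baskets) - {anchor}
    def lift(item):
        p_item = sum(item in b for b in baskets) / n
        p_item_given_anchor = sum(item in b for b in anchored) / len(anchored)
        return p_item_given_anchor / p_item
    return max(items, key=lift)

baskets = [  # hypothetical purchase histories
    {"indie_a", "hit"}, {"indie_a", "hit"}, {"indie_a", "indie_b"},
    {"indie_a", "indie_b", "hit"}, {"hit", "pop"}, {"hit", "pop"},
    {"hit"}, {"indie_b"},
]
print(also_bought("indie_a", baskets))               # raw counts favor the hit
print(disproportionately_bought("indie_a", baskets)) # lift favors the niche item
```

The same data thus yield a superstar-reinforcing recommendation under one rule and a long-tail-reinforcing one under the other, which is why seemingly small algorithmic choices shape the sales distribution.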

1.4 Promotion: How do low search costs affect advertising?

Advertising is often modeled as a process for facilitating search (Bagwell, 2007). Online search costs affect advertising in a variety of ways. For example, low search costs online can also change the nature and effectiveness of offline advertising. Joo et al. (2014) show that television advertising leads to online search. Such searches can in turn lead to better information about products and higher sales. In other words, the ability to search online can make other advertising more effective, and it can enable advertisers to include an invitation to search in their messaging. Modeling advertising as a search process is particularly useful in the context of search engine advertising. This is advertising that responds directly to consumer search behavior. Consumers enter what they are looking for into the search engine. Advertisers respond to that statement of intent. Search engine advertising allows both online and offline advertisers to find customers. Kalyanam et al. (2017) show how search engine ads affect offline stores. Even within search engine advertising, search costs vary. Ghose et al. (2012) demonstrate the importance of rank in search outcomes. Higher-ranked products get purchased more. Narayanan and Kalyanam (2015) and Jeziorski and Moorthy (2017) both document that rank matters in search engine advertising in particular, but that this effect varies across advertisers and contexts. Despite the widespread use of search advertising, there is some question about whether search advertising is effective at all. Li et al. (2016) discuss how industry practice attributes consumer purchases to particular advertisements. Search engine advertising appears most effective because people who click on search ads are very likely to end up purchasing and because the click on the search ad is often the ‘last click’ before purchase. Many industry models treat this last click as the most valuable, and hence search engine advertising is seen as particularly effective. Li et al. (2016) argue that the industry models overestimate the effectiveness of search ads. Blake et al. (2015) describe why the effectiveness of search advertising may be overestimated. They emphasize the importance of the counterfactual situation where the ad did not appear. If the search engine user would click the algorithmic link instead of the advertisement, then the advertiser would receive the same result for free. The paper shows the result of a field experiment conducted by eBay, in which eBay stopped search engine advertising in a randomly selected set of local markets in the United States. Generally, eBay sales did not fall in markets without search engine advertising compared to markets with search engine advertising. In the absence of search ads, it appears that users clicked the algorithmic links and purchased at roughly the same rate. This was particularly true of the branded keyword search term ‘eBay’.
In other words, careful analysis of the counterfactual suggested that search engine advertising generally did not work (except in a small number of specialized situations). This research led eBay to substantially reduce its search engine advertising. Simonov et al. (2018a) revisit the effectiveness of search engine advertising and focus on advertisements for the branded keyword, but for less prominent advertisers than eBay. Using data from search results at Bing’s search engine, they replicate the result that search engine advertising is relatively ineffective for very well known brands. They then demonstrate that search engine advertising is effective for less well known brands, particularly for those that do not show up high in the algorithmic listings. Overall, these papers have shown that understanding the search process – for example, through examining heterogeneity in the counterfactual options when search advertising is unavailable – is key to understanding when advertising serves to lower search costs. Coviello et al. (2017) also find that search advertising is effective for a less well-known brand. Simonov et al. (2018b) show that competitive advertising provides a further benefit of search engine advertising: If competitors are bidding on a keyword (even if that keyword is a brand name), then there can be a benefit to paying for search engine advertising even for advertisers who appear as the top algorithmic link. In other words, Blake et al. (2015), Simonov et al. (2018a), and Simonov et al. (2018b) together demonstrate what might seem obvious ex post: Search advertising meaningfully lowers search costs for products that are relatively difficult to find through other means. This finding is nevertheless important. Search advertising is a multi-billion dollar industry, and many marketers appear to have been mis-attributing sales to the search advertising channel. These three papers provide a coherent picture of how search engine advertising works. The open questions relate to how changes in the nature of search advertising – and the addition of new search channels such as mobile devices and personal assistants – might affect this picture. As we move off larger screens onto more limited-bandwidth devices, search costs may rise, and even strong brands may benefit from search advertising. Tools and questions from economics – such as thinking about the right counterfactual – have led to an extensive and important literature on search engines that spans marketing and economics. Even though search engines are so recent, the speed with which this literature has sprung up reflects the growing importance of search engines and other search mechanisms in the digital economy.
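The attribution pitfall these papers identify reduces to a few lines of arithmetic. The numbers below are purely hypothetical, chosen only to mimic the logic of the eBay experiment rather than its actual estimates: last-click attribution credits every ad-click sale to the ad, while the experimental comparison credits only sales that would not have happened through the organic link.

```python
# Hypothetical sales per market, by channel, under the two experimental arms.
ads_on = {"via_paid_ad": 60, "via_organic_link": 41}   # treatment markets (ads running)
ads_off = {"via_paid_ad": 0, "via_organic_link": 100}  # control markets (ads paused)

# Industry-style last-click attribution: every ad-click sale is credited to the ad.
last_click_credit = ads_on["via_paid_ad"]

# Experimental lift: total sales with ads minus total sales without them.
incremental_sales = sum(ads_on.values()) - sum(ads_off.values())

print(last_click_credit)   # sales the attribution model credits to advertising
print(incremental_sales)   # sales that are actually incremental
```

Almost all ad-click sales here are substitution from the organic link, so the attribution model overstates effectiveness by an order of magnitude; for an advertiser with no prominent organic link, the two numbers would move much closer together.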

2 The replication cost of digital goods is zero

Digital goods are non-rival. They can be consumed by one person without reducing the amount or quality available to others. Fire is a non-rival good. If one person starts a fire, they can use it to light someone else’s fire without diminishing their own. The non-rival nature of digital goods leads to important implications for marketing, particularly with respect to copyright and privacy. The internet is, in many ways, a “giant, out of control copying machine” (Shapiro and Varian, 1998). This means that a key challenge for marketers in the era of digitization is controlling product reproduction – free online copying – by consumers.

2.1 Pricing: How can non-rival digital goods be priced profitably?

Non-rival goods create pricing challenges. If customers can give their purchases away without decreasing the quality of what they bought, it becomes difficult to sustain positive prices. The initial response by many producers of digital products was both legal (through copyright enforcement) and technological (through digital rights management). The effect of such policies on consumer purchases is theoretically ambiguous, and the empirical evidence is mixed (Varian, 2005; Vernik et al., 2011; Danaher et al., 2013; Li et al., 2015; Zhang, 2018). Non-rivalry can also create opportunities for price discrimination. Lambrecht and Misra (2016) examine price discrimination in sports news. In this context, the customers with the highest willingness to pay appear all year, while casual fans primarily read news in-season. It is therefore profitable for a sports website to provide more free articles during periods of peak demand and to require a subscription for more content during the off-season, when the remaining audience consists disproportionately of the highest-value customers. Rao and Hartmann (2015) examine price discrimination in digital video, comparing options to rent
or buy a digital movie. The paper shows that in the zero marginal cost digital context, dynamic considerations play an important role. The marketing literature has long focused on tactical and practical questions about how to price (Rao, 1984). It is therefore not surprising that scholars at the boundary between marketing and economics have been exploring the new frontier question of how to price non-rival digital goods.
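The Lambrecht and Misra (2016) logic can be made concrete with a toy two-period model. This is a sketch of the mechanism only; the segment sizes, willingness-to-pay figures, and ad rate below are hypothetical, not from their paper:

```python
# Toy model of the Lambrecht and Misra (2016) mechanism: paywall content in the
# period when the remaining audience is disproportionately high-value.
# All numbers are hypothetical, chosen only to make the mechanism visible.

N_FANS, N_CASUAL = 1_000, 9_000    # die-hard fans read all year; casuals only in-season
WTP_FAN, WTP_CASUAL = 50.0, 5.0    # willingness to pay for a subscription
AD_REV_PER_READER = 1.0            # ad revenue per free (non-subscribing) reader

def revenue(paywall_off_season: bool, price: float) -> float:
    """Two periods: in-season (fans + casuals read) and off-season (fans only).
    The paywall is enforced in exactly one period; the other period is free."""
    if paywall_off_season:
        # Only fans want off-season access; they subscribe if the price is worth it.
        subs = N_FANS if price <= WTP_FAN else 0
        # In-season is free: casuals and any non-subscribing fans see ads.
        free_readers = N_CASUAL + (N_FANS - subs)
    else:
        # Everyone reading in-season must subscribe.
        subs = (N_FANS if price <= WTP_FAN else 0) + (N_CASUAL if price <= WTP_CASUAL else 0)
        # Off-season is free, but only fans show up, and subscribing fans see no ads.
        free_readers = 0 if price <= WTP_FAN else N_FANS
    return subs * price + free_readers * AD_REV_PER_READER
```

With these numbers, keeping peak-season articles free and charging the off-season die-hards a high subscription price yields more revenue than gating the peak season, because subscription revenue from fans is preserved while the large casual audience still generates ad revenue.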

2.2 Placement: How do digital channels – some of which are illegal – affect the ability of information good producers to distribute profitably?

Digital channels affect the ability of the producers of information goods to distribute profitably. For example, music industry revenue began to fall in 1999, and this has been widely blamed on the impact of digitization generally and free online copying in particular (Waldfogel, 2012). This raises the question of the optimal restrictions on free online copying, imposed by governments through copyright and by firms through digital rights management. While the direct effect of free online copying is to reduce revenues, such free copying may induce consumers to sample new music and buy later (Peitz and Waelbroeck, 2006). Furthermore, Mortimer et al. (2012) show that revenues for complementary goods (like live performances) could rise. Despite this ambiguity, the vast majority of the empirical literature has shown that free online copying does reduce revenue across a wide variety of industries (Zentner, 2006; Waldfogel, 2010; Danaher and Smith, 2014; Godinho de Matos et al., 2018; Reimers, 2016). The core open marketing questions therefore relate to the development and distribution of complementary goods. To the extent that free and even illegal channels for distribution are inevitable for many digital goods, what are the opportunities for intermediaries to facilitate profitable exchange? In other words, beyond selling tickets to live music, it is important to understand the ways in which industry has reacted to these changes. As free video distribution becomes widespread through platforms like YouTube and through illegal channels, it may generate incentives to offer subscription bundles (as in Netflix) rather than charging per view (as in the cinema). It may also generate incentives to produce merchandizable content, and then earn profits through toy and clothing licensing, theme parks, and other channels.
This in turn may affect the role of entertainment conglomerates in the industry. If merchandizing is necessary, then companies like Disney may have an advantage because they own theme parks, retail stores, and other channels. Aside from Mortimer et al. (2012), our empirical understanding of how complements to digital information goods arise, how they work, and how they change the nature of the firm remains limited.

2.3 Product: What are the motivations for providing digital products given their non-excludability?

Intellectual property laws exist because they can generate incentives to innovate and create new products. The non-rival nature of digital goods leads to widespread violation of copyright and questions about what constitutes fair use. Many people consume digital products without paying for them. While the owners of copyrighted works are harmed, the provision of a product at zero price increases consumer surplus and eliminates deadweight loss. It also allows for valuable derivative works. In a static model, this is welfare-enhancing: consumers benefit more than producers are hurt. Therefore, the key question with respect to digitization and copyright concerns the creation of new products. Waldfogel (2012) provides evidence suggesting that the quality of music has not fallen since Napster began facilitating free online copying in 1999. While digitization did reduce incentives to produce because of online copying, the costs of production and distribution fell as well. For distribution, low marginal costs of reproduction meant that early-stage artists could distribute their music widely and become known, even without support from a music label or publisher (Waldfogel, 2016; Waldfogel and Reimers, 2015). The title of Waldfogel’s book ‘Digital Renaissance’ (Waldfogel, 2018) summarizes a decade of his research emphasizing that digitization has led to more and better quality entertainment despite increased copying, largely because of reduced production and distribution costs. In our view, the argument Waldfogel presents is convincing. The challenge for marketing scholars is to extend this literature by better understanding the profitable provision of goods that serve as complements to digital goods.

2.4 Promotion: What is the role of aggregators in promoting digital goods?

Non-rivalry means that it is easier for companies to replicate and aggregate the digital content of other firms. Such aggregators both compete with the producing firm’s content and promote it (Dellarocas et al., 2013). Thus, the distinction between advertisement and product can become ambiguous. This tension has been examined empirically in the context of news aggregators. Exploiting policy changes in Europe, three different studies have shown that news aggregators served more to promote producing firms’ content than to cannibalize their revenues (Calzada and Gil, 2017; Chiou and Tucker, 2017a; Athey et al., 2017b). For example, Calzada and Gil (2017) show that shutting down Google News in Spain substantially reduced visits to Spanish news sites. Chiou and Tucker (2017a) found similar evidence of market expansion by examining a contract dispute between the Associated Press and Google News. In general, then, the empirical evidence suggests that news aggregators have a market expansion effect rather than a cannibalization effect.

3 Lower transportation costs

Information stored in bits can be transported at the speed of light. Therefore, digital goods and digital information can be transported anywhere at near-zero cost. Furthermore, the transportation cost to the consumer of buying physical goods online can be relatively low. As emphasized by Balasubramanian (1998), the transportation costs
of traveling to an offline retailer are reduced, even if an online retailer still needs to ship a physical product.

3.1 Placement: Does channel structure still matter if transportation costs are near zero?

Digitization added a new marketing channel (Peterson et al., 1997). For digital goods, this channel is available to anyone with an internet connection. For physical goods bought online and shipped, this channel is available to anyone within the shipping range – in the United States, just about anybody with a mailing address. A variety of theory papers examined the impact of the online channel on marketing strategy (Balasubramanian, 1998; Liu and Zhang, 2006). These papers build on existing models of transportation costs that themselves build on Hotelling (1929). They model online retailers as being equidistant from all consumers, while consumers have different costs of visiting offline retailers, depending on their location. Empirical work has generally supported the use of these models. The new channel competed with the existing offline channels for goods that needed to be shipped (Goolsbee, 2001; Prince, 2007; Brynjolfsson et al., 2009; Forman et al., 2009; Choi and Bell, 2011) and for goods that could be consumed digitally (Sinai and Waldfogel, 2004; Gentzkow, 2007; Goldfarb and Tucker, 2011a,d; Seamans and Zhu, 2014; Sridhar and Sriram, 2015). Forman et al. (2009) explicitly test the applicability of Balasubramanian (1998) to the context of online purchasing on Amazon. Using weekly data on top-selling Amazon books by US city, the paper examines changes in locally top-selling books when offline stores open (specifically Walmart, Target, Barnes & Noble, and Borders). The paper shows that when offline stores open, books that are relatively likely to appear in those stores disproportionately fall out of the top sellers list. The paper interprets this as evidence of the transportation costs consumers incur in visiting offline retailers: When a retailer opens nearby, consumers become more likely to buy books offline.
If the retailer is relatively far away, then consumers are more likely to buy online. A key limitation of this paper is that the data are ranked sales, rather than actual purchase data. Work with purchase data is more limited, though Choi and Bell (2011) show similar evidence of online-offline substitution in the context of online diaper purchasing. This result of online-offline substitution is not always evident. For multi-channel retailers, while substitution does occur in many situations, there are particular situations in which the offline channel enhances the online channel, such as when a brand is relatively unfamiliar in a location where a new store opens (Wang and Goldfarb, 2017; Bell et al., 2018). In particular, Wang and Goldfarb (2017) examined US sales at a large clothing retailer with a substantial presence both online and offline. During the period of study, the retailer substantially expanded the number of offline stores. Using internal sales data, as well as information on website visits, the analysis compares locations in which sales were high at the beginning of the sample period with locations in which sales were low. For places that already had high sales, opening an offline store reduced online purchasing. In these places, online and offline served as
competing channels, consistent with the prior literature. In contrast, in locations in which sales were low, the opening of offline stores led to an increase in online sales. This increase occurred in a variety of product categories, not only those that required the customer to know whether the clothes fit. The evidence suggests a marketing communications role for the offline channel. These results suggest more nuance than simply ‘online is a substitute for offline.’ They lend some validity to the widespread use among practitioners of the jargon term ‘omnichannel’ (Verhoef et al., 2015). In particular, while a long and careful literature suggests that the arrival of online competition reduced offline sales – and that new offline competitors reduce online sales – within a single firm the results are more nuanced. The offline store can drive traffic to the online channel, and in doing so it serves two roles: sales channel and communications channel. This suggests the possibility that an online store can also drive traffic to an offline channel – a nascent literature explores this but, as might be expected, establishing causality is hard.
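The workhorse logic behind these substitution results can be sketched with a stylized Hotelling-style calculation in the spirit of Balasubramanian (1998). The parameter values below are hypothetical, chosen only to illustrate the threshold:

```python
# Stylized channel choice: an offline purchase costs the price plus a travel
# disutility t*d that grows with distance d to the store; an online purchase
# costs the price plus a shipping/waiting disutility that is the same for all
# consumers. All parameter values are hypothetical.

P_OFF, P_ON = 20.0, 18.0   # offline and online prices
T = 1.0                    # travel disutility per unit of distance
C = 5.0                    # shipping/waiting disutility of buying online

def buys_offline(d: float) -> bool:
    """A consumer at distance d buys offline if its total cost is lower."""
    return P_OFF + T * d < P_ON + C

# Indifference distance: consumers closer than d_star buy offline.
d_star = (P_ON + C - P_OFF) / T   # = 3.0 with these numbers
```

When a retailer opens a new store, d falls for nearby consumers, pushing those inside the indifference distance back to the offline channel – the substitution pattern Forman et al. (2009) document.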

3.2 Product: How do low transportation costs affect product variety?

In the absence of the online channel, all purchases would be made offline, and each person would be constrained to purchase the products available locally. As highlighted above in the context of the long tail, the online channel provides access to a much wider variety of products and services. Sinai and Waldfogel (2004) show that online media enables non-local news consumption. In particular, they show that digitization makes it relatively easy for African Americans living in primarily white neighborhoods to read similar news to African Americans living in African American neighborhoods. Likewise, digitization makes it relatively easy for whites living in African American neighborhoods to read similar news to whites living in white neighborhoods. Similarly, Gandal (2006) shows that online media enables local language minorities to read news in their language of choice. Choi and Bell (2011) document that sales of niche diaper brands are higher online in zipcodes where such brands are generally not available offline. Low transportation costs enable product variety by reducing geographic barriers to distribution. While tastes are spatially correlated (Blum and Goldfarb, 2006; Quan and Williams, 2018), distribution is not limited by local tastes. As discussed earlier, Quan and Williams (2018) show that spatially correlated tastes are reflected in offline offerings. This means that the welfare impact of online product variety is smaller than it might seem if measured by the number of varieties available. Combined, these results suggest that the welfare impact of increased product variety will disproportionately accrue to people with preferences distinct from their neighbors’, what Choi and Bell (2011) call ‘preference minorities’. This provides an additional layer for interpreting the results of Quan and Williams (2018).
If the welfare impact of increased online variety accrues to local preference minorities, then the benefit may be larger than a straight utilitarian analysis would suggest.

3.3 Pricing: Does pricing flexibility increase because transportation costs are near zero?

Low transportation costs constrain online pricing in several ways. First, there is the competition highlighted above, both among online retailers and between online and offline retailers. Second, one aspect of low online transportation costs is the reduced physical effort when consumers are not required to carry items home from the store. Pozzi (2013) shows that online grocery buyers stockpile more than offline grocery buyers, purchasing in bulk when a discount appears. This ability to stockpile further restricts online pricing strategies. Third, it is difficult, though not impossible, to charge different prices for the same item at different locations; the media has not treated kindly retailers who have been caught charging different online prices to buyers in different locations, even when done to match local offline store prices (Valentino-Devries et al., 2012; Cavallo, 2017).

3.4 Promotion: What is the role of location in online promotion?

Location matters in online promotion. This is partly because – as mentioned above – tastes are spatially correlated. In addition, a long sociology literature, at least since Hampton and Wellman (2003), shows that social networks are highly local. Marketers have long known that word of mouth is perhaps the most effective form of promotion (Dellarocas, 2003). Online word of mouth has become increasingly important, as we discuss in the context of verification costs, but offline word of mouth remains a key promotion tool even for products sold entirely online. Even though individuals can communicate with anyone anywhere, much online communication is between people who live in the same household or work in the same building. Promotion through local social networks can be effective (Bell and Song, 2007; Choi et al., 2010). For example, in the context of online crowdfunding of music, Agrawal et al. (2015) show that local social networks provided early support that helped promote musicians to distant strangers. There is also suggestive evidence that online recommendations are more effective if provided by people who live nearby (Forman et al., 2008). In other words, although the transportation costs for digital goods are near zero, and the transportation costs for consumers of visiting stores are reduced, a different type of transportation cost persists. This leads to spatially correlated social networks, which in turn lead to spatially correlated word-of-mouth promotion. While the online word-of-mouth literature has grown rapidly, there is still little understanding of how online and offline social networks interact. We expect the quantitative marketing literature to be well placed to address this. As Facebook and other online social network platforms become increasingly important promotion channels, this gap in understanding limits our ability to design online promotion strategies.

4 Lower tracking costs

Literatures on search, replication, and transportation all began in the 1990s and were well established in the early digital marketing literature. More recently, it has become clear that two additional cost shifts have occurred: Tracking costs and verification costs have fallen. It is easy to track digital activity. Tracking is the ability to link an individual’s behavior digitally across multiple different media, content venues, and purchase contexts. Often, information is collected and stored automatically. Tracking enables extremely fine segmentation, and even personalization (Ansari and Mela, 2003; Murthi and Sarkar, 2003; Hauser et al., 2014). This has created new opportunities for marketers in promotion, pricing, and product offerings. The effect on placement has been weaker, largely because coordination difficulties between vertical partners often make tracking harder.

4.1 Promotion: How do low tracking costs affect advertising?

Marketing scholars have been particularly prolific in studying the impact of low tracking costs on advertising. The improved targeting of advertising through digital media is perhaps the dominant theme in the online advertising literature (Goldfarb and Tucker, 2011b; Goldfarb, 2014). Many theoretical models of how digitization would affect advertising emphasize targeting (Chen et al., 2001; Iyer et al., 2005; Gal-Or and Gal-Or, 2005; Anand and Shachar, 2009). Much of this work has emphasized online-offline competition when online advertising is targeted, and the scarcity of advertising space online and offline (Bergemann and Bonatti, 2011; Athey et al., 2016). A large empirical literature has explored various strategies for successful targeting. Goldfarb and Tucker (2011c) show that targeted banner advertising is effective, but only as long as it does not take over the screen too much. Targeting works when it is subtle, in the sense that it has the biggest impact on plain banner ads relative to other, more obtrusive, types of ads. Tucker (2014) shows a related result in the context of social media targeting: Targeting works when it is not too obvious to the end consumer that an ad is closely targeted. Other successful targeting strategies include retargeting (to a partial extent) (Lambrecht and Tucker, 2013; Bleier and Eisenbeiss, 2015; Johnson et al., 2017a), targeting by stage in the purchase funnel (Hoban and Bucklin, 2015), time between ad exposures (Sahni, 2015), search engine targeting (Yao and Mela, 2011), and targeting using information on mobile devices (Bart et al., 2014; Xu et al., 2017). In each case, digitization facilitates targeting and new opportunities for advertising. In addition to better targeting, better tracking enables the measurement of advertising effectiveness (Goldfarb and Tucker, 2011b).
Early attempts to measure banner advertising effectiveness include Manchanda et al. (2006) and Rutz and Bucklin (2012). Tracking makes it relatively straightforward to identify which customers see ads, to track purchases, and to randomize advertising between treatment and control groups. More generally, prior to the diffusion of the internet, advertising measurement had relied on aggregate correlations (with the exception of a small number of expensive experiments such as Lodish et al., 1995). Perhaps the clearest result of the increased ability to run advertising experiments because of better tracking is the finding that correlational studies of advertising effectiveness are deeply flawed. For example, Lewis et al. (2011) use data from banner ads on Yahoo to show the existence of a type of selection bias that they label ‘activity bias’. This occurs because users who are online at the time an advertisement is shown are disproportionately likely to undertake other online activities, including those used as outcome measures in advertising effectiveness studies. They show activity bias by comparing a randomized field experiment to correlational individual-level analysis: Measured advertising effectiveness is much lower in the experimental setting. One interpretation of this result would be to treat correlational analysis as an upper bound on the effectiveness of advertising. Gordon et al. (2019) demonstrate that this is not correct; instead, it is best to treat correlational analysis as having no useful information for measuring advertising effectiveness in the context they study. They examine a series of advertising field experiments on Facebook. Consistent with Lewis et al. (2011), they show that correlational analysis fails to measure advertising effectiveness properly. Importantly, they show that sometimes correlational analysis underestimates the effectiveness of an advertisement. Schwartz et al. (2017) demonstrate the usefulness of reframing experimental design as a multi-armed bandit problem. Measurement challenges extend beyond the need to run experiments. Ideally, advertising effectiveness would be measured by the increase in long term profits caused by advertising. Given the challenge of measuring long term profits, research has focused on various proxies for advertising success.
For example, in measuring the effectiveness of banner advertising, Goldfarb and Tucker (2011c) used data from thousands of online advertising campaigns in which advertising was randomized into treatment and control groups. The analysis delivered on the promise of better measurement, but the outcome measure was far from a measure of long term profits. In order to get a systematically comparable outcome measure across many campaigns, the paper used the stated purchase intent of people who took a survey after having been randomly allocated into seeing either the advertisement or a public service announcement. Advertising effectiveness was measured as the difference in stated purchase intent between the treatment and control groups. This is a limited measure of effectiveness in at least two ways. First, only a small fraction of those who saw the ads (whether treatment or control) are likely to take the survey, and so the measure is biased toward the type of people who take online surveys. Second, purchase intent is different from sales (which in turn is different from long term profits). In our view, for the purpose of comparing the effectiveness of different types of campaigns, this measure worked well. We were able to show that contextually targeted advertising increases purchase intent compared to other kinds of advertising, and that obtrusive advertising works better than plain advertising. Furthermore, we found that ads that were both targeted and obtrusive lifted purchase intent less than ads that were either targeted or obtrusive but not both. At the same time, this measure would not be useful for measuring the return on advertising investment or for determining the efficient allocation of advertising spending. To address questions like these, subsequent research has built new tools for measuring actual sales. Lewis and Reiley (2014) link online ads to offline sales using internal data from Yahoo! and a department store. The paper linked online user profiles to the loyalty program of the department store using email addresses. With this measure, they ran a field experiment on 1.6 million users that showed that online advertising increases offline sales in the department store. While still not a measure of long term profits, this outcome measure is more directly related to the true outcome of interest. This came at the cost of challenges in comparing across types of campaigns and across categories. This study was possible because the research was conducted by scholars working in industry. Such industry research has been important in developing better measures of outcomes, as well as more effective experimentation. Other examples include Lewis and Nguyen (2015), who show spillovers from display advertising to consumer search, and Johnson et al. (2017a), who provide a substantially improved method for identifying the control group in the relevant counterfactual for firms that choose not to advertise, and who examine hundreds of online display ad campaigns to show that they have a positive effect on average. Even in the presence of experiments and reliable outcome measures, Lewis and Rao (2015) show that advertising experiments tend to be severely underpowered: The effect of seeing one banner ad once on an eventual purchase is small. It can be meaningful and can deliver a positive return on investment, but demonstrating that requires a very large number of observations. Johnson et al. (2017b) show that better controls can increase the power of these tests, though the improvement is modest.
In addition, they found that careful experimental design and sample selection can lead to a substantial boost in power. Given these findings, advancing the literature poses some challenges for marketing scholars, because it appears increasingly necessary – given the high variance of advertising effectiveness and small effect sizes – to work with marketing platforms to calibrate effects. This need is magnified by the use of advertising algorithms on these platforms, which make understanding a counterfactual problematic (Eckles et al., 2018). It is unlikely that advertising platforms would encourage researchers to study newer issues facing their platforms, such as algorithmic bias (Lambrecht and Tucker, 2018) or the spread of misinformation through advertising (Chiou and Tucker, 2018). This is important because some of the biggest open research questions in digital marketing communications are no longer simply about advertising effectiveness. Instead, there are now large policy issues about the consequences of the ability to track and target consumers in this way. An example of the challenges facing the online targeting policy debate is the extent to which regulators should be worried about advertising that is deceptive or distortionary. Though there has been much discussion about the actions of firms such as Cambridge Analytica that used Facebook data to target political ads, as of yet there has been limited discussion in marketing
about the issues of deceptive uses of targeting. Again, we expect this will be a fruitful avenue of research.
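Returning to the measurement theme, the power problem described by Lewis and Rao (2015) can be made concrete with a textbook two-proportion sample-size calculation. The baseline conversion rate and lift below are hypothetical, chosen only to lie in the small-effect range that literature discusses:

```python
# Sample size per arm needed to detect a small advertising lift, using the
# standard normal-approximation formula for comparing two proportions.
# The conversion and lift numbers are hypothetical.

Z_ALPHA = 1.96  # two-sided 5% significance
Z_BETA = 0.84   # 80% power

def n_per_arm(p_control: float, p_treated: float) -> float:
    """Approximate sample size per arm for a two-proportion z-test."""
    var = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return (Z_ALPHA + Z_BETA) ** 2 * var / (p_treated - p_control) ** 2

# A 0.50% baseline conversion rate with a 5% relative lift (to 0.525%)
n = n_per_arm(0.005, 0.00525)   # roughly 1.3 million users per arm
```

Even for a lift large enough to be commercially meaningful, over a million users per arm are needed, which is why convincing measurement has required industry-scale experiments.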

4.2 Pricing: Do lower tracking costs enable novel forms of price discrimination?

Low tracking costs can enable new ways to price discriminate. Early commentators on the impact of digitization emphasized this potential (Shapiro and Varian, 1998; Smith et al., 2001; Bakos, 2001). Tracking means that firms can observe customer behavior and keep tabs on customers over time. This enables behavioral price discrimination (see Fudenberg and Villas-Boas (2012) and Fudenberg and Villas-Boas (2007) for reviews). This literature emphasizes how identifying previous customers affects pricing strategy and profitability (Villas-Boas, 2004; Shin and Sudhir, 2010; Chen and Zhang, 2011). While digital price discrimination has received a great deal of attention in the theory literature, empirical support is limited. Perhaps the best example is Dube and Misra (2017), who document that targeting many prices to different customers can be profitable in the context of an online service. This paper relies on a large scale field experiment to learn the optimal price discrimination policy, and then demonstrates, using a further experiment, that the learned policy outperforms other pricing strategies. In other words, the paper demonstrates the opportunity in price targeting and convincingly shows it works in a particular context using experimental design. Other examples of online price discrimination include Celis et al. (2014) and Seim and Sinkinson (2016). One area where we have seen high levels of price discrimination is online advertising. Individual-level tracking means that there are thousands of advertisements to price to millions of consumers. Price discrimination is feasible, but price discovery is difficult. As a consequence, digital markets typically use auctions to determine prices for advertising. Auctions facilitate price discovery when advertisements can be targeted to individuals based on their current and past behavior.
In the 1990s, online advertising was priced according to a standard rate in dollars (or cents) per thousand impressions. The early search engine Goto.com was the first to recognize that an auction could be used to price discriminate in search advertising: Rather than a fixed price per thousand impressions on the search page, prices could vary by search term. Today, both search and display advertising run on this insight, and a large literature has explored various auction formats for online advertising (Varian, 2007; Edelman et al., 2007; Levin and Milgrom, 2010; Athey and Ellison, 2011; Zhu and Wilbur, 2011; Arnosti et al., 2016). As long as an auction is competitive, the platform is able to price discriminate in much finer detail than before. While this might generate more efficient advertising, in the sense that the highest bidder values the advertisement the most, it may also enable the platform to capture more of the surplus from advertising. In other words, by enabling better price discrimination, advertising auctions likely lead to the familiar welfare effects of price discrimination between buyers and sellers – in this case, the buyers and sellers of advertising. The impact on
consumer welfare is ambiguous and likely depends on the particular way in which advertising enters the utility function.
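The auction logic pioneered by Goto.com and analyzed in the literature above can be sketched as a minimal generalized second-price (GSP) auction. The keyword, bids, and slot count below are hypothetical, and real ad auctions additionally weight bids by predicted click-through rates and apply reserve prices:

```python
# Minimal generalized second-price (GSP) auction for search advertising:
# bidders are ranked by bid, and each winner pays the next-highest bid per
# click. Bids and slot count are hypothetical.

def gsp(bids: dict, n_slots: int) -> list:
    """Return [(bidder, price_per_click), ...] for the winning slots."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winners = []
    for i in range(min(n_slots, len(ranked) - 1)):
        bidder, _ = ranked[i]
        winners.append((bidder, ranked[i + 1][1]))  # pay the bid just below yours
    return winners

# Hypothetical auction for one keyword with two ad slots:
allocation = gsp({"A": 2.50, "B": 1.80, "C": 0.90}, n_slots=2)
# A wins the top slot at B's bid; B wins the second slot at C's bid
```

Because the per-click price for each keyword is set by the competing bids on that keyword rather than by a posted rate card, the platform effectively price discriminates across advertisers and across search terms.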

4.3 Product: How do markets where the customer’s data is the ‘product’ lead to privacy concerns?

Tracking is an opportunity for marketers to segment. It also creates privacy concerns. Therefore, low tracking costs have led to a resurgence of policy interest in privacy. A core question in the privacy literature is whether privacy is an intermediate good that is valuable only because it affects consumers indirectly (such as through higher prices) or whether privacy is a final good that is valued in and of itself (Farrell, 2012). The theoretical literature has focused on privacy as an intermediate good (Taylor, 2004; Acquisti and Varian, 2005; Hermalin and Katz, 2006), while policy discussions often emphasize privacy as a final good. Research outside of marketing, such as Acquisti et al. (2013, 2015), has argued that this discussion is complicated by inconsistent consumer behavior with respect to privacy – a ‘privacy paradox’ in which consumers behave in ways that contradict their stated preferences (Athey et al., 2017a). Many examples of privacy regulation have been aimed at marketers. Such regulation limits what marketers can do with data. It affects the nature and distribution of outcomes (Goldfarb and Tucker, 2012). For example, European privacy regulation in the early 2000s substantially reduced the effectiveness of online advertising in Europe (Goldfarb and Tucker, 2011e). Assuming that opt-in policies mean that fewer users can be tracked, Johnson (2014) builds a structural model to estimate the financial costs of opt-in privacy policies relative to opt-out. The estimates suggest that opt-in policies can impose substantial financial costs on platforms.
While negative effects of privacy regulation have been shown in a variety of contexts (Miller and Tucker, 2009, 2011; Goldfarb and Tucker, 2011e; Miller and Tucker, 2018; Johnson et al., 2017c), firm-implemented policies that protect the privacy of their consumers can have strongly positive effects (Tucker, 2012, 2014). Privacy regulation also affects the nature of product market competition (Campbell et al., 2015). It can either constrain the ability of smaller firms to compete cost-effectively (Campbell et al., 2015) or lead firms to intentionally silo data about consumers (Miller and Tucker, 2014). In our view, the empirical privacy literature in marketing is surprisingly sparse. Marketers have an important role to play in the debate about data flows because we are among the primary users of data. While there has been some progress on research with respect to marketing policy, we have little empirical understanding of the strategic challenges that relate to privacy. How should firms balance customer demands for privacy against the usefulness of data for providing better products? What is the best way to measure the benefits of privacy to consumers, given that short-term measures suggest consumers are often not willing to pay much to protect their privacy, while the policy debate suggests consumers may care in the longer term? Overall, there are a number of opportunities for marketing scholars to provide a deeper understanding of when increased privacy protection will generate strategic advantage. We expect one such opportunity to come from new regulations, such as the EU General Data Protection Regulation (GDPR), which came into effect in May 2018 and is significant as the first privacy regulation with a truly global impact, affecting firms not just within the EU but across the world.

4.4 Placement: How do lower tracking costs affect channel management?

Lower tracking costs can make it easier for a manufacturer to monitor behavior in retail channels by tracking the prices available online. Israeli (2018) discusses the usefulness of minimum advertised pricing restrictions that manufacturers sometimes impose on retailers to reduce downstream price competition. Using a quasi-experimental setting, the paper demonstrates that easier tracking of online prices makes minimum advertised pricing policies more effective. Easier tracking enables different levels of control in channel relationships. We believe there are opportunities for further research in this area, especially in understanding how conflicts over control of digital technologies affect channel relationships. A recent example of such work is Cao and Ke (2019), who investigate how channel conflict emerges when it is possible to pinpoint precisely a pair of eyeballs that may be interested in a particular search query and to advertise to them.

5 Reduction in verification costs

An additional effect of reduced tracking costs has been improved verification. This was not anticipated in the early literature, which emphasized online anonymity. Perhaps the most familiar verification technology in marketing is the brand (Shapiro, 1983; Erdem and Swait, 1998; Tadelis, 1999; Keller, 2003). The ability to verify online identity and reputation without the need to invest in mass market branding has affected marketing in a variety of ways. Verification is likely to continue to improve with the advent of new digital verification technologies such as blockchain (Catalini and Gans, 2016).

5.1 Pricing: How willingness to pay is bolstered by reputation mechanisms

Digital markets involve small players who may be unfamiliar to potential customers. An estimated 88% of online Visa transactions are with a merchant that the customer does not visit offline (Einav et al., 2017). While brands do play a role online (Brynjolfsson and Smith, 2000; Waldfogel and Chen, 2006), for small players to thrive, other verification mechanisms are needed. Online reputation mechanisms reduce the importance of established brands and enable consumers to trust small online sellers. Furthermore, Hollenbeck (2018) provides evidence that online reputation mechanisms can reduce the importance of offline brands. In particular, the paper demonstrates that high online ratings lead to higher sales in offline independent hotels. Luca (2016) finds a similar result for restaurants. There are many ways that a platform might regulate the behavior of its users, including systems that ban users who behave undesirably. However, the majority of platforms lean on online rating systems, in which past buyers and sellers post ratings for future market participants to see. There is a large literature on the importance of eBay’s online rating system to its success, as well as a variety of papers that explore potential changes and improvements to that system and their impact on prices, market outcomes, and willingness to pay (Resnick and Zeckhauser, 2002; Ba and Pavlou, 2002; Lucking-Reiley et al., 2007; Cabral and Hortacsu, 2010; Hui et al., 2016). For example, Hui et al. (2016) demonstrate that eBay’s reputation system is effective in reducing bad behavior on the part of sellers, but that it needs to be combined with eBay’s ability to punish the worst behavior in order to create a successful marketplace on which small sellers can thrive. Perhaps the key theme of this literature is that online reputation mechanisms increase willingness to pay and sometimes enable markets that otherwise would not exist.
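One common design element of such rating systems can be sketched with a simple example. This is an illustrative Bayesian ("smoothed") average of the kind many platforms use to keep a single early rating from making or breaking a small seller; the prior mean and prior weight are invented parameters, not taken from any platform discussed here:

```python
# Illustrative reputation score: shrink a seller's observed ratings
# toward a site-wide prior mean, with the prior carrying a fixed weight.
# Sellers with few ratings stay near the prior; established sellers'
# scores converge to their true average.

def reputation_score(ratings, prior_mean=4.0, prior_weight=10):
    """Weighted average of observed ratings and a prior mean."""
    n = len(ratings)
    total = sum(ratings) + prior_mean * prior_weight
    return total / (n + prior_weight)

print(reputation_score([5, 5, 5]))   # new seller: pulled toward the prior (~4.23)
print(reputation_score([5] * 200))   # established seller: close to 5
```

The shrinkage parameter trades off how quickly a new seller's score responds to early ratings against how easily it can be manipulated by a handful of reviews.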

5.2 Product: Is a product’s ‘rating’ now an integral product feature?

In addition to enhancing trust and willingness-to-pay, rating systems provide information on product quality. The rating becomes a key feature of a platform. Ratings inform consumers about the best products available within the platform, and are therefore a key element of the overall product offering. Platforms benefit because rating information guides consumers to the highest quality products. For example, Chevalier and Mayzlin (2006) demonstrate that positive reviews lead to higher sales in the context of online retail. Even online identities that are consistent over time but not connected to a name or home address can influence consumption (Yoganarasimhan, 2012). For some online platforms, such as Yelp, providing ratings about offline settings is itself the product. As noted above, Luca (2016) and Hollenbeck (2018) show that high online ratings improve sales in offline restaurants and hotels, particularly for independents. In both cases, the online rating system is a substitute for a widely known chain brand. Godes and Silva (2012) also show that such ratings can exhibit dynamics that reflect real economic effects. This insight is built on by Muchnik et al. (2013), who document herding in ratings behavior on a news website. Beyond platform-controlled rating systems as an integral product feature, organic and digital forms of word-of-mouth are also essential heuristics that consumers use when making purchase decisions about a product (Godes and Mayzlin, 2009). Work such as Toubia and Stephen (2013) has also studied why consumers post word of mouth on platforms such as Twitter, drawing a distinction between the intrinsic and extrinsic utility that consumers derive from posting. Lambrecht et al. (2018), however, suggest that some of the most attractive potential spreaders of word-of-mouth, people who start memes on social platforms, are also the most resistant to advertising.




5.3 Placement: How can channels reduce reputation system failures?

In addition to understanding the successes of reputation systems, a wide literature has explored when reputation systems fail. A key source of failure is the inability to verify whether the person doing the online rating actually experienced the product. Mayzlin et al. (2014) and Luca and Zervas (2016) show evidence that firms seem to give themselves high ratings while giving low ratings to their competitors. A related issue is selection bias in who chooses to provide ratings (Nosko and Tadelis, 2015). Anderson and Simester (2014) show evidence of a related problem: many reviewers never purchase the product. They review anyway, and these reviews distort the information available. In response to these and other concerns, platforms regularly update their reputation systems. For example, Fradkin et al. (2017) document two experiments conducted at Airbnb to improve its reputation system. Strikingly, the challenge the platform faced was not an excess of ‘fake’ reviews but rather incentivizing users to give accurate accounts of negative experiences; the paper established that too much ‘favorable’ opinion can be a problem in such settings. The existing literature has provided a broad sense of when and how online reputation systems might fail. This suggests new opportunities for scholars focused on market design. Given the challenges in building online reputation systems, it is important to carefully model and build systems that are robust to these failures.

5.4 Promotion: Can verification lead to discrimination in how goods are promoted?

Improved verification technology means that the early expectations of online anonymity have not been met. For example, early literature showed that online car purchases could avoid the transmission of race and gender information, thereby reducing discrimination based on these characteristics (Scott Morton et al., 2003). As verification technology has improved, this anonymity has largely disappeared from many online transactions. This has led to concerns that online identities can be used to discriminate. For example, when information about race or gender is revealed online, consumers receive advertisements for different products and may even receive offers of different prices (Pope and Sydnor, 2011; Doleac and Stein, 2013; Edelman and Luca, 2014). One recent example has been the question of algorithmic bias in the way that advertising is distributed, something that has been highlighted by computer scientists (Sweeney, 2013; Datta et al., 2015). In marketing and economics, Lambrecht and Tucker (2018) show that a career ad intended to highlight careers in the STEM fields was shown to more men than women because of the price mechanism underlying the distribution of ads: male eyeballs are cheaper than female eyeballs, so an ad algorithm that is trying to be cost-effective will show any ad to fewer women than men.


This type of apparent algorithmic bias is a surprising consequence of improvements in verification technology. In the past, it was not possible to verify gender easily. Instead, firms used content to separate out likely gender affiliation, such as assuming men were more likely to read fishing magazines and women more likely to read beauty magazines. In a digital ecosystem where characteristics such as gender can be verified, however, the ability to classify by gender may inadvertently lead to perceptions of bias in settings where distributing content in a non-gender-neutral way is problematic.
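The price mechanism behind this pattern can be illustrated with a hypothetical simulation, in the spirit of Lambrecht and Tucker (2018). The prices and inventory figures below are invented; the point is only that a budget-constrained, cost-minimizing delivery rule shows a gender-neutral ad to fewer members of the more expensive audience:

```python
# Hypothetical cost-minimizing ad delivery: buy the cheapest eyeballs
# first until the budget runs out. With women's impressions priced
# higher (as Lambrecht and Tucker document), a neutral ad reaches
# fewer women even though no one intended to target by gender.

def allocate(budget, groups):
    """groups: {name: (cpm_dollars, inventory)} -> {name: impressions bought}."""
    out = {g: 0 for g in groups}
    for g in sorted(groups, key=lambda g: groups[g][0]):  # cheapest first
        cpm, inventory = groups[g]
        affordable = int(budget / cpm * 1000)
        bought = min(affordable, inventory)
        out[g] = bought
        budget -= bought * cpm / 1000
    return out

groups = {"men": (5.0, 100_000), "women": (9.0, 100_000)}  # invented CPMs
print(allocate(1000.0, groups))  # {'men': 100000, 'women': 55555}
```

The skew here is purely a by-product of cost minimization under differential prices, which is exactly why it can arise without any discriminatory intent on the advertiser's part.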

6 Conclusions

Digital marketing is inherently different from offline marketing due to a reduction in five categories of costs: search, reproduction, transportation, tracking, and verification. In defining the scope of this article, we drew boundaries. We focus on understanding the impact of the technology on marketing using an economic perspective. Therefore, we did not discuss much work written in marketing that focuses on methodology, such as the literature on statistical modeling in digital environments (Johnson et al., 2004; Moe and Schweidel, 2012; Netzer et al., 2012). We also did not detail the consumer behavior literature on the effect of digital environments (Berger and Milkman, 2012; Castelo et al., 2015). This overview highlights that the changes to marketing that result from the change in costs inherent in the digital context are not as obvious as initial economic models may imply. Instead, the complexities of both firm and consumer behavior have led to less predictable outcomes, and it is these less predictable outcomes that have allowed marketing contexts to inform the economics literature on the likely effects of digitization outside of marketing. Going forward, we anticipate the most influential work to fall into one of three categories. First, there are still many opportunities to unpack the existing models and identify new complexities in how the drop in search, reproduction, transportation, tracking, and verification costs affects various aspects of marketing. Many recent papers fall in this category, including Blake et al. (2015), Simonov et al. (2018a), Hollenbeck (2018), and Farronato and Fradkin (2018). In the above discussion, we have highlighted some areas that we see as particularly important topics for future research. Second, as policies change, new business models arise, and new technologies diffuse, there will be opportunities to understand these changes in light of existing models.
Recent papers of this type include Bart et al. (2014), Miller and Tucker (2018), Lambrecht and Tucker (2018), and Johnson et al. (2017c). Third, some of the changes brought by digitization and other advances in information technology will require recognition of different types of cost changes. Just as the early internet literature emphasized search, reproduction, and transportation costs, with tracking and verification costs recognized as important only later, we anticipate that technological change will lead to the application of other well-established models in new contexts. For example, one recent hypothesis is that advances in machine learning can be framed as a drop in the cost of prediction, which can be modeled as a reduction in uncertainty (Agrawal et al., 2018). For each of these categories, economic theory plays a fundamental role. Search theory provided much of the initial impetus for the digital marketing literature. It provided hypotheses on prices, price dispersion, and product variety. Some of these hypotheses were supported, but others were not. In turn, this generated new models that could explain the data, and the cycle continued. Models of reproduction, transportation, tracking, and verification costs played similar roles. This has led to a much deeper understanding of the consequences of digitization for marketing.

References

Acquisti, A., Brandimarte, L., Loewenstein, G., 2015. Privacy and human behavior in the age of information. Science 347 (6221), 509–514.
Acquisti, A., John, L.K., Loewenstein, G., 2013. What is privacy worth? The Journal of Legal Studies 42 (2), 249–274.
Acquisti, A., Varian, H.R., 2005. Conditioning prices on purchase history. Marketing Science 24 (3), 367–381.
Agrawal, A., Catalini, C., Goldfarb, A., 2015. Crowdfunding: geography, social networks, and the timing of investment decisions. Journal of Economics and Management Strategy 24 (2), 253–274.
Agrawal, A., Gans, J., Goldfarb, A., 2018. Prediction Machines: The Simple Economics of Artificial Intelligence. Harvard Business Press.
Aguiar, L., Waldfogel, J., 2018. Quality predictability and the welfare benefits from new products: evidence from the digitization of recorded music. Journal of Political Economy 126 (2), 492–524.
Anand, B., Shachar, R., 2009. Targeted advertising as a signal. Quantitative Marketing and Economics 7 (3), 237–266.
Anderson, C., 2006. The Long Tail. Hyperion.
Anderson, E.T., Simester, D.I., 2014. Reviews without a purchase: low ratings, loyal customers, and deception. Journal of Marketing Research 51 (3), 249–269.
Andrews, M., Luo, X., Fang, Z., Ghose, A., 2015. Mobile ad effectiveness: hyper-contextual targeting with crowdedness. Marketing Science 35 (2), 218–233.
Anenberg, E., Kung, E., 2015. Information technology and product variety in the city: the case of food trucks. Journal of Urban Economics 90, 60–78.
Ansari, A., Mela, C., 2003. E-customization. Journal of Marketing Research 40 (2), 131–145.
Armstrong, M., 2006. Competition in two-sided markets. The Rand Journal of Economics 37 (3), 668–691.
Arnosti, N., Beck, M., Milgrom, P., 2016. Adverse selection and auction design for Internet display advertising. The American Economic Review 106 (10), 2852–2866.
Athey, S., Calvano, E., Gans, J.S., 2016. The impact of consumer multi-homing on advertising markets and media competition. Management Science 64 (4), 1574–1590.
Athey, S., Catalini, C., Tucker, C.E., 2017a. The Digital Privacy Paradox: Small Money, Small Costs, Small Talk. Working Paper. MIT.
Athey, S., Ellison, G., 2011. Position auctions with consumer search. The Quarterly Journal of Economics 126 (3), 1213–1270.
Athey, S., Mobius, M., Pal, J., 2017b. The Impact of News Aggregators on Internet News Consumption: The Case of Localization. Mimeo. Stanford University.
Ba, S., Pavlou, P., 2002. Evidence of the effect of trust building technology in electronic markets: price premiums and buyer behavior. Management Information Systems Quarterly 26, 243–268.


Bagwell, K., 2007. The economic analysis of advertising. In: Armstrong, M., Porter, R. (Eds.), Handbook of Industrial Organization, vol. 3. Elsevier, pp. 1701–1844 (Chap. 28).
Bakos, Y., 2001. The emerging landscape for retail e-commerce. The Journal of Economic Perspectives 15 (1), 69–80.
Balasubramanian, S., 1998. Mail versus mall: a strategic analysis of competition between direct marketers and conventional retailers. Marketing Science 17 (3), 181–195.
Bapna, R., Ramaprasad, J., Shmueli, G., Umyarov, A., 2016. One-way mirrors in online dating: a randomized field experiment. Management Science 62 (11), 3100–3122.
Bar-Isaac, H., Caruana, G., Cunat, V., 2012. Search, design, and market structure. The American Economic Review 102 (2), 1140–1160.
Bart, Y., Stephen, A.T., Sarvary, M., 2014. Which products are best suited to mobile advertising? A field study of mobile display advertising effects on consumer attitudes and intentions. Journal of Marketing Research 51 (3), 270–285.
Baye, M.R., Morgan, J., 2004. Price dispersion in the lab and on the Internet: theory and evidence. The Rand Journal of Economics 35 (3), 449–466.
Bell, D.R., Gallino, S., Moreno, A., 2018. Offline experiences and value creation in omnichannel retail. Available at SSRN.
Bell, D.R., Song, S., 2007. Neighborhood effects and trial on the Internet: evidence from online grocery retailing. Quantitative Marketing and Economics 5 (4), 361–400.
Bergemann, D., Bonatti, A., 2011. Targeting in advertising markets: implications for offline versus online media. The Rand Journal of Economics 42 (3), 417–443.
Berger, J., Milkman, K.L., 2012. What makes online content viral? Journal of Marketing Research 49 (2), 192–205.
Blake, T., Nosko, C., Tadelis, S., 2015. Consumer heterogeneity and paid search effectiveness: a large scale field experiment. Econometrica 83 (1), 155–174.
Bleier, A., Eisenbeiss, M., 2015. Personalized online advertising effectiveness: the interplay of what, when, and where. Marketing Science 34 (5), 669–688.
Blum, B., Goldfarb, A., 2006. Does the Internet defy the law of gravity? Journal of International Economics 70 (2), 384–405.
Borenstein, S., Saloner, G., 2001. Economics and electronic commerce. The Journal of Economic Perspectives 15 (1), 3–12.
Bronnenberg, B.J., Kim, J.B., Mela, C.F., 2016. Zooming in on choice: how do consumers search for cameras online? Marketing Science 35 (5), 693–712.
Brown, J., Goolsbee, A., 2002. Does the Internet make markets more competitive? Evidence from the life insurance industry. Journal of Political Economy 110 (3), 481–507.
Brynjolfsson, E., Hu, Y., Rahman, M., 2009. Battle of the retail channels: how product selection and geography drive cross-channel competition. Management Science 55 (11), 1755–1765.
Brynjolfsson, E., Hu, Y., Simester, D., 2011. Goodbye Pareto principle, hello long tail: the effect of search costs on the concentration of product sales. Management Science 57 (8), 1373–1386.
Brynjolfsson, E., Hu, Y.J., Smith, M.D., 2003. Consumer surplus in the digital economy: estimating the value of increased product variety at online booksellers. Management Science 49 (11), 1580–1596.
Brynjolfsson, E., Smith, M., 2000. Frictionless commerce? A comparison of Internet and conventional retailers. Management Science 46 (4), 563–585.
Cabral, L., Hortacsu, A., 2010. Dynamics of seller reputation: theory and evidence from eBay. Journal of Industrial Economics 58 (1), 54–78.
Calzada, J., Gil, R., 2017. What Do News Aggregators Do? Evidence from Google News in Spain and Germany. Mimeo. University of Barcelona.
Campbell, J., Goldfarb, A., Tucker, C., 2015. Privacy regulation and market structure. Journal of Economics & Management Strategy 24 (1), 47–73.
Cao, X., Ke, T., 2019. Cooperative search advertising. Marketing Science 38 (1), 44–67.
Castelo, N., Hardy, E., House, J., Mazar, N., Tsai, C., Zhao, M., 2015. Moving citizens online: using salience & message framing to motivate behavior change. Behavioral Science and Policy 1 (2), 57–68.



Catalini, C., Gans, J.S., 2016. Some Simple Economics of the Blockchain. SSRN Working Paper 2874598. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2874598.
Cavallo, A., 2017. Are online and offline prices similar? Evidence from large multi-channel retailers. The American Economic Review 107 (1), 283–303.
Celis, L.E., Lewis, G., Mobius, M., Nazerzadeh, H., 2014. Buy-it-now or take-a-chance: price discrimination through randomized auctions. Management Science 60 (12), 2927–2948.
Chen, Y., Narasimhan, C., Zhang, Z.J., 2001. Individual marketing with imperfect targetability. Marketing Science 20 (1), 23–41.
Chen, Y., Zhang, T., 2011. Equilibrium price dispersion with heterogeneous searchers. International Journal of Industrial Organization 29 (6), 645–654.
Chevalier, J., Mayzlin, D., 2006. The effect of word of mouth online: online book reviews. Journal of Marketing Research 43, 345–354.
Chiou, L., Tucker, C., 2017a. Content aggregation by platforms: the case of the news media. Journal of Economics and Management Strategy 26 (4), 782–805.
Chiou, L., Tucker, C., 2017b. Search Engines and Data Retention: Implications for Privacy and Antitrust. Discussion Paper. National Bureau of Economic Research.
Chiou, L., Tucker, C., 2018. Fake News and Advertising on Social Media: A Study of the Anti-Vaccination Movement. Working Paper 25223. National Bureau of Economic Research.
Choi, J., Bell, D., 2011. Preference minorities and the Internet. Journal of Marketing Research 58 (3), 670–682.
Choi, J., Hui, S.K., Bell, D.R., 2010. Spatiotemporal analysis of imitation behavior across new buyers at an online grocery retailer. Journal of Marketing Research 47 (1), 75–89.
Chu, J., Chintagunta, P., Cebollada, J., 2008. Research note – a comparison of within-household price sensitivity across online and offline channels. Marketing Science 27 (2), 283–299.
Coviello, L., Gneezy, U., Goette, L., 2017. A Large-Scale Field Experiment to Evaluate the Effectiveness of Paid Search Advertising. CESifo Working Paper Series No. 6684.
Cullen, Z., Farronato, C., 2016. Outsourcing Tasks Online: Matching Supply and Demand on Peer-to-Peer Internet Platforms. Working Paper. Harvard University.
Dana, James D.J., Orlov, E., 2014. Internet penetration and capacity utilization in the US airline industry. American Economic Journal: Microeconomics 6 (4), 106–137.
Danaher, B., Smith, M.D., 2014. Gone in 60 seconds: the impact of the megaupload shutdown on movie sales. International Journal of Industrial Organization 33, 1–8.
Danaher, B., Smith, M.D., Telang, R., 2013. Piracy and Copyright Enforcement Mechanisms. University of Chicago Press, pp. 25–61.
Datta, A., Tschantz, M.C., Datta, A., 2015. Automated experiments on ad privacy settings. Proceedings on Privacy Enhancing Technologies 2015 (1), 92–112.
Datta, H., Knox, G., Bronnenberg, B.J., 2018. Changing their tune: how consumers’ adoption of online streaming affects music consumption and discovery. Marketing Science 37 (1), 5–21.
De los Santos, B.I., Hortacsu, A., Wildenbeest, M., 2012. Testing models of consumer search using data on web browsing and purchasing behavior. The American Economic Review 102 (6), 2955–2980.
Dellarocas, C., 2003. The digitization of word of mouth: promise and challenges of online feedback mechanisms. Management Science 49 (10), 1407–1424.
Dellarocas, C., Katona, Z., Rand, W., 2013. Media, aggregators, and the link economy: strategic hyperlink formation in content networks. Management Science 59 (10), 2360–2379.
Diamond, P., 1971. A model of price adjustment. Journal of Economic Theory 3 (2), 156–168.
Dinerstein, M., Einav, L., Levin, J., Sundaresan, N., 2018. Consumer price search and platform design in Internet commerce. The American Economic Review 108 (7), 1820–1859.
Doleac, J.L., Stein, L.C., 2013. The visible hand: race and online market outcomes. The Economic Journal 123 (572), F469–F492.
Dube, J.-P., Misra, S., 2017. Scalable Price Targeting. Working Paper. University of Chicago.
Eckles, D., Gordon, B.R., Johnson, G.A., 2018. Field studies of psychologically targeted ads face threats to internal validity. Proceedings of the National Academy of Sciences, 201805363.
Edelman, B., Luca, M., 2014. Digital Discrimination: The Case of Airbnb.com. HBS Working Paper.


Edelman, B., Ostrovsky, M., Schwarz, M., 2007. Internet advertising and the generalized second-price auction: selling billions of dollars worth of keywords. The American Economic Review 97 (1), 242–259.
Einav, L., Farronato, C., Levin, J., Sundaresan, N., 2018. Auctions versus posted prices in online markets. Journal of Political Economy 126 (1), 178–215.
Einav, L., Klenow, P., Klopack, B., Levin, J., Levin, L., Best, W., 2017. Assessing the Gains from ECommerce. Working Paper. Stanford University.
Elberse, A., Eliashberg, J., 2003. Demand and supply dynamics for sequentially released products in international markets: the case of motion pictures. Marketing Science 22 (3), 329–354.
Ellison, G., Ellison, S.F., 2005. Lessons about markets from the Internet. The Journal of Economic Perspectives 19 (2), 139–158.
Ellison, G., Ellison, S.F., 2009. Search, obfuscation, and price elasticities on the Internet. Econometrica 77 (2), 427–452.
Ellison, G., Ellison, S.F., 2017. Match Quality, Search, and the Internet Market for Used Books. Working Paper. MIT.
Erdem, T., Swait, J., 1998. Brand equity as a signaling phenomenon. Journal of Consumer Psychology 7 (2), 131–157.
Ershov, D., 2019. The Effect of Consumer Search Costs on Entry and Quality in the Mobile App Market. Working Paper. Toulouse School of Economics.
Farrell, J., 2012. Can privacy be just another good? Journal on Telecommunications and High Technology Law 10, 251.
Farronato, C., Fradkin, A., 2018. The Welfare Effects of Peer Entry in the Accommodation Market: The Case of Airbnb. Discussion Paper. National Bureau of Economic Research.
Fleder, D.M., Hosanagar, K., 2009. Blockbuster culture’s next rise or fall: the impact of recommender systems on sales diversity. Management Science 55 (5), 697–712.
Fong, N.M., Fang, Z., Luo, X., 2015. Geo-conquesting: competitive locational targeting of mobile promotions. Journal of Marketing Research 52 (5), 726–735.
Forman, C., Ghose, A., Goldfarb, A., 2009. Competition between local and electronic markets: how the benefit of buying online depends on where you live. Management Science 55 (1), 47–57.
Forman, C., Ghose, A., Wiesenfeld, B., 2008. Examining the relationship between reviews and sales: the role of reviewer identity disclosure in electronic markets. Information Systems Research 19 (3), 291–313.
Fradkin, A., 2017. Search, Matching, and the Role of Digital Marketplace Design in Enabling Trade: Evidence from Airbnb. Mimeo. MIT Sloan School of Business.
Fradkin, A., Grewal, E., Holtz, D., 2017. The determinants of online review informativeness: evidence from field experiments on Airbnb. Working Paper. MIT Sloan School of Management.
Fudenberg, D., Villas-Boas, J.M., 2007. Behaviour-based price discrimination and customer recognition. In: Hendershott, T. (Ed.), Economics and Information Systems, vol. 1. Elsevier Science, Oxford.
Fudenberg, D., Villas-Boas, J.M., 2012. Price discrimination in the digital economy. In: Oxford Handbook of the Digital Economy. Oxford University Press.
Gal-Or, E., Gal-Or, M., 2005. Customized advertising via a common media distributor. Marketing Science 24, 241–253.
Gandal, N., 2006. The effect of native language on Internet use. International Journal of the Sociology of Language 182, 25–40.
Gentzkow, M., 2007. Valuing new goods in a model with complementarity: online newspapers. The American Economic Review 97 (3), 713–744.
Ghose, A., Goldfarb, A., Han, S.P., 2013. How is the mobile Internet different? Search costs and local activities. Information Systems Research 24 (3), 613–631.
Ghose, A., Ipeirotis, P.G., Li, B., 2012. Designing ranking systems for hotels on travel search engines by mining user-generated and crowdsourced content. Marketing Science 31 (3), 493–520.
Godes, D., Mayzlin, D., 2009. Firm-created word-of-mouth communication: evidence from a field test. Marketing Science 28 (4), 721–739.
Godes, D., Silva, J.C., 2012. Sequential and temporal dynamics of online opinion. Marketing Science 31 (3), 448–473.



Godinho de Matos, M., Ferreira, P., Smith, M.D., 2018. The effect of subscription video-on-demand on piracy: evidence from a household-level randomized experiment. Management Science 64 (12), 5610–5630.
Goldfarb, A., 2014. What is different about online advertising? Review of Industrial Organization 44 (2), 115–129.
Goldfarb, A., Tucker, C., 2011a. Advertising bans and the substitutability of online and offline advertising. Journal of Marketing Research 48 (2), 207–227.
Goldfarb, A., Tucker, C., 2011b. Online advertising. In: Zelkowitz, M.V. (Ed.), Advances in Computers, vol. 81. Elsevier, pp. 289–315.
Goldfarb, A., Tucker, C., 2011c. Online display advertising: targeting and obtrusiveness. Marketing Science 30, 389–404.
Goldfarb, A., Tucker, C., 2011d. Search engine advertising: channel substitution when pricing ads to context. Management Science 57 (3), 458–470.
Goldfarb, A., Tucker, C., 2011e. Privacy regulation and online advertising. Management Science 57 (1), 57–71.
Goldfarb, A., Tucker, C., 2012. Privacy and innovation. In: Innovation Policy and the Economy, vol. 12. National Bureau of Economic Research, Inc. NBER Chapters.
Goldfarb, A., Tucker, C., 2019. Digital economics. Journal of Economic Literature 57 (1), 3–43.
Goolsbee, A., 2001. Competition in the computer industry: online versus retail. Journal of Industrial Economics 49 (4), 487–499.
Gordon, B., Zettelmeyer, F., Bhargava, N., Chapsky, D., 2019. A comparison of approaches to advertising measurement: evidence from big field experiments at Facebook. Marketing Science 38 (2), 193–225.
Grossman, G.M., Shapiro, C., 1984. Informative advertising with differentiated products. The Review of Economic Studies 51 (1), 63–81.
Hall, J., Kendrick, C., Nosko, C., 2016. The Effects of Uber’s Surge Pricing: A Case Study. Working Paper. Uber.
Hampton, K., Wellman, B., 2003. Neighboring in Netville: how the Internet supports community and social capital in a wired suburb. City and Community 2 (4), 277–311.
Hauser, J.R., Liberali, G., Urban, G.L., 2014. Website morphing 2.0: switching costs, partial exposure, random exit, and when to morph. Management Science 60 (6), 1594–1616.
Hermalin, B., Katz, M., 2006. Privacy, property rights and efficiency: the economics of privacy as secrecy. Quantitative Marketing and Economics 4 (3), 209–239.
Hoban, P.R., Bucklin, R.E., 2015. Effects of Internet display advertising in the purchase funnel: model-based insights from a randomized field experiment. Journal of Marketing Research 52 (3), 375–393.
Hollenbeck, B., 2018. Online reputation mechanisms and the decreasing value of brands. Journal of Marketing Research 55 (5), 636–654.
Honka, E., 2014. Quantifying search and switching costs in the US auto insurance industry. The Rand Journal of Economics 45 (4), 847–884.
Honka, E., Chintagunta, P., 2017. Simultaneous or sequential? Search strategies in the U.S. auto insurance industry. Marketing Science 36 (1), 21–42.
Horton, J.J., Zeckhauser, R.J., 2016. Owning, Using and Renting: Some Simple Economics of the “Sharing Economy”. NBER Working Paper No. 22029.
Hossain, T., Morgan, J., 2006. ... Plus shipping and handling: revenue (non)equivalence in field experiments on eBay. Advances in Economic Analysis and Policy 6 (3).
Hotelling, H., 1929. Stability in competition. The Economic Journal 39 (153), 41–57.
Hui, X., Saeedi, M., Shen, Z., Sundaresan, N., 2016. Reputation and regulations: evidence from eBay. Management Science 62 (12), 3604–3616.
Israeli, A., 2018. Online MAP enforcement: evidence from a quasi-experiment. Marketing Science, 539–564.
Iyer, G., Soberman, D., Villas-Boas, M., 2005. The targeting of advertising. Marketing Science 24 (3), 461.
Jeziorski, P., Moorthy, S., 2017. Advertiser prominence effects in search advertising. Management Science 64 (3), 1365–1383.


Johnson, E.J., Moe, W.W., Fader, P.S., Bellman, S., Lohse, G.L., 2004. On the depth and dynamics of online search behavior. Management Science 50 (3), 299–308. Johnson, G., 2014. The impact of privacy policy on the auction market for online display advertising. Available at SSRN 2333193. Johnson, G.A., Lewis, R.A., Nubbemeyer, E.I., 2017a. Ghost ads: improving the economics of measuring online ad effectiveness. Journal of Marketing Research 54 (6), 867–884. Johnson, G.A., Lewis, R.A., Reiley, D.H., 2017b. When less is more: data and power in advertising experiments. Marketing Science 36 (1), 43–53. Johnson, G.A., Shriver, S., Du, S., 2017c. Consumer Privacy Choice in Online Advertising: Who Opts Out and at What Cost to Industry? Mimeo. Boston University. Joo, M., Wilbur, K.C., Cowgill, B., Zhu, Y., 2014. Television advertising and online search. Management Science 60 (1), 56–73. Jullien, B., 2012. Two-sided B to B platforms. In: Peitz, M., Waldfogel, J. (Eds.), Oxford Handbook of the Digital Economy. Oxford University Press, pp. 161–185. Kalyanam, K., McAteer, J., Marek, J., Hodges, J., Lin, L., 2017. Cross channel effects of search engine advertising on brick & mortar retail sales: meta analysis of large scale field experiments on Google.com. Quantitative Marketing and Economics 16 (1), 1–42. Keller, K.L., 2003. Strategic Brand Management, second edition. Prentice Hall. Lambrecht, A., Misra, K., 2016. Fee or free: when should firms charge for online content? Management Science 63 (4), 1150–1165. Lambrecht, A., Tucker, C., 2013. When does retargeting work? Information specificity in online advertising. Journal of Marketing Research 50 (5), 561–576. Lambrecht, A., Tucker, C., Wiertz, C., 2018. Advertising to early trend propagators: evidence from Twitter. Marketing Science 37 (2), 177–199. Lambrecht, A., Tucker, C.E., 2018. Algorithmic bias? An empirical study into apparent gender-based discrimination in the display of STEM career ads. Management Science. 
Forthcoming. Levin, J., Milgrom, P., 2010. Online advertising: heterogeneity and conflation in market design. The American Economic Review 100 (2), 603–607. Lewis, R., Nguyen, D., 2015. Display advertising’s competitive spillovers to consumer search. Quantitative Marketing and Economics 13 (2), 93–115. Lewis, R.A., Justin, M.R., Reiley, D.H., 2011. Here, there, and everywhere: correlated online behaviors can lead to overestimates of the effects of advertising. In: Proceedings of the 20th ACM International World Wide Web Conference [WWW’11], pp. 157–166. Lewis, R.A., Rao, J.M., 2015. The unfavorable economics of measuring the returns to advertising. The Quarterly Journal of Economics 130 (4), 1941. Lewis, R.A., Reiley, D.H., 2014. Online ads and offline sales: measuring the effect of retail advertising via a controlled experiment on Yahoo! Quantitative Marketing and Economics 12 (3), 235–266. Li, H.A., Kannan, P.K., Viswanathan, S., Pani, A., 2016. Attribution strategies and return on keyword investment in paid search advertising. Marketing Science 35 (6), 831–848. Li, X., MacGarvie, M., Moser, P., 2015. Dead Poet’s Property – How Does Copyright Influence Price? Working Paper 21522. National Bureau of Economic Research. Liu, Y., Zhang, Z.J., 2006. Research note—the benefits of personalized pricing in a channel. Marketing Science 25 (1), 97–105. Lodish, L.M., Abraham, M., Kalmenson, S., Livelsberger, J., Lubetkin, B., Richardson, B., Stevens, M.E., 1995. How T.V. advertising works: a meta-analysis of 389 real world split cable T.V. advertising experiments. Journal of Marketing Research 32 (2), 125–139. Luca, M., 2016. Reviews, Reputation, and Revenue: The Case of Yelp.com. Harvard Business School NOM Unit Working Paper 12-016. Luca, M., Zervas, G., 2016. Fake it till you make it: reputation, competition, and Yelp review fraud. Management Science 62 (12), 3412–3427. Lucking-Reiley, D., Bryan, D., Prasad, N., Reeves, D., 2007. 
Pennies from eBay: the determinants of price in online auctions. Journal of Industrial Economics 55 (2), 223–233.



CHAPTER 5 Digital marketing

Manchanda, P., Dube, J.-P., Goh, K.Y., Chintagunta, P.K., 2006. The effect of banner advertising on Internet purchasing. Journal of Marketing Research 43 (1), 98–108. Mayzlin, D., Dover, Y., Chevalier, J., 2014. Promotional reviews: an empirical investigation of online review manipulation. The American Economic Review 104 (8), 2421–2455. Miller, A., Tucker, C., 2011. Can healthcare information technology save babies? Journal of Political Economy 119 (2), 289–324. Miller, A., Tucker, C., 2014. Health information exchange, system size and information silos. Journal of Health Economics 33 (2), 28–42. Miller, A.R., Tucker, C., 2009. Privacy protection and technology diffusion: the case of electronic medical records. Management Science 55 (7), 1077–1093. Miller, A.R., Tucker, C., 2018. Privacy protection, personalized medicine, and genetic testing. Management Science 64 (10), 4648–4668. Moe, W.W., Schweidel, D.A., 2012. Online product opinions: incidence, evaluation, and evolution. Marketing Science 31 (3), 372–386. Mortimer, J.H., Nosko, C., Sorensen, A., 2012. Supply responses to digital distribution: recorded music and live performances. Information Economics and Policy 24 (1), 3–14. Moshary, S., Blake, T., Sweeney, K., Tadelis, S., 2017. Price Salience and Product Choice. Working Paper. University of Pennsylvania. Muchnik, L., Aral, S., Taylor, S.J., 2013. Social influence bias: a randomized experiment. Science 341 (6146), 647–651. Murthi, B., Sarkar, S., 2003. The role of the management sciences in research on personalization. Management Science 49 (10), 1344–1362. Narayanan, S., Kalyanam, K., 2015. Position effects in search advertising and their moderators: a regression discontinuity approach. Marketing Science 34 (3), 388–407. Netzer, O., Feldman, R., Goldenberg, J., Fresko, M., 2012. Mine your own business: market-structure surveillance through text mining. Marketing Science 31 (3), 521–543. Nosko, C., Tadelis, S., 2015. 
The Limits of Reputation in Platform Markets: An Empirical Analysis and Field Experiment. Working Paper 20830. National Bureau of Economic Research. Orlov, E., 2011. How does the Internet influence price dispersion? Evidence from the airline industry. Journal of Industrial Economics 59 (1), 21–37. Peitz, M., Waelbroeck, P., 2006. Why the music industry may gain from free downloading – the role of sampling. International Journal of Industrial Organization 24 (5), 907–913. Peterson, R.A., Balasubramanian, S., Bronnenberg, B.J., 1997. Exploring the implications of the Internet for consumer marketing. Journal of the Academy of Marketing Science 25 (4), 329–346. Pope, D.G., Sydnor, J.R., 2011. What’s in a picture? Evidence of discrimination from Prosper.com. The Journal of Human Resources 46 (1), 53–92. Pozzi, A., 2013. E-commerce as a stockpiling technology: implications for consumer savings. International Journal of Industrial Organization 31 (6), 677–689. Prince, J., 2007. The beginning of online/retail competition and its origins: an application to personal computers. International Journal of Industrial Organization 25 (1), 139–156. Quan, T.W., Williams, K.R., 2018. Product variety, across-market demand heterogeneity, and the value of online retail. The Rand Journal of Economics 49 (4), 877–913. Rao, A., Hartmann, W.R., 2015. Quality vs. variety: trading larger screens for more shows in the era of digital cinema. Quantitative Marketing and Economics 13 (2), 117–134. Rao, V.R., 1984. Pricing research in marketing: the state of the art. Journal of Business, S39–S60. Reimers, I., 2016. Can private copyright protection be effective? Evidence from book publishing. The Journal of Law and Economics 59 (2), 411–440. Resnick, P., Zeckhauser, R., 2002. Trust among strangers in Internet transactions: empirical analysis form eBay auctions. In: Baye, M. (Ed.), Advances in Applied Microeconomics (vol. 11). Elsevier Science, Amsterdam, pp. 667–719. Rochet, J.-C., Tirole, J., 2003. 
Platform competition in two-sided markets. Journal of the European Economic Association 1 (4), 990–1029.


Rutz, O.J., Bucklin, R.E., 2012. Does banner advertising affect browsing for brands? Clickstream choice model says yes, for some. Quantitative Marketing and Economics 10 (2), 231–257. Sahni, N.S., 2015. Effect of temporal spacing between advertising exposures: evidence from online field experiments. Quantitative Marketing and Economics 13 (3), 203–247. Schwartz, E.M., Bradlow, E.T., Fader, P.S., 2017. Customer acquisition via display advertising using multiarmed bandit experiments. Marketing Science 36 (4), 500–522. Scott Morton, F., Zettelmeyer, F., Silva-Risso, J., 2003. Consumer information and discrimination: does the Internet affect the pricing of new cars to women and minorities? Quantitative Marketing and Economics 1 (1), 65–92. Seamans, R., Zhu, F., 2014. Responses to entry in multisided markets. The impact of craigslist on newspapers. Management Science 60 (2), 476–493. Seim, K., Sinkinson, M., 2016. Mixed pricing in online marketplaces. Quantitative Marketing and Economics 14 (2), 129–155. Shapiro, C., 1983. Premiums for high quality products as returns to reputation. The Quarterly Journal of Economics 98 (4), 659–680. Shapiro, C., Varian, H.R., 1998. Information Rules: A Strategic Guide to the Network Economy. Harvard Business School Press, Boston. Shin, J., Sudhir, K., 2010. A customer management dilemma: when is it profitable to reward one’s own customers? Marketing Science 21 (4), 671–689. Simonov, A., Nosko, C., Rao, J.M., 2018a. Competition and crowd-out for brand keywords in sponsored search. Marketing Science 37 (2), 200–215. Simonov, A., Nosko, C., Rao, J.M., 2018b. Competition and crowd-out for brand keywords in sponsored search. Marketing Science 37 (2), 200–215. Sinai, T., Waldfogel, J., 2004. Geography and the Internet: is the Internet a substitute or a complement for cities? Journal of Urban Economics 56 (1), 1–24. Smith, M.D., Bailey, J., Brynjolfsson, E., 2001. Understanding digital markets: review and assessment. 
In: Brynjolfsson, E., Kahin, B. (Eds.), Understanding the Digital Economy: Data, Tools, and Research. MIT Press, pp. 99–136. Sridhar, S., Sriram, S., 2015. Is online newspaper advertising cannibalizing print advertising? Quantitative Marketing and Economics 13 (4), 283–318. Stigler, G.J., 1961. The economics of information. Journal of Political Economy 69 (3), 213–225. Sweeney, L., 2013. Discrimination in online ad delivery. ACM Queue 11 (3), 10. Tadelis, S., 1999. What’s in a name? Reputation as a tradeable asset. The American Economic Review 89 (3), 548–563. Taylor, C.R., 2004. Consumer privacy and the market for customer information. The Rand Journal of Economics 35 (4), 631–650. Toubia, O., Stephen, A.T., 2013. Intrinsic vs. image-related utility in social media: why do people contribute content to Twitter? Marketing Science 32 (3), 368–392. Tucker, C., 2012. The economics of advertising and privacy. International Journal of Industrial Organization 30 (3), 326–329. Tucker, C., 2014. Social networks, personalized advertising, and privacy controls. Journal of Marketing Research 51 (5), 546–562. Tucker, C., Zhang, J., 2011. How does popularity information affect choices? A field experiment. Management Science 57 (5), 828–842. Valentino-Devries, J., Singer-Vine, J., Soltan, A., 2012. Websites vary prices, deals based on users’ information. The Wall Street Journal. Varian, H., 2007. Position auctions. International Journal of Industrial Organization 25 (6), 1163–1178. Varian, H.R., 1980. A model of sales. The American Economic Review 70 (4), 651–659. Varian, H.R., 2005. Copying and copyright. The Journal of Economic Perspectives 19 (2), 121–138. Verhoef, P.C., Kannan, P.K., Inman, J.J., 2015. From multi-channel retailing to omni-channel retailing: introduction to the special issue on multi-channel retailing. Journal of Retailing 91 (2), 174–181. Vernik, D.A., Purohit, D., Desai, P.S., 2011. Music downloads and the flip side of digital rights management. 
Marketing Science 30 (6), 1011–1027.



CHAPTER 5 Digital marketing

Villas-Boas, J.M., 2004. Price cycles in markets with customer recognition. The Rand Journal of Economics 35 (3), 486–501. Waldfogel, J., 2010. Music file sharing and sales displacement in the iTunes era. Information Economics and Policy 22 (4), 306–314. Waldfogel, J., 2012. Copyright research in the digital age: moving from piracy to the supply of new products. The American Economic Review 102 (3), 337–342. Waldfogel, J., 2016. Cinematic explosion: new products, unpredictability and realized quality in the digital era. Journal of Industrial Economics 64 (4), 755–772. Waldfogel, J., 2018. Digital Renaissance: What Data and Economics Tell Us About the Future of Popular Culture. Princeton University Press. Waldfogel, J., Chen, L., 2006. Does information undermine brand? Information intermediary use and preference for branded web retailers. Journal of Industrial Economics 54 (4), 425–449. Waldfogel, J., Reimers, I., 2015. Storming the gatekeepers: digital disintermediation in the market for books. Information Economics and Policy 31 (C), 47–58. Wang, K., Goldfarb, A., 2017. Can offline stores drive online sales? Journal of Marketing Research 54 (5), 706–719. Xu, K., Chan, J., Ghose, A., Han, S.P., 2017. Battle of the channels: the impact of tablets on digital commerce. Management Science 63 (5), 1469–1492. Yao, S., Mela, C.F., 2011. A dynamic model of sponsored search advertising. Marketing Science 30 (3), 447–468. Yoganarasimhan, H., 2012. Impact of social network structure on content propagation: a study using YouTube data. Quantitative Marketing and Economics 10 (1), 111–150. Zentner, A., 2006. Measuring the effect of file sharing on music purchases. The Journal of Law and Economics 49 (1), 63–90. Zentner, A., Smith, M., Kaya, C., 2013. How video rental patterns change as consumers move online. Management Science 59 (11), 2622–2634. Zervas, G., Proserpio, D., Byers, J.W., 2017. The rise of the sharing economy: estimating the impact of Airbnb on the hotel industry. 
Journal of Marketing Research 54 (5), 687–705. Zettelmeyer, F., Scott Morton, F., Silva-Risso, J., 2001. Internet car retailing. Journal of Industrial Economics 49 (4), 501–519. Zettelmeyer, F., Scott Morton, F., Silva-Risso, J., 2006. How the Internet lowers prices: evidence from matched survey and automobile transaction data. Journal of Marketing Research 43 (2), 168–181. Zhang, J., Liu, P., 2012. Rational herding in microloan markets. Management Science 58 (5), 892–912. Zhang, L., 2018. Intellectual property strategy and the long tail: evidence from the recorded music industry. Management Science 64 (1), 24–42. Zhu, Y., Wilbur, K.C., 2011. Hybrid advertising auctions. Marketing Science 30 (2), 249–273.


CHAPTER 6
The economics of brands and branding✩


Bart J. Bronnenberg(a,b,*), Jean-Pierre Dubé(c,d), Sridhar Moorthy(e)

a Tilburg School of Economics and Management, Tilburg University, Tilburg, The Netherlands
b CEPR, London, United Kingdom
c Booth School of Business, University of Chicago, Chicago, IL, United States
d NBER, Cambridge, MA, United States
e Rotman School of Management, University of Toronto, Toronto, ON, Canada
* Corresponding author: e-mail address: [email protected]

Contents
1 Introduction
2 Brand equity and consumer demand
  2.1 Consumer brand equity as a product characteristic
  2.2 Brand awareness, consideration, and consumer search
    2.2.1 The consumer psychology view on awareness, consideration, and brand choice
    2.2.2 Integrating awareness and consideration into the demand model
    2.2.3 An econometric specification
    2.2.4 Consideration and brand valuation
3 Consumer brand loyalty
  3.1 A general model of brand loyalty
  3.2 Evidence of brand choice inertia
  3.3 Brand choice inertia, switching costs, and loyalty
  3.4 Learning from experience
  3.5 Brand advertising goodwill
4 Brand value to firms
  4.1 Brands and market structure
  4.2 Measuring brand value
    4.2.1 Reduced-form approaches using price and revenue premia
    4.2.2 Structural models
5 Branding and firm strategy
  5.1 Brand as a product characteristic
  5.2 Brands and reputation
  5.3 Branding as a signal

✩ Dubé acknowledges the support of the Kilts Center for Marketing and Moorthy acknowledges the support of the Social Sciences and Humanities Research Council of Canada. The authors thank Tülin Erdem, Pedro Gardete, Avi Goldfarb, Brett Hollenbeck, Carl Mela, Helena Pedrotti, Martin Peitz, Sudhir Voleti, and two anonymous reviewers for comments and suggestions.

Handbook of the Economics of Marketing, Volume 1, ISSN 2452-2619, https://doi.org/10.1016/bs.hem.2019.04.003
Copyright © 2019 Elsevier B.V. All rights reserved.




  5.4 Umbrella branding
    5.4.1 Empirical evidence
    5.4.2 Umbrella branding and reputation
    5.4.3 Umbrella branding and product quality signaling
  5.5 Brand loyalty and equilibrium pricing
  5.6 Brand loyalty and early-mover advantage
6 Conclusions
References

1 Introduction

The economics literature has long puzzled over the concept of brand preference and consumer willingness to pay a price premium for a product differentiated by little other than its brand. In blind taste tests, consumers are often unable to distinguish between their preferred brands and other competing products (Husband and Godfrey, 1934; Thumin, 1962; Allison and Uhl, 1964, p. 336). Nevertheless, branding and brand advertising are perceived to be important investments in sustainable market power:

"A well-known soap-flake which is a branded article costs £150,000 per year to advertise. The price of two unadvertised soap-flakes is considerably less (one of them by more than 50 per cent) than that of the advertised product. Chemically there is absolutely no difference between the advertised product and the two unadvertised soap-flakes. Advertisement alone maintains the fiction that this soap-flake is something superfine. If the advertisement were stopped, the product would become merely one of a number of soap-flakes and would have to be sold at ordinary soap-flake prices. Yet the success of the undertaking, from the producer's point of view, may be seen from the fact that this product brings in half a million net profit per year." (Braithwaite, 1928, p. 30)

Brands have also long been recognized as invaluable assets to firms that create barriers to entry and contribute to supranormal economic profits: “The advantage to established sellers accruing from buyer preferences for their products as opposed to potential-entrant products is on the average larger and more frequent in occurrence at large values than any other barrier to entry.” (Bain, 1956, p. 216)

The conceptual meaning of a brand has evolved over time. According to the Oxford English Dictionary, the word "brand" originated in the 10th century. During the 1600s, the term was used in the American colonies to designate a "mark of ownership impressed on cattle" (Kerschner and Geraghty, 2006, p. 21). Since the 19th century, the term has taken on a commercial role as "a trademark, whether made by burning or otherwise" on items ranging from wine and liquor to timber and metals (Murray, 1887, p. 1055). Current marketing practice interprets the


brand as "a name, symbol, design, or mark that enhances the value of a product beyond its functional purpose," where the added value of these enhancements to the basic product is often broadly termed "brand equity" (Farquar, 1989, p. 24). On the demand side, this added value can comprise consumption benefits, such as image, and information benefits, such as a quality reputation. On the supply side, industry experts associate very high economic value with the commercial rights to leading brands, with reported valuations in the billions of US dollars.1,2

This chapter discusses the economics of brands and branding to understand their impact on the formation of industrial market structures in consumer goods industries. We review the academic literature analyzing the underlying economic mechanisms through which consumers form brand preferences, on the demand side, and the economic incentives for firms to invest in the creation and maintenance of brands, on the supply side. Our discussion builds on earlier surveys of marketing science models of brand equity (e.g., Erdem and Swait, 2014). However, we refer readers seeking a psychological foundation of brands and branding to Muthukrishnan (2015) and Schmitt (2012).

We have organized this chapter around the following topics. Section 2 discusses two principal roles of brands in affecting demand. First, we discuss how brands affect preferences and incorporate brand preferences into a neoclassical "characteristics" model of demand. Here, we discuss how consumer brand equity is estimated from consumer choice data. Second, we discuss the role of brands in generating awareness, directing attention and consumer search, and determining the composition of the consideration sets from which brand choices are made. In Section 3, we focus on the formation of consumer brand preferences over time and the emergence of "brand loyalty." Section 4 discusses brand value estimation from the firm's point of view. In Section 5, we discuss the strategic considerations for firms to create brand value through reputations, the investment in brand capital, and potentially extending the use of a brand name across products marketed under a common brand umbrella. Finally, Section 6 concludes.

2 Brand equity and consumer demand

2.1 Consumer brand equity as a product characteristic

In this subsection, we focus on characteristics models of demand and the role of brand as a quality-enhancing product feature. The incorporation of product quality

1 For instance, according to Forbes magazine, the 100 most valuable brands in 2017 represented a global value of US$1.95 trillion.
2 The broad use of the term "brand equity" in reference to both consumer and firm benefits creates confusion. In some of the literature, the added value of brand enhancements to consumers is termed "brand equity," whereas the added value of brand enhancements to firms is termed "brand value" (see, for instance, Goldfarb et al., 2008).




into the modeling of consumer preferences represented a turning point in the consumption literature, allowing for a more granular analysis of product-level demand as opposed to commodity-group-level demand (Houthakker, 1953). The role of quality was formalized into a "characteristics approach": the product is defined as a bundle of characteristics; consumers have identical perceptions of the objectively measured characteristics comprising a product, and have potentially heterogeneous and subjective preferences for these characteristics (Lancaster, 1971; Baumol, 1967; Rosen, 1974). Early work in characteristics models of demand focused purely on objective attributes and did not consider brand.

Unlike objective product characteristics, consumer brand preferences (or "brand equity") typically comprise intangible, psychological factors and benefits. For instance, Keller's (1993, p. 3) conceptual model of brand equity starts with a consumer's brand knowledge, or "brand node in her memory to which a variety of associations are linked." These associations in memory include the consumer's brand awareness and her perceptions of the brand, or "brand image." But the psychological mechanism through which brand equity affects a consumer's utility from a product presents a challenge for the neoclassical economic model. Economists have historically shied away from the psychological foundations of preferences:

"The economist has little to say about the formation of wants; this is the province of the psychologist. The economist's task is to trace the consequences of a given set of wants." (Friedman, 1962, p. 13)

Not surprisingly, early micro-econometric work took a simplified view of the brand as a mark that merely identifies a specific product and links it to a supplier. In his hedonic specification, Rosen (1974, p. 36) explained: "The terms 'product,' 'model,' 'brand,' and 'design' are used interchangeably to designate commodities of given quality or specification." Accordingly, Rosen (1974, p. 37) assumed: "If two brands offer the same bundle, but sell for different prices, consumers only consider the less expensive one, and the identity of sellers is irrelevant to their purchase decisions." The traditional characteristics approach thus assumes the consumer derives no utility from the brand itself other than through the objective product characteristics; brand choice (i.e., "demand") is governed entirely by the brand's objective characteristics.

A micro-econometric demand specification that excludes brand preferences would have limited predictive power in many product markets. According to the standard characteristics model, "two brands which have approximately the same attribute values should have approximately the same market shares" (Srinivasan, 1979, p. 12), a prediction that is frequently rejected by actual market share data (see, for example, the brand share analysis in Bronnenberg et al., 2007). Blind taste tests with experienced consumers also reveal a strong role for brand. In comparisons of blinded and unblinded taste tests that hold all the attributes of popular national brands fixed except the brand labeling on the packaging, experienced consumers routinely exhibit different preference orderings (Husband and Godfrey, 1934; Thumin, 1962; Allison


and Uhl, 1964). In Allison and Uhl's (1964) study, subjects—males who drank beer at least three times a week—tasted six bottles of beer over a week, first blind, with no brand identifiers, and then non-blind, with all the brand identifiers present. In the blind tasting, the six bottles of beer were actually three different brands with "taste differences discernible to expert taste testers." In the non-blind tasting, the six bottles were actually six different brands—the three that they had originally tasted blind, plus three additional brands. After each tasting, subjects were asked to evaluate the beers, overall and on particular product attributes such as "after-taste," "aroma," and "carbonation." In the blind tasting, subjects generally rated all the beers to be about the same quality—including the brand that they drank most often. However, unblinded, subjects rated each of the original three beer brands higher, and the increases in evaluation varied across brands. Subjects generally rated "their" brands as significantly better than the others even though they could not distinguish them in the blind test. Allison and Uhl conclude: "Participants, in general, did not appear to be able to discern the taste differences among the various beer brands, but apparently labels, and their associations, did influence their evaluations."3

Ratchford (1975) was an early study that acknowledged the close connection between the characteristics approach in economics and the multi-attribute psychometric approaches (e.g., Green and Srinivasan, 1978; Wilkie and Pessemier, 1974) used in consumer psychology to describe and measure brand preferences and brand attitudes. The lab-based nature of psychometric and stated-preference measures limited their broad applicability to the analysis of consumer purchase data in the field.
A parallel literature using stated-preference data, or conjoint analysis,4 instead defined brand equity as a residual, which can be measured as a separate brand fixed effect in addition to the other objective product characteristics (Green and Wind, 1975; Srinivasan, 1979). With the advent of consumer shopping panel data, this same approach to brand equity was incorporated into empirical brand choice models (Guadagni and Little, 1983) derived from random utility theory. We now explore such quantitative models of brand choice.

More formally, we consider the following discrete choice or "brand choice" formulation of demand. Suppose consumers have unit-elastic demands for j = 1, ..., J perfectly substitutable branded goods in a category.5 We also allow for a (J+1)st "outside good," which we interpret for now as the non-purchase choice. Assume the consumer derives the following choice-specific utilities (i.e., conditional indirect utilities):

3 These findings would later inspire the famous "Pepsi Challenge" campaign during the 1970s, in which subjects exhibited a more than 50% chance of choosing Pepsi over Coca Cola in a blind taste test (http://www.businessinsider.com/pepsi-challenge-business-insider-2013-5).
4 The conjoint approach to preference estimation defines a consumer's product preference by conjoining her tastes for the product's underlying attributes, much like the "characteristics approach."
5 Following Rosen (1974), we make the discrete choice assumption for ease of presentation.





$$v_j = U(\psi_j,\, y - p_j) + \varepsilon_j, \qquad j = 1, \ldots, J \tag{1}$$
$$v_{J+1} = U(0, y) + \varepsilon_{J+1}$$

where εj is a random utility component for product j, y is the consumer's budget, and pj is the price of product j. It is straightforward to include additional controls for point-of-sale marketing variables, such as in-store merchandizing like displays, in the model. The key object of interest in our discussion of brand preference is ψj, the consumer's total perceived value from brand j (Guadagni and Little, 1983; Kamakura and Russell, 1989; Louviere and Johnson, 1988; Kamakura and Russell, 1993). In principle, the sign and magnitude of ψj can vary across customers, so that branding can lead to both horizontal and vertical sources of differentiation. In Section 5, we discuss how firms endogenously make branding decisions on the supply side.

The brand choice literature has proposed various methods to extract the intrinsic perceived brand value from ψj. Kamakura and Russell (1993) propose a framework to reconcile the gap between the psychological components of brand preference and the objective product attributes. They use a hierarchical structure that decomposes total brand value as follows:

$$\psi_j = x_j'\beta + \gamma_j \tag{2}$$


where x_j are the objectively measured product attributes, β is a vector of corresponding attribute tastes, and γ_j is an intrinsic utility for the intangible and psychological components of brand j. For the remainder of our discussion, we will refer to γ_j as the intrinsic value of brand j, in reference to the added benefits beyond the usual consumption benefits associated with the objective attributes, x_j. This decomposition reveals a potential identification problem if the attributes of a given brand j do not vary over time or across consumers. In this case, the marginal utilities of all the attributes, β, and the perceptual features of brand j, γ_j, are not separately identified. Kamakura and Russell (1993) impose additional parameter restrictions to resolve the problem in an application to consumer purchase data. Using stated-preference data, such as a conjoint experiment, circumvents the problem by randomizing the attributes x_j. Alternatively, when more granular, individual product- or so-called stock-keeping-unit (SKU)-level data are available, the researcher can exploit the fact that a common brand name may be applied across multiple SKUs with different objective attributes such as pack size, packaging format, and flavor (e.g., Fader and Hardie, 1996). In Section 5, we discuss how firms can create brand differentiation through reputation even when products are otherwise undifferentiated (i.e., x_j = x_k, ∀j, k, for all objective attributes). The total intrinsic brand value in Eq. (2) can be augmented to include subjective and perceptual aspects of the brand, such as biases in consumer perceptions of the objective attributes (Park and Srinivasan, 1994) and image associations. Typically these psychological attributes are elicited through consumer surveys (see Keller and Lehmann, 2006 for a discussion).
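The identification problem is easy to see in matrix form. In the sketch below (attribute values are hypothetical), brand attributes that never vary over time or across consumers are perfectly collinear with brand dummies, so the stacked design matrix for (β, γ) is rank-deficient:

```python
import numpy as np

# Three brands, each with fixed attributes (no variation over time/consumers).
# Stacking attribute columns next to brand dummies yields a rank-deficient
# design: x_j * beta and gamma_j cannot be separately identified.
X = np.array([[1.0, 0.2],   # brand 1 attributes
              [0.5, 0.9],   # brand 2 attributes
              [0.3, 0.4]])  # brand 3 attributes
T = 100                                  # purchase occasions
X_panel = np.tile(X, (T, 1))             # attributes repeat every period
D_panel = np.tile(np.eye(3), (T, 1))     # brand dummies
design = np.hstack([X_panel, D_panel])   # 2 attribute columns + 3 dummies

rank = np.linalg.matrix_rank(design)
n_params = design.shape[1]
print(rank, n_params)  # rank 3 < 5 parameters: beta and gamma not identified
```

With only three distinct rows in the stacked design, at most three linear combinations of the five parameters are identified, which is why Kamakura and Russell (1993) need additional restrictions.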

2 Brand equity and consumer demand

The estimated residual, γ̂_j, is typically interpreted as the brand equity or brand-related value. We can then derive a natural micro-econometric measure of the total economic value to the consumer associated with the brand equity of product j using the classic Hicksian compensating differential (Hicks, 1939). The Hicksian compensating differential consists of the monetary transfer to the consumer that makes her indifferent between the factual choice set in which brand j offers brand equity and a counterfactual choice set in which brand j no longer offers brand equity, all else equal. Researchers often use the term willingness-to-pay (WTP) since the compensating differential is equivalent to the maximum dollar amount a customer should objectively be willing to pay to retain brand j's equity in the choice set, all else equal.6 Suppose we assume quasi-linear utility, U(ψ_j, y − p_j) = ψ_j + θ(y − p_j), where θ is the marginal utility of income (see Chapter 1 of this volume for more discussion). When the random utility shocks are also assumed ε ∼ i.i.d. EV(0, 1), we get the multinomial logit demand system and the willingness-to-pay for brand j's equity is:

    WTP_j^brand = ∫ (1/θ) [ ln(1 + Σ_{k=1}^{J} exp(U(·; β, γ_k))) − ln(1 + Σ_{k=1}^{J} exp(U(·; β, γ_{k≠j}, γ_j = 0))) ] dF(Ω)    (3)

where F(Ω) is the distribution reflecting the researcher's statistical uncertainty over all the model parameters, Ω.7 Swait et al. (1993) propose a similar measure, termed "Equalization Price," which measures the compensating differential without taking into account the role of the random utility shocks, ε. Since the estimation of the brand intercepts typically requires a normalization, the exact interpretation of WTP_j^brand depends on the definition of the base choice against which the brand intercepts are measured, typically the "no purchase" choice, which is assumed to offer no brand equity. A more comprehensive set of survey-based, perceptual measures such as brand attitudes, consumer opinions, perceived fit, and reliability can also be incorporated into the analysis (e.g., Swait et al., 1993). In practice, some researchers use a simpler monetary measure of the brand equity based on the equivalent price reduction (e.g., Louviere and Johnson, 1988; Sonnier et al., 2007):

    BE_j = γ_j / θ.    (4)

6 The terminology WTP dates back at least to Trajtenberg (1989) and is used throughout the literature on the value of new products and the value of product features.
7 Many applications also allow the equilibrium prices to adjust in response to the demand shift associated with the removal of brand j's equity, γ_j = 0.




Holding all else constant, this price reduction ensures that the consumer has the same expected probability of buying brand j in the counterfactual scenario where γ_j = 0, i.e., where the intrinsic utility for the intangible, psychological components of brand j is absent. In practice, researchers typically plug point estimates of γ and θ into (4). Formally, one ought to use the correct expected incremental utility that takes into account the statistical uncertainty in the estimates:

    BE_j = ∫ (γ_j / θ) dF(Ω)    (5)

where, as before, F(Ω) is the distribution reflecting the researcher's statistical uncertainty over the model parameters, Ω. It is straightforward to show that (5) is identical to the willingness-to-pay for brand j's equity only in the extreme case where consumer utility is deterministic (i.e., there is no random utility component), brand j is the only available product, and the consumer is forced to purchase it.8 Another advantage of using WTP_j^brand as in (3) versus (5) to measure brand equity is that the former will vary depending on how we combine brand j with other product features and prices. In many demand studies, the intrinsic brand value, γ_j, is treated as a nuisance parameter that controls for all the intangible aspects of a product that are either difficult or impossible to measure objectively. In this regard, the brand intercepts improve the predictive power of the model. The non-parametric manner in which γ_j controls for brand preference is, however, both a blessing and a curse. Brand value research often interprets the estimated residual, γ̂_j, as the marketing-based component of brand equity (e.g., Park and Srinivasan, 1994), in contrast with the product-based component captured by the objective product attributes, x_j. An obvious limitation of this approach is that any omitted product characteristics will be loaded into γ̂_j.
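To illustrate the difference between the two measures, the snippet below evaluates Eq. (3) and Eq. (4) at hypothetical point estimates of a three-brand logit with an outside good (integrating over F(Ω) would simply average these quantities over parameter draws):

```python
import numpy as np

# Hypothetical point estimates for a 3-brand logit with quasi-linear utility
# U = x'beta + gamma_j - theta * p_j (illustrative numbers, not from data).
theta = 2.0                         # marginal utility of income
gamma = np.array([1.0, 0.5, 0.0])   # intrinsic brand values
base  = np.array([0.8, 1.2, 0.6])   # x_j'beta - theta * p_j for each brand

def inclusive_value(g):
    # log(1 + sum_k exp(U_k)), with the outside good utility normalized to 0
    return np.log1p(np.exp(base + g).sum())

# WTP for brand 1's equity, Eq. (3): compensating differential of setting gamma_1 = 0
g0 = gamma.copy(); g0[0] = 0.0
wtp_brand1 = (inclusive_value(gamma) - inclusive_value(g0)) / theta

# Simpler equivalent-price-reduction measure, Eq. (4)
be_brand1 = gamma[0] / theta

print(wtp_brand1, be_brand1)  # WTP < gamma_1/theta here: substitutes dampen the loss
```

Because close substitutes remain in the choice set, removing brand 1's equity costs the consumer far less than the full γ_1/θ, illustrating why (3) and (5) generally differ.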
So brand equity measures like (3) and (4) should probably be interpreted as noisy measures of the marketing-based component of brand equity. An additional limitation is that the model treats perceived brand equity as a static feature of the product without providing any insight into the formation of brand preferences. In Section 5.2, we discuss an alternative, informative theory of brand equity that assumes there is no intrinsic brand preference. Rather, the brand name conveys an informative signal used by the consumer to infer product quality through the brand's marketing and/or reputation.9 In Section 3, we extend our discussion to a dynamic setting in which the consumer's

8 Once we include random utility shocks into the model, BE_j is no longer a welfare measure.
9 As noted by Nelson (1974) and Sutton (1991), firms spend large amounts of money on seemingly uninformative advertising in categories that are essentially commodities. While Nelson (1974) has argued that this sort of advertising might signal product quality indirectly, the empirical evidence for his hypothesis is mixed at best. Caves and Greene (1996a, p. 50), after examining 196 product categories, conclude that "[t]hese results suggest that quality-signalling is not the function of most advertising of consumer goods"; Bagwell's (2007, p. 1748) review is only slightly more circumspect: "... the studies described here do not offer strong support for the hypothesis of a systematic positive relationship between advertising and product quality."


brand equity evolves over time through her past consumption and marketing experiences. A final concern is that the measures above fail to account for the supply side. If demand for a branded good is fundamentally altered or if a branded good is excluded from the market, then equilibrium prices would likely re-adjust on the supply side (along with other marketing decisions). In Section 4, we discuss firms' branding strategies. The "characteristics approach" to consumer brand value described above is the most common approach to deriving and measuring the economic value of a brand. Becker and Murphy (1993) proposed an alternative "complementary goods" theory of brand value whereby the market good and its brand are both complementary consumption goods in the sense of Hicks and Allen (1934). In this framework, brand equity would need to be defined through the complementarity between the consumption of the brand/branding and the consumption of the corresponding physical good. To the best of our knowledge, Kamenica et al. (2013) provide the only direct evidence for this theory of brands.10 They conduct randomized clinical trials to test whether direct-to-consumer advertising has a causal effect on a subject's physiological response to a drug. In particular, a branded antihistamine was found to be more effective when subjects were exposed to that brand's advertising as opposed to a competitor brand's advertising. Tests for brand effects as complementary goods (Becker and Murphy, 1993) and the specification of a demand system with complementary goods (e.g., Song and Chintagunta, 2007) are beyond the scope of this chapter.

2.2 Brand awareness, consideration, and consumer search

2.2.1 The consumer psychology view on awareness, consideration, and brand choice

The Lancasterian model described above takes the extreme view that the consumer has complete information about the set of available brands and their attributes at the time of purchase. In the psychology literature on consumer behavior, Lynch and Srull (1982) refer to this scenario as "stimulus-based choice." At the opposite extreme, the consumer uses a pure "memory-based choice" (Bettman, 1979), whereby all relevant choice information must be recalled from memory. As explained in Alba et al. (1991), in practice most brand purchase contexts will require at least some reliance on recalled information. Even when the purchase environment (e.g., the shelf display in a store) contains all the relevant brand and attribute information, the complexity

10 An indirect test of Becker and Murphy's (1993) theory exploits the Slutsky symmetry condition by testing whether a shift in demand for the physical good increases the consumption of the brand's advertising. Tuchman et al. (2015) use data that match household-level time-shifted television viewing on digital video recorders with in-store shopping behavior. They find that in-store promotions that increase a household's consumption of a brand cause an increase in the household's propensity to watch (i.e., not skip) that same brand's commercials.




of the task, the ease with which certain brands are noticed relative to others, and the consumer's time cost or effort can all lead to reliance on recalled information. In the brand choice literature, researchers studying choice under incomplete information distinguish between limited awareness and limited consideration. A consumer's brand awareness comprises the set of brands recalled from her memory. This set may be much broader than the subset of brands the consumer evaluates more seriously for choice (Campbell, 1969), the so-called "evoked set" (Howard and Sheth, 1969) or "consideration set" (Wright and Barbour, 1977). The concept of awareness precedes that of consideration: any brand associations that facilitate recall will, in turn, influence a brand's inclusion in the consideration set (Keller, 1993).

Awareness

The extent of awareness for a given brand in the market has been studied in a number of ways. Laurent et al. (1995) define the unaided awareness for a brand as the fraction of households who spontaneously recall a specific brand when asked about choice options in a category. A related measure, top-of-mind awareness, indicates the fraction of households who spontaneously recall that brand first when prompted. Aided awareness measures the fraction of households that recognize a specific brand name from a given list of brands in a category. Most studies find that a consumer's brand awareness within a product category is quite limited. Laurent et al. (1995) report that unaided brand awareness in a product category is 15-20%, even for brand names recognized by 75% of consumers once prompted. In addition to being limited, unaided brand awareness has also been reported to vary over time within households (Day and Pratt, 1971; Draganska and Klapper, 2011). The relevance of awareness in this chapter on branding stems from research showing that a consumer's ability to recall a specific product from memory is affected by the corresponding brand name. For instance, preferred brands tend to be recalled earlier than non-preferred brands (Axelrod, 1968; Nedungadi and Hutchinson, 1985). Further, as consumers accumulate more knowledge about a product category, they tend to structure their memory around brands (Bettman and Park, 1980), largely because, for consumer goods, most experiences are brand-based (e.g., advertising, in-store merchandizing, and consumption experience). Even factors as simple as lack of name recognition can block a brand from being recalled and, subsequently, from entering a consumer's consideration set (see the discussion in, e.g., Alba et al., 1991).
The branding literature has viewed brand awareness as a necessary but insufficient condition for brand consideration and choice, at least since Axelrod (1968). In a study of the German ground coffee market, Draganska and Klapper (2011) report that even in a heavily advertised category like coffee, the typical consumer spontaneously recalls only three brands from the total available set consisting of five major national brands and many fringe brands. Furthermore, Draganska and Klapper (2011) report that the set of recalled brands varies across respondents and accounts for a large part of the heterogeneity in choices (we deliberately avoid using the term “heterogeneity in preferences”).


Consumer psychologists assign distinct roles to brand awareness and brand preferences in brand choice. For instance, in lab experiments that manipulate the level of brand awareness and product quality, Hoyer and Brown (1990) find that subjects picked the "familiar" brand 77% of the time, even though the familiar brand was frequently not the one with the highest quality. Surprisingly, subjects were more likely to choose the high-quality alternative when none of the brands was "familiar." Nedungadi (1990) also finds that choice outcomes can be affected by factors that affect brand recall but not brand preference. Consumer expertise likely moderates these effects. For instance, Heilman et al. (2000) find that first-time consumers in a product category are more likely than experienced consumers to purchase familiar brands.

Consideration

A separate literature has explicitly studied consumers' brand consideration sets for choice. Even though a consumer may be aware of a number of brands, she may only consider a subset of them on any given purchase occasion (Narayana and Markin, 1975; Bettman and Park, 1980; Ratchford, 1980; Shugan, 1980). For an overview of the early literature on empirical consideration sets, see the discussion in Shocker et al. (1991). In an empirical study of brand choices, Hauser (1978) found that consumers' consideration sets explained 78% of the variance in their brand choices; only 22% was explained by preferences within consideration sets. Empirical researchers have typically found that consideration sets in brand choice settings range in size from only 2 to 8 alternatives (Bronnenberg et al., 2016; Hauser and Wernerfelt, 1990; Honka, 2014; Moorthy et al., 1997; Newman and Staelin, 1972; Punj and Staelin, 1983; Ratchford et al., 2007; Urban, 1975). These limited consideration sets are consistent with the psychological theory that individuals' ability to evaluate choices may be cognitively limited to a maximum of about seven items (Miller, 1956). In sum, consumer psychologists make a distinction between brand awareness, which is recalled from memory, and brand consideration, which reflects the consumer's deliberation process of narrowing down the set of options before making a brand choice. The literature has further documented strikingly limited degrees of awareness and consideration. An interesting direction for future research might consist of testing the extent to which the limited varieties purchased by households in most categories of consumer goods reflect a lack of awareness of all the available brands. One recent study of over 32 million US shoppers found that over a 52-week period ending in June 2013, even the most frequent shoppers purchased only 260 of the 35,372 stock-keeping units available in supermarkets, about 0.7%.
Across categories, this share varied from as low as 0.2% in Health & Beauty to as high as 1.7% in Dairy (Catalina Media, 2013). Moreover, awareness has been shown to influence brand choices independently of brand preferences. The literature has not yet studied whether consumers rationally plan their awareness by informing themselves strategically about brands or, alternatively, whether the awareness set is exogenous to consumer decision making.




A recent empirical literature has used data on consumers’ consideration sets to show that the assumption that consumers consider all the available brands in a market will likely result in biased estimates of brand preferences. In practice, if consumers are more likely to consider branded goods, which typically charge higher prices, a naive model of full consideration may generate a downward bias in the estimated price sensitivity (Honka, 2014). Similarly, a naive model that ignores the consideration stage may generate an upward bias in the degree of estimated preference heterogeneity (Dong et al., 2017).

2.2.2 Integrating awareness and consideration into the demand model

We now formalize the notions of awareness and consideration in our economic model of consumer demand. We build on the Lancasterian framework from Section 2.1 that assumed consumers were fully aware of all available brands and considered each variant for choice. Throughout this section, we maintain the assumption that all products in a category are perfect substitutes and, hence, that consumers make a pure discrete choice purchase decision. We assume that at the time of purchase from a commodity group, the consumer recalls the brand alternatives in the set S_a ⊆ S, where S is the full set of available products, but is unaware of the brand alternatives in its complement, S \ S_a. The consumer is uninformed about the availability and the characteristics of products in the complement of S_a and does not take this uncertainty into account when making a decision. In this section, we take a static view of the consumer that treats S_a as exogenous.11 In Section 2.2.3 below, we also allow for a consumer's awareness set, S_a, to be influenced by the endogenous branding and marketing activities of firms on the supply side. Conditional on her awareness set S_a, we assume the consumer's purchase decision is the outcome of a two-stage sequential process: (1) the search and evaluation stage, and (2) the choice stage.12 During the first stage, or search and evaluation stage, the consumer forms her consideration set, S_c ⊆ S_a, by evaluating ex ante which products to include in the consideration set so as to maximize her expected consumption utility net of search and evaluation costs (Shugan, 1980; Roberts, 1989; Roberts and Lattin, 1991).13 Formally,

    S_c = argmax_{S_c ⊆ S_a} { E[ max_{j ∈ S_c} v_j ] − C(S_c) }    (6)

11 In Section 3 below, we also offer a dynamic view of the consumer that considers her purchase history. In the multi-period setting, awareness can form endogenously through a consumer's past purchase experiences.
12 A broad literature has documented evidence that consumers use such two-step "consider-then-choose" decision-making (e.g., Payne, 1976, 1982).
13 While consumers generally use cost/benefit decision rules, they may rely on simpler heuristic approaches in situations with more complex decision tasks (e.g., Payne, 1976, 1982).


where v_j is the indirect utility from consuming alternative j, and C(S_c) is a product evaluation or search cost associated with assembling the consideration set S_c. Unlike the traditional Lancasterian model in which branding affected choices through preferences for the branded goods, we now consider the possibility that branding plays another, complementary, role. Let the incremental cost of gathering information for brand j be denoted by C_j. We allow the cost of gathering and/or interpreting information, C_j, to depend on branding. This effect of branding could reflect explicit factors at the point of purchase, such as shelf placement, that can aid consumers in processing information. It could also reflect branding efforts outside the store, like advertising, which may affect the consumer's ability to recall a specific brand from memory. The exact composition of the consideration set S_c depends on the consumer's search conduct. The traditional search literature assumes that consumers randomly sample information (e.g., prices) from a set of ex-ante identical sellers at a fixed and constant cost for each of the firms sampled (Stigler, 1961). Weitzman (1979) was the first to consider search over differentiated products, allowing consumers to prioritize search systematically for those products with the highest anticipated utility. If awareness or familiarity makes brands easier to search, then consumers prioritize their search across brands in an order determined by brand awareness, or "prominence" (Arbatskaya, 2007). As Armstrong and Zhou (2011) put it:

"In many circumstances, however, consumers consider options in a non-random manner and choose first to investigate those sellers or products which have high brand recognition, which are recommended by an intermediary, which are prominently displayed within a retail environment, which are known to have low price, or from which the consumer has purchased previously." (Armstrong and Zhou, 2011, p. F368)

Similarly, Erdem and Swait (2004) find that self-reported measures of brand credibility affect the likelihood that a brand enters a consumer's consideration set. Because of the presence of search and evaluation costs in the first stage of the choice process, Eq. (6), information gathering often ends before exhausting all options in S_a, so that S_c ⊂ S_a. During the second stage, or "choice stage," the consumer picks the alternative in her consideration set, k ∈ S_c, that yields the highest utility:

    k = argmax_{j ∈ S_c} v_j.    (7)

The decision in Eq. (7) is a reformulation of the discrete choice problem in Eq. (1) from Section 2.1, where all brands entered the choice problem. Models in the literature typically assume that search fully resolves the uncertainty about considered alternatives.14

14 In Section 3.4 below, we will formally distinguish between search and experience characteristics (e.g., Nelson, 1970). We will then allow for ex post uncertainty about experience characteristics at the time of




Additional literature, discussed in Sections 3.2 and 3.3 below, suggests that brand reputation, loyalty, and pioneering advantage can make a brand more likely to enter the awareness and consideration sets.
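To fix ideas, the two-stage "consider-then-choose" process of Eqs. (6) and (7) can be sketched numerically. The script below uses hypothetical utilities, search costs, and normally distributed match values (all numbers are illustrative assumptions, not estimates); the consideration set is formed by brute-force enumeration of Eq. (6), and the choice then follows Eq. (7):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Known utility components and per-brand search costs over the awareness set
# (hypothetical numbers). xi_j ~ N(0, sigma_j) is the unknown match value.
v_known = np.array([1.0, 0.9, 0.2])     # x_j'beta + gamma_j + eps_j
sigma   = np.array([0.5, 0.5, 0.5])
cost    = np.array([0.05, 0.05, 0.60])  # brand 3 is costly to evaluate

def expected_max(subset, ndraws=20000):
    # Monte Carlo estimate of E[ max_{j in Sc} (v_known_j + xi_j) ]
    idx = list(subset)
    draws = v_known[idx] + sigma[idx] * rng.standard_normal((ndraws, len(idx)))
    return draws.max(axis=1).mean()

# Stage 1 (Eq. 6): pick the consideration set maximizing E[max v] - C(Sc)
best_set, best_val = None, -np.inf
for r in range(1, 4):
    for subset in itertools.combinations(range(3), r):
        val = expected_max(subset) - cost[list(subset)].sum()
        if val > best_val:
            best_set, best_val = subset, val

# Stage 2 (Eq. 7): after search reveals xi, choose the best considered brand
xi = sigma * rng.standard_normal(3)
choice = max(best_set, key=lambda j: v_known[j] + xi[j])
print(best_set, choice)
```

With these numbers, the third brand's evaluation cost exceeds its option value, so it is excluded from consideration even though its match value might have been high, which is exactly how search costs shape observed choices independently of preferences.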

2.2.3 An econometric specification

To illustrate, we now modify the econometric modeling framework from Section 2.1. An early literature abstracted entirely from the formation of consideration sets, modeling them in reduced form instead (see Shocker et al., 1991). In this literature, the consideration set is treated as an additional random variable and the likelihood that a consumer chooses brand j is as follows:

    Pr(j; Ω) = Σ_{S_c ∈ P(S)} Pr(j | S_c; Ω) × Pr(S_c; Ω)    (8)

where Pr(j | S_c) is the brand choice probability conditional on the consideration set, Pr(S_c) is the probability of observing consideration set S_c ∈ P(S), and P(S) is the power set of S. If S_c = S and all brands are considered, then Pr(S) = 1 and Pr(j | S) is the choice probability from Section 2. In practice, consideration sets are unobserved and the likelihoods Pr(j | S_c) and Pr(S_c) are not separately identified without strong, and often ad hoc, functional form assumptions.15 While a literature has estimated models of consideration and choice without observing the consideration set, this approach is clearly prone to severe model misspecification concerns. Even when consideration sets are observed, standard discrete choice models like the logit and probit are probably not the correct reduced form for the conditional brand choices.16 In particular, the choice problem

    Pr(j | S_c) = Pr(v_j ≥ v_k, ∀k ∈ S_c)    (9)


selects on realizations of the random utility shocks in v_j for products j that were systematically considered. So the choice problem (9) cannot simply "integrate out" the random utility shocks under the usual i.i.d. assumptions to obtain the discrete choice probabilities, because S_c is also a function of the realizations of ε for searched brands. Mehta et al. (2003) and Joo (2018) use exclusion restrictions based on the assumption that in-store promotional variables, like display and feature, affect the consumer's information about products and not consumption utility and preferences: Pr(S_c; Z), where Z contains the variables excluded from the brand choices conditional on consideration.17 In addition, the model in Eq. (8) suffers from a curse of dimensionality, since the dimension of P(S) becomes unmanageable as the set of available products grows. To provide an illustrative model, at the consideration stage we assume that the consumer's indirect utility from purchase is additively separable in the factors that are known deterministically to the consumer and the factors about which she is still uncertain. Formally, as in the discrete choice problem of Eq. (1) above, assume the consumer's choice-specific indirect utilities v_j have a component x_j β + γ_j + ε_j that is known deterministically to the consumer (the econometrician only observes the distribution of ε, with E(ε) = 0). In addition, the indirect utilities contain a vector of unknown match values, ξ ∼ F(ξ), about which the consumer is uncertain; F(ξ) represents the consumer's beliefs about the unknown match values. Thus,

    v_j = x_j β + γ_j + ε_j + ξ_j,    j ∈ S_a.

14 (cont.) purchase. This uncertainty will be resolved slowly over time across repeated purchase and consumption experiences.
15 Recently, Abaluck and Adams (2017) show that symmetry in the cross-derivatives of choice probabilities only holds when the consumer considers all possible options. They propose to identify consideration probabilities from the violations of symmetry in the cross-derivatives.
16 For instance, Srinivasan et al. (2005) require the strong assumption that brand awareness is independent of brand preferences in order to retain the conventional logit functional form.
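As a concrete illustration of the mixture in Eq. (8), the short script below marginalizes logit choice probabilities over every possible consideration set. The utilities are hypothetical and Pr(S_c) is a toy independent-inclusion model (an assumption for illustration only); the 2^J subset enumeration makes the curse of dimensionality explicit:

```python
import itertools
import numpy as np

# Reduced-form mixture over consideration sets, Eq. (8): the marginal choice
# probability sums logit choice probabilities over every non-empty subset.
v = np.array([1.0, 0.5, 0.0])        # deterministic utilities (hypothetical)
phi = np.array([0.9, 0.7, 0.4])      # marginal inclusion probabilities (toy model)

J = len(v)
pr_choice = np.zeros(J)
total_mass = 0.0
for r in range(1, J + 1):
    for Sc in itertools.combinations(range(J), r):
        # Pr(Sc): each brand included independently; empty set excluded below
        p_set = np.prod([phi[j] if j in Sc else 1 - phi[j] for j in range(J)])
        total_mass += p_set
        # Pr(j | Sc): multinomial logit restricted to the considered brands
        expv = np.exp(v[list(Sc)])
        for pos, j in enumerate(Sc):
            pr_choice[j] += p_set * expv[pos] / expv.sum()

pr_choice /= total_mass            # renormalize for the excluded empty set
print(pr_choice.round(3), 2 ** J)  # 2^J subsets: the curse of dimensionality
```

Even with three products the sum runs over seven non-empty subsets; with 30 products it would run over roughly a billion, which is why unrestricted versions of Eq. (8) are infeasible in realistic categories.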


During the search stage, the consumer endogenously resolves her uncertainty ξ_j for a set of considered products, S_c ⊆ S_a. Products in the consideration set are selected based on (1) their respective option values (e.g., the variance of ξ_j), (2) known indirect utilities (x_j β + γ_j + ε_j), and (3) search costs (C_j). In contrast to the Lancasterian approach above, the deterministic component of preferences only partially determines the chosen product. Search costs and the option value from additional search and evaluation also influence the overall considered set and, therefore, the chosen product alternative. In our illustrative model, we assume that the total search and evaluation cost associated with a consideration set S_c is C(S_c) = Σ_{j ∈ S_c} c_j, where C_j ≡ c_j is independent of c_k (∀k ≠ j), and c_j ≥ 0.18 In turn, the search costs, c_j, reflect a consumer's past experiences with the available products in the category or a firm's advertising strategy.19 For instance, we could assume:

    c_j = c_0 + G_{1j}(H, A; Ω_1),


where the function G_{1j} captures the effect of state vectors summarizing the consumer's brand purchase history, H, and her past exposure to advertising, A, and Ω_1 is a vector of parameters. Thus, shopping history, purchase experience, and advertising exposure

17 At least since Guadagni and Little (1983), the brand choice literature has found that these promotional variables affect choices. The exclusion restriction assumes, logically, that the effect reflects the ease of search and product evaluation as opposed to preference. Indeed, it seems unlikely that consumers derive consumption utility from an in-store display. However, to the best of our knowledge, this point has not been tested empirically.
18 This specification treats search and evaluation costs as a fixed parameter. Alternatively, the rational inattention literature on consumer choice (e.g., Matejka and McKay, 2015) uses Shannon entropy to model the costs associated with the precision of the signals a consumer endogenously collects to learn about product values.
19 To the extent that advertising influences brand knowledge and psychological associations, it could also facilitate recall.




influence her costs, c_j, of gathering and evaluating information about the choice alternatives. For instance, in a case study of residential plumbers, McDevitt (2014) finds that low-quality firms systematically use easier-to-find brand names that start with an A or a number and tend to be located at the top of directories. Furthermore, in a study of consumers' retail bank choices, Honka et al. (2017) find that advertising has a much larger effect on awareness and consideration than on consumers' choices from their respective consideration sets. The framework herein points towards a major limitation of the extant empirical literature and an opportunity for future research. While some progress has been made in the collection of consideration set data, a consumer's awareness set, S_a, is not observed in typical choice datasets. Consequently, the consideration papers derived from search theory generally assume that consumers are aware of all the product alternatives and search over an i.i.d. match value. This assumption is at odds with laboratory studies conducted by consumer psychologists that question the plausibility of "full awareness" and document evidence of very limited consumer brand awareness, even at the point of purchase. Econometrically, the distinction between consideration and awareness offers a potential direction for future research, along with a push to integrate research on memory into our demand models. If consumers only search and, thus, only consider brands they can recall from memory (i.e., brands in their awareness set), then awareness and marketing investments that stimulate awareness may create barriers to entry for new products.

2.2.4 Consideration and brand valuation

The possibility of a brand effect in the pre-purchase search and product evaluation process raises concerns about the measurement of brand equity measures like those in Section 2.1 above. Standard discrete choice models that ignore the search and consideration aspects of demand will load the entire brand effect, including that of recall and search, into preferences for the brand, γ_j, defining brand equity as BE_j = γ_j / θ. The omission of the consideration stage could bias estimates of brand value. Consider the illustrative case where the consumer has homogeneous ex-ante beliefs about her indirect utility for each of the products in her awareness set, i.e., E(v_j) = v, ∀j ∈ S_a, but where the search costs, c_j, to resolve the match value ξ_j are lower for branded than for unbranded alternatives among the products in S_a. In this case, the consumer's consideration set and observed brand choices would systematically contain branded products over unbranded ones; not because of higher utility, v_j, but because of lower search costs, c_j. The omission of the consideration stage could therefore confound search costs and brand preferences. The extent of this confound would be exacerbated by the number of alternatives available and/or the magnitude of search costs. To the extent that search stops before any unbranded alternatives are discovered and considered, a traditional Lancasterian model would generate strong estimated brand preferences and potentially high estimates of brand equity. Substantively, this scenario could lead to the conclusion that brand value stems from preferences, as opposed to from the ease of search.
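A small Monte Carlo makes the confound concrete. In the simulation below (all parameters hypothetical), two goods have identical true utility, but with some probability the consumer's search stops at the branded good before the unbranded one is ever evaluated; a naive full-consideration logit then inverts the branded good's inflated share into a spurious positive brand intercept:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two alternatives with IDENTICAL true utility (gamma = 0 for both). With
# probability q, the consumer's search stops at the branded good before the
# unbranded one is ever evaluated (lower search cost for the brand).
q, n = 0.4, 200000
stops_early = rng.random(n) < q
eps = rng.gumbel(size=(n, 2))          # i.i.d. EV(0,1) taste shocks
full_choice = eps.argmax(axis=1)       # choice when both are considered
choice = np.where(stops_early, 0, full_choice)   # 0 = branded good

share_branded = (choice == 0).mean()
# A naive full-consideration binary logit inverts the observed share into a
# brand "preference": gamma_hat = log(s / (1 - s)), spuriously positive here.
gamma_hat = np.log(share_branded / (1 - share_branded))
print(round(share_branded, 3), round(gamma_hat, 3))
```

Here the branded good's share is roughly q + (1 − q)/2 ≈ 0.7, so the naive model attributes a sizable positive γ̂ to a brand whose true intrinsic value is zero, illustrating the bias discussed above.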

3 Consumer brand loyalty

3.1 A general model of brand loyalty

In the previous section, we took a static perspective on brand choice that treated brand equity as a persistent residual (or "fixed effect") in a characteristics specification of the economic model of consumer demand. However, the model abstracted from the manner in which the brand preference was formed. If brand preference is merely a nuisance or control, this may be sufficient for predicting demand. However, as we show below, the dynamics associated with the formation of brand preferences may be important for understanding product differentiation and the foundations of market structure and concentration. In this section, we discuss various dynamic theories of the formation of brand preferences. Consumer psychologists have studied how a consumer develops a brand preference through positive associations between the brand and the consumption benefits of the underlying product. Such associative learning could arise, for instance, through signals whereby the consumer learns that the brand predicts a positive consumption experience. Alternatively, under evaluative conditioning, the consumer forms a positive preference for a brand through repeated co-occurrences with positive stimuli, like good mood, affect, or a popular celebrity. In the same vein, a consumer may learn about a brand through her memory of positive experiences with similar products. We refer the interested reader to Van Osselaer (2008) for a survey of the consumer psychology literature on consumer learning processes. This chapter will not discuss the deeper psychological mechanisms through which preferences are formed. Instead, we focus on how different sources of brand preference formation create dependence on past choices, or state dependence, in consumer demand.
Empirically, state dependence can lead to inertia in a consumer's observed sequence of brand choices: consumers have a higher probability of choosing products that they previously purchased. Brand choice inertia is one of the oldest and most widely-studied empirical phenomena in the marketing literature (e.g., Brown, 1952, 1953), as it has typically been interpreted as "brand loyalty." Below we survey the empirical evidence for inertia in consumer brand choices and discuss the econometric challenges associated with disentangling spurious sources of inertia from genuine structural state dependence, such as loyalty. We then discuss several consumer theoretic mechanisms that can generate brand choice inertia as a form of structural state dependence. To formalize our discussion of the empirical literature, we consider the choices of a household, h, over brands, j, at time, t. We use $X_{ht}$ to denote the contemporaneous factors, such as product characteristics and marketing considerations like prices, promotions, and shelf space. We use the state vector $H_{ht}$ to denote a consumer's historic brand experiences. We include these state variables, $X_{ht}$ and $H_{ht}$, in consumer h's indirect utility for brand j on date t:

$$v^h_{jt} = \mu_j(X_{ht}; \Theta^h) + F_j(H_{ht}; \Lambda^h), \quad j = 1, \dots, J \qquad (12)$$




CHAPTER 6 The economics of brands and branding

where we now decompose the consumer's brand equity into the static component discussed in the last section, $\mu_j(X_{ht}; \Theta^h)$, and the consumer's past experiences, $F_j(H_{ht}; \Lambda^h)$, comprising a stock of historically formed brand "capital." The vectors $\Theta^h$ and $\Lambda^h$ are parameters to be estimated. Theorists have analyzed various mechanisms through which current willingness to pay for brands reflects past brand experiences. In the subsections below, we explore several formulations of the brand capital stock, $F_j(H_{ht}; \Lambda^h)$, such as switching costs (e.g., Farrell and Klemperer, 2007), advertising and branding goodwill (e.g., Doraszelski and Markovich, 2007; Schmalensee, 1983), evolving quality beliefs through learning (Schmalensee, 1982), habit formation (e.g., Becker and Murphy, 1988; Pollak, 1970), and peer influence (e.g., Ellison and Fudenberg, 1995).

3.2 Evidence of brand choice inertia

The empirical analysis of brand loyalty, or inertia in brand choice, has been one of the central themes of quantitative marketing research on brand choices. Most of the literature has focused on short-term forms of persistence in brand choices over time horizons of no more than one or two years. Early work by Brown (1952, 1953) exploited household-level diary purchase panel data to document the high incidence of spells during which a household repeat-purchased the same brand over time. Such persistent repeat-purchase of the same brand has subsequently been detected across a wide array of industries, including those dominated by sellers with products differentiated mainly by brand names rather than objective features. Empirical generalizations across a broad array of CPG categories have found low rates of household brand switching (Dekimpe et al., 1997) and high rates of expenditure concentration, with typically over 50% of spending allocated to the most preferred brand in a category (Hansen and Singh, 2015). Similar patterns of inertia in choices have been documented in other industries, such as insurance (Handel, 2013; Honka, 2014), broadband services (Liu et al., 2010), cellular services (Grubb and Osborne, 2015), and financial services (Allen et al., 2016).

While early work interpreted short-term brand re-purchase spells as evidence of loyalty, the mere incidence of repeat-buying need not imply inertia per se. A consumer with a strong preference for Coca-Cola has a high probability of repeat-purchasing Coca-Cola over time, even if her shopping behavior is memoryless and static. A test for inertia consists of testing for non-zero-order behavior in a consumer's choice sequence. Early work tested for higher-order behavior using the within-household variation in choices, often with a non-parametric analysis of the observed runs[20] within a given consumer's choice sequence (e.g., Frank, 1962; Massy, 1966; Bass et al., 1984).
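The logic of these runs-based tests can be illustrated with a small Monte Carlo sketch: under the zero-order null, an i.i.d. choice sequence with the household's brand frequencies should not produce markedly fewer runs than the observed sequence. The example sequence, function names, and simulation design below are illustrative assumptions, not a reconstruction of any specific study's procedure.

```python
import random
from itertools import groupby

def n_runs(seq):
    """Count runs: maximal spells of consecutive purchases of the same brand."""
    return sum(1 for _ in groupby(seq))

def runs_test(choices, n_sim=10_000, seed=0):
    """Monte Carlo test of the zero-order null: simulate i.i.d. sequences that
    resample the household's own purchases, and compute how often they show as
    few runs as the observed sequence (fewer runs = more inertia)."""
    rng = random.Random(seed)
    observed = n_runs(choices)
    pool = list(choices)
    extreme = 0
    for _ in range(n_sim):
        sim = [rng.choice(pool) for _ in range(len(pool))]
        if n_runs(sim) <= observed:
            extreme += 1
    return observed, extreme / n_sim  # (observed runs, one-sided p-value)

# A household with long spells of the same brand: 3 runs in 15 trips,
# far fewer than a memoryless (zero-order) chooser would typically produce.
obs, p = runs_test(list("AAAAABBBBBAAAAA"))
```

Under the null, a 15-trip sequence with these brand frequencies averages roughly seven runs, so three runs is strong evidence of non-zero-order behavior; the power problem noted below arises because real household purchase strings are often much shorter than even this toy example.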
Unfortunately, short sample sizes typically limited the power of these within-household tests, and the findings were typically mixed or inconclusive, although early work often interpreted a failure to reject the null hypothesis of a zero-order choice process as evidence against loyalty. Alternative testing approaches that pooled choice sequences across consumers ran into the well-known identification problem of distinguishing between choice inertia and heterogeneity in consumer tastes (e.g., Massy, 1966; Heckman, 1981). More recent structural approaches have applied non-linear panel methods to test for choice inertia while controlling for heterogeneity between consumers using detailed consumer shopping panels (Roy et al., 1996; Keane, 1997; Seetharaman et al., 1999; Shum, 2004; Dubé et al., 2010; Osborne, 2011).[21] This literature has documented surprisingly high levels of inertia in brand choices. For instance, Dubé et al. (2010) find a substantial decline in the predictive fit of a choice model when, all else equal, the exact sequence of a consumer's purchases is randomized. This evidence confirms that the observed sequence of choices contains information for predicting demand. We discuss these methods further below in Section 3.3.

[20] A run is broadly defined as a sequence of repeat-purchases of the same brand. Typically, researchers look at pairs of adjacent shopping trips during which the same brand was purchased.

Patterns of brand choice persistence have also been measured over much longer time horizons, spanning decades or even an individual's lifetime. For instance, Guest (1955) surveyed 813 school children on their preferred brands in early 1941. Twelve years later, in the spring of 1953, he repeated the same brand preference survey among the 20% of original respondents that he was able to contact. Across 16 product categories, a respondent indicated the same preferred brand in both waves in 39% of the cases. In 1961, a third wave of the same survey continued to find the same preferred brand in 26% of cases. These survey results suggest that brand preferences developed during childhood partially persist into adulthood. However, "obviously, one cannot simply assume that what is learned during childhood somehow 'transfers intact' to adult life" (Ward, 1974). Returning to our model in Eq.
(12), under this extreme view, a consumer's preferences throughout her lifetime are entirely driven by $F_j(H_{h0}; \Lambda^h)$, where $H_{h0}$ represents her initial experiences in life, and $\mu_j(X_{ht}; \Theta^h) = 0$. A proponent of this view, Berkman et al. (1997, pp. 422-423), suggests that preferences may be inter-generational: "[i]f Tide laundry detergent is the family favorite, this preference is easily passed on to the next generation. The same can be said for brands of toothpaste, running shoes, golf clubs, preferred restaurants, and favorite stores." The literature on consumer socialization research has studied mechanisms through which adult brand preferences are formed early in life during childhood (Moschis and Moore, 1979), especially through inter-generational transfer and parental influence (Ward, 1974; Moschis, 1985; Carlson et al., 1990; Childers and Rao, 1992; Moore et al., 2002) and peer influence (Reisman and Roseborough, 1955; Peter and Olson, 1996). Anderson et al. (2015) document a strong correlation in the automobile brand preferences of parents and their adult children. Sudhir and Tewari (2015) use a twenty-year survey panel of individual Chinese consumers and find that growing up in a region that experienced rapid economic growth during one's adolescence is correlated with consumption of non-traditional "aspirational" goods and

[21] These methods also control for other causal factors, such as prices and point-of-purchase marketing, that could potentially confound evidence of inertia.




brands during adulthood.[22] Similarly, having a birth year in 1962 or 1978 is a very strong predictor of whether a male Facebook user "likes" the New York Mets in the mid-2000s, implying the user was seven to eight years old when the Mets won a World Series (in 1969 and 1986) – an age at which team preferences are typically formed (Stephens-Davidowitz, 2017). Bronnenberg et al. (2012) match current and historic brand market share data across US cities.[23] These data confirm (1) that consumer brands had very different shares across regions in the 1950s and 1960s, and (2) that the local market leaders of the 1950s and 1960s remain dominant in their respective markets today.

In practice, decades-long panels are difficult to maintain and rarely available for research purposes, so within-household shopping panels are typically too short to learn about the formation of preferences. Instead, Bronnenberg et al. (2012) surveyed over 40,000 US households to learn the migration history of each household's primary shopper, including her birth market, year of move, and her age. They exploit the historic migration behavior of households and the long-term regional differences in brand preference to study the long-term formation of brand preference and loyalty. Studying the two top brands across 238 product categories, Bronnenberg et al. (2012) document two striking regularities. First, immediately after a migrant family moves, 60% of the difference in brand shares between the state of origin and the current state of residence is eliminated. This finding holds both within and between households, suggesting that a significant portion of brand preferences is determined by the local choice environment. Second, the remaining 40% of the preference gap is very persistent, with migrants exhibiting statistically significant differences in brand preference relative to non-migrants even 50 years after moving. Collapsing the data by age cohorts, Bronnenberg et al. (2012) find that "migrants who moved during childhood have relative shares close to those of non-migrants in their current states, while those who move later look closer to non-migrants in their birth states." This finding is consistent with the brand capital stock theory whereby older migrants, having accumulated more brand capital in their birth state, should exhibit more inertia in brand choice.[24] Even migrants that moved before the age of 6 exhibit some persistence in the local preference from the birth location, suggesting a role for some inter-generational transfer of brand preferences. The authors conclude that "since the stock of past experiences has remained constant across the move, while the supply-side environment has changed, we infer that approximately 40 percent of

[22] These aspirational goods consist primarily of western brands consumed socially.

[23] The current brand shares were collected through AC Nielsen's scanner data. The historic brand shares were obtained from the Consolidated Consumer Analysis (CCA) database, collected by a group of participating newspapers from 1948 until 1968 in their respective markets. The CCA volumes report the fraction of households who state that they buy a given brand in a given year.

[24] An alternative hypothesis is that the aging process makes working memory decline more than long-term memory (Carpenter and Yoon, 2011), as does the processing of information. Both aging effects favor relying on fewer considered options (John and Cole, 1986) and engaging in fewer product comparisons (Lambert-Pandraud et al., 2005). These factors are thought to contribute to persistence, or at least less flexibility, of purchasing patterns among aging consumers (see also Drolet et al., 2010).


the geographic variation in market shares is attributable to persistent brand preferences, with the rest driven by contemporaneous supply-side variables." In terms of our model in Eq. (12), approximately 40% of consumers' expected conditional indirect utility derives from $F_j(H_{ht}; \Lambda^h)$ and 60% from $\mu_j(X_{ht}; \Theta^h)$. Consistent with these findings, Atkin (2013) reports a similar long-term habit formation for nutrient sources from different foods. Bronnenberg et al. (2012) formulate a simple model of habit formation (e.g., Pollak, 1970; Becker and Murphy, 1988) in which individual households' brand choices depend on current marketing and prices, as well as on a stock of past consumption experiences. Assuming (1) that consumers did not move across state lines motivated by their preferences for CPG brands and (2) that a brand's past local market share is on average equal to its share today, they determine that the effects of past consumption are highly persistent and depreciate at a rate of only 2.5% per year. Thus, they find that the half-life of brand capital is 26.5 years.

In sum, a large body of empirical work has documented patterns of persistence in brand preferences and choices. This persistence has been documented both at a high frequency, from "shopping trip to shopping trip," and at a much lower frequency, spanning decades and even entire lifetimes. If consumers do indeed form strong attachments to brands, then understanding the mechanisms through which these attachments are formed will likely point to some of the important drivers of the industrial organization of consumer goods markets.

3.3 Brand choice inertia, switching costs, and loyalty

Switching costs constitute one of the simplest theories of brand loyalty: "A product exhibits classic switching costs if a buyer will purchase it repeatedly and find it costly to switch from one seller to another" (Klemperer, 2006).

Switching costs can be financial, such as the early termination fee for a mobile phone service contract; temporal, such as the time required to learn how to use a new product; or psychological, such as the cognitive hassle of changing one's habit.[25] Switching costs introduce frictions that can deter a consumer from switching to different brands and, hence, can lead to repeat-purchase behavior. In the extreme case where switching costs are infinite, a consumer's initial choice would determine her entire future brand choice sequence, and the impact of $\mu(X_{ht}; \Theta^h)$ would be zero. Consequently, switching costs can create brand loyalty even in the absence of any brand value other than the feature of the brand name identifying a specific supplier. This behavior points to a simple theory of branding whereby the identifying features of a supplier's product (i.e., the "mark") can be sufficient to create loyalty in

[25] At least since Mittelstaedt (1969), consumer psychologists have studied the role of psychological switching costs in explaining repeat-purchase and inertia in brand choice. For an extensive review of this literature, see Muthukrishnan (2015).




consumer shopping behavior, as long as consumers form shopping habits. This type of loyalty is also detectable in shorter panels spanning one or two years. The empirical brand choice literature typically allows for a brand switching cost to influence purchase decisions in consumer goods categories (e.g., Jeuland, 1979; Guadagni and Little, 1983; Jones and Landwehr, 1988; Roy et al., 1996; Keane, 1997; Seetharaman et al., 1999; Shum, 2004; Osborne, 2008; Dubé et al., 2010). Suppose we define household h's indirect utility net of the switching cost to be $\mu^h_j(X_t) = \gamma^h_j + \theta^h x_{jt} + \varepsilon^h_{jt}$. Following the convention in the brand-choice literature, we assume that consumers obtain a utility premium from repeat-buying the brand chosen previously: $F_j(H^h_t; \Lambda^h) = \lambda^h I\{H^h_t = j\}$, where $H^h_t \in \{1, \dots, J\}$ is household h's loyalty state and $I\{H^h_t = j\}$ indicates whether the previously purchased brand was brand j. The conditional indirect utility on trip t is then

$$v^h_{jt} = \mu^h_j(X_{ht}; \Theta^h) + F^h_j(H^h_t; \Lambda^h) = \gamma^h_j + \theta^h x_{jt} + \lambda^h I\{H^h_t = j\} + \varepsilon^h_{jt}, \quad j = 1, \dots, J \qquad (13)$$

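To see how the switching-cost premium in Eq. (13) translates into measurable inertia, the following sketch simulates one household's choices from a multinomial logit in which the previously purchased brand receives a utility bump lam. All parameter values are made up for illustration; with identical brand intercepts, any repeat-purchase rate above 1/3 is generated purely by the switching cost.

```python
import math
import random

def logit_choice(v, rng):
    """Draw one choice from a multinomial logit with deterministic utilities v."""
    exp_v = [math.exp(x) for x in v]
    r = rng.random() * sum(exp_v)
    acc = 0.0
    for j, e in enumerate(exp_v):
        acc += e
        if r <= acc:
            return j
    return len(v) - 1  # guard against floating-point edge cases

def repeat_rate(gamma, lam, T=20_000, seed=1):
    """Simulate the loyalty model of Eq. (13) for one household with brand
    intercepts `gamma` and switching-cost premium `lam`; return the share of
    trips on which the previously purchased brand is bought again."""
    rng = random.Random(seed)
    state = 0  # initial loyalty state
    repeats = 0
    for _ in range(T):
        v = [g + (lam if j == state else 0.0) for j, g in enumerate(gamma)]
        choice = logit_choice(v, rng)
        repeats += choice == state
        state = choice
    return repeats / T

# Illustrative numbers, not estimates: three identical brands.
base = repeat_rate([0.0, 0.0, 0.0], lam=0.0)
loyal = repeat_rate([0.0, 0.0, 0.0], lam=1.0)
```

With three identical brands, lam = 1 raises the repeat-purchase rate from about one-third to roughly e/(e + 2) ≈ 0.58, inertia that would look like "loyalty" even though no brand is intrinsically preferred.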
This formulation nests the basic static model from Section 2.1 with the baseline brand utility, $\gamma^h_j$. The additional parameter $\lambda^h$ allows for inertia in brand choices across time. As discussed above, the structural interpretation of $\lambda^h$ is typically analogous to a psychological switching cost. The following null hypothesis constitutes a test for choice inertia:

$$H_0: E[\lambda^h] = 0 \qquad (14)$$

where $E[\lambda^h] > 0$ implies positive inertia in brand choice (such as loyalty) and $E[\lambda^h] < 0$ implies negative inertia (such as variety-seeking). In practice, the researcher can adopt a more general specification that relaxes the linearity assumption and allows for higher-order choice behavior with a loyalty state that reflects the entire choice history. For instance, Keane (1997) and Guadagni and Little (1983) use a stock variable constructed as an exponentially smoothed, weighted average of past choices. While most studies of brand loyalty assume consumers are myopic, a rational forward-looking consumer would plan her future loyalty, much like in the rational addiction models fit to tobacco products (e.g., Becker and Murphy, 1988; Gordon and Sun, 2015). Since most of the brand choice literature pools choice sequences across households, a concern is that the state dependence captured by $\lambda^h$ may be spuriously identified from unobserved heterogeneity in tastes between households (Heckman, 1981).[26] Even

[26] When the researcher does not observe consumers' initial choices, an "initial conditions" bias can also arise from the endogeneity in consumers' initial (observed to the researcher) states. Handel (2013) avoids this problem in his analysis of health plan choices. He exploits an intervention by an employer that changed the set of available health plans and forced employees to make a new choice from this changed menu.


after rich controls for persistent unobserved heterogeneity and serial dependence in $\varepsilon^h_{jt}$, Keane (1997) and Dubé et al. (2010) find statistically and economically significant state dependence in choices. However, the magnitude of the state dependence, $\lambda$, is considerably smaller after the inclusion of controls for heterogeneity, falling on average by more than 50%. For instance, in a case study of refrigerated orange juice purchases, Dubé et al. (2010) estimate switching costs that, on average, are 21% of the magnitude of prices. Without controls for heterogeneity, these costs are inferred to be more than double.[27]

In addition to controlling for heterogeneity, Dubé et al. (2010) also test between several alternative sources of structural state dependence, such as price search and learning. Intuitively, state dependence through consumer learning should dissipate over time as consumers learn through their purchase and consumption decisions. In contrast, switching costs create a persistent form of inertia in choices. We discuss the mechanism through which product quality uncertainty and consumer learning can generate inertia in consumer brand choices below in Section 3.4. Dubé et al. (2010) conclude that the inferred brand switching costs are robust to these alternative specifications and that the estimated values of $\lambda$ reflect true brand loyalty.[28] Similarly, imperfect information about prices or availability could also create state dependence in the purchase of a known brand. In-store merchandizing, such as a display, should offset the cost of determining a brand's price, in which case inertia for a given brand would be offset by a display for a competing brand. Dubé et al. (2010) again find that their estimates of switching costs are robust to controls for search costs.[29]

Interestingly, Keane (1997) and Dubé et al. (2010) estimate economically large and heterogeneous brand intercepts, $\gamma_j$. On average, the persistent differences in households' brand tastes appear to be much more predictive of choices than the loyalty arising through $\lambda$. In a case study of 16 oz tubs of margarine purchases, Dubé et al. (2010) find the importance weights for loyalty ($\lambda$), price ($\theta$), and brand ($\gamma$) are 6.4%, 53.6%, and 40%, respectively.[30] Therefore, switching costs alone do not seem to explain the persistent consumer brand preferences typically inferred through CPG shopping panels. In sum, while there is a component of consumer switching that reflects dynamics related to loyalty, a large portion of consumers' brand choices seems to reflect a far more persistent brand taste that is invariant over the time horizons of 1-2 years typically used in the literature.

A limitation of this literature is the lack of a deeper test of the underlying mechanism creating this persistence in choices. As early as Brown (1952, p. 256), scholars have questioned whether inertia in brand choice "is a 'brand' loyalty rather than a store, price or convenience loyalty." The subsequent access to point-of-sale data allows researchers to control for prices and other causal factors at the point of purchase. But, within a store, a buying habit could merely reflect loyalty to a position on the shelf or another incidental factor that happens to be associated with a specific brand. In addition to unobserved sources of loyalty at the point of sale, the persistent brand tastes may also contain additional information about longer-term forms of brand loyalty, such as an evolving brand capital stock (e.g., Bronnenberg et al., 2012), that are not detectable over one- or two-year horizons. While these distinctions may not matter for predicting choices over a short-run horizon, they have important implications for a firm's willingness to invest in branding or consumer-related marketing to cultivate the shopping inertia.

[27] Dubé et al. (2018) find even larger magnitudes of switching costs when they control more formally for endogenous initial conditions (i.e., the initial loyalty state for each household).

[28] Using a structural model of consumer learning, Osborne (2011) also finds empirical evidence for both learning and switching costs.

[29] Using a structural model of search and switching costs, Honka (2014) also finds empirical evidence for both search and switching costs. However, in her case study of auto insurance, search costs are found to have a larger effect on choices than switching costs.

[30] Following the convention in the literature on conjoint analysis, an importance weight approximately describes the percentage of utility deriving from a given component. The model in Eq. (13) has three components to utility: brand, marketing variables, and loyalty, with respective part-worths (or marginal utilities) $PW^{brand}(brand = j) = \gamma_j - \min(0, \{\gamma_k\}_{k=1}^J)$, $PW^{marketing}(X_{jt} = x) = \alpha(x - \min(x))$, and $PW^{loyalty}(s_{jt} = j) = \lambda$. Dubé et al. (2010) then assign an importance weight to each of these components, scaled to sum to one, as follows: $IW_{brand} = \frac{\max PW^{brand}}{\max PW^{brand} + \max PW^{price} + \max PW^{loyalty}}$, $IW_{price} = \frac{\max PW^{price}}{\max PW^{brand} + \max PW^{price} + \max PW^{loyalty}}$, and $IW_{loyalty} = \frac{\max PW^{loyalty}}{\max PW^{brand} + \max PW^{price} + \max PW^{loyalty}}$.
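The importance-weight calculation behind this comparison (footnote 30) simply scales each utility component's maximal part-worth so that the three weights sum to one. A minimal sketch, with hypothetical part-worth ranges chosen only so that the weights reproduce the reported 40%/53.6%/6.4% split for margarine:

```python
def importance_weights(pw_brand, pw_marketing, pw_loyalty):
    """Conjoint-style importance weights: take the maximal part-worth of each
    utility component (brand, marketing/price, loyalty) and scale the maxima
    to sum to one."""
    maxima = [max(pw_brand), max(pw_marketing), max(pw_loyalty)]
    total = sum(maxima)
    return [m / total for m in maxima]

# Hypothetical part-worths (not estimates from the study): the ranges are
# picked so the resulting weights match the reported 40/53.6/6.4 split.
iw = importance_weights(
    pw_brand=[0.0, 1.5, 2.0],   # range of brand intercepts
    pw_marketing=[0.0, 2.68],   # range of the price part-worth
    pw_loyalty=[0.0, 0.32],     # loyalty premium
)  # -> [0.4, 0.536, 0.064]
```

The design choice worth noting is that only the *range* of each component matters: a large brand intercept spread dominates the weights even when the loyalty premium is statistically significant, which is exactly the pattern reported above.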

3.4 Learning from experience

A more complex form of state dependence in brand choices arises when a consumer faces uncertainty about aspects of product quality that are associated with the brand and that are learned over time. Following Nelson (1970), we modify the characteristics approach to demand by distinguishing between "search characteristics," which can be determined prior to purchase, and "experience characteristics," which are determined after the purchase through trial and consumption. The classification of brand as a search versus an experience characteristic is complicated. On the one hand, the brand name as an identifying mark acts like a search characteristic, since it is likely verifiable prior to purchase through its presence on the packaging. On the other hand, intangible aspects of the product that are associated with the brand constitute experience characteristics that are learned over time through consumption (Meyer and Sathi, 1985) and informative advertising (Keller, 2012, Chapter 2). This view is consistent with the product-based associations that constitute part of a consumer's brand knowledge (Keller, 2012, Chapter 2). We focus herein on rational models of consumers using Bayes' rule to update their beliefs about products over time and to learn.[31]

[31] The Bayesian learning model predicts that consumers eventually become fully informed about a brand. However, lab evidence suggests that "blocking" may prevent a consumer from learning about objective product characteristics. If a consumer initially learns to use the brand name to predict an outcome (e.g.


Suppose the consumer is uncertain about the intrinsic brand quality in any period t. At the start of each period, the consumer has a prior belief about brand quality, $f_{jt}(\gamma)$. At the end of each period, she potentially receives a costless, exogenous, unbiased, and noisy signal about brand j, $s_{jt} \sim g_j(\cdot|\gamma)$. For example, the signal might reflect a free sample, word-of-mouth (Roberts and Urban, 1988), observational learning from peers' choices (Zhang, 2010), or an advertising message (Erdem and Keane, 1996). The consumer then uses the signal to update her beliefs about the brand's quality using Bayes' rule: $f_{j(t+1)}(\gamma) \equiv f_{jt}(\gamma|s_{jt}) \propto g_j(s_{jt}|\gamma) f_{jt}(\gamma)$. In this case, the state variable tracking consumer brand experiences, $H_t$, consists of her prior beliefs about each of the brand qualities, $\gamma_j$: $H_t = (f_{1t}(\gamma), \dots, f_{Jt}(\gamma))$.[32] We use a discrete-choice formulation of demand, as in Section 2.1. If the consumer's brand choice is made prior to receiving the signal, her expected indirect utility from choosing brand j at time t is

$$E[u_{jt}|H_t; \theta] = \int \left(\gamma - \rho\gamma^2 + x_j'\beta - \alpha p_{jt} + \varepsilon_{jt}\right) g_j(s|\gamma)\, f_{jt}(\gamma)\, d(s, \gamma) \qquad (15)$$

where $\rho > 0$ captures risk aversion. As discussed in Erdem and Keane (1996) and Crawford and Shum (2005), risk aversion is essential for predicting inertia in a consumer's choices of familiar brands, since a consumer may be reluctant to purchase a new brand with uncertain quality. The vector $\theta$ contains all the model parameters, including those characterizing the consumer's beliefs. We can augment the model in (15) to allow the consumer to learn over time through her own endogenous brand choices (Erdem and Keane, 1996). Suppose each time the consumer purchases brand j, her corresponding consumption experience generates an unbiased, noisy signal about the quality of brand j: $s_{jt} \sim g_j(\cdot|\gamma)$. Let $D_{jt}$ indicate whether the consumer purchased brand j at time t. To simplify, we follow the convention in most of the literature and assume that the consumer's initial-period prior is $f_{j0}(\gamma) = N(\gamma_{j0}, \sigma^2_{j0})$ and that her consumption signal in a given period is $s_{jt} \sim N(\gamma_j, \sigma^2_s)$. The advantage of this Normal Bayesian learning model is that the consumer's state now consists of the beginning-of-period prior means and variances for each of the J brands, $H_t = (\gamma_{1t}, \dots, \gamma_{Jt}, \sigma^2_{1t}, \dots, \sigma^2_{Jt})$, rather than each of the J Normal prior distributions, $f_{jt}(\gamma) = N(\gamma_{jt}, \sigma^2_{jt})$. In addition, under Normal Bayesian learning, the consumer's period-t prior mean and variance for brand j depend only on her initial prior, $(\gamma_{j0}, \sigma^2_{j0})$, and the signals accumulated on the past occasions $\tau < t$ with $D_{j\tau} = 1$.

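Under these Normal assumptions, each consumption signal updates beliefs by the standard precision-weighted rule: the posterior precision is the sum of the prior and signal precisions, and the posterior mean is the precision-weighted average of the prior mean and the signal. A minimal sketch with illustrative numbers (function names and values are assumptions, not the chapter's notation):

```python
def normal_update(prior_mean, prior_var, signal, signal_var):
    """One step of Normal Bayesian learning: combine a N(prior_mean, prior_var)
    prior with an unbiased signal s ~ N(quality, signal_var). Precisions add;
    the posterior mean is the precision-weighted average."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / signal_var)
    post_mean = post_var * (prior_mean / prior_var + signal / signal_var)
    return post_mean, post_var

# Illustrative numbers: a diffuse prior (variance 4.0) shrinks toward the
# repeated signals, and the perceived risk falls with every purchase.
mean, var = 0.0, 4.0
for s in [1.0, 1.2, 0.9, 1.1]:
    mean, var = normal_update(mean, var, s, signal_var=1.0)
```

After four signals near 1.0, the prior variance falls from 4.0 to below 0.25, so the quality risk a risk-averse consumer faces (the $\rho\gamma^2$ penalty in Eq. (15)) shrinks with experience; this is the mechanism that generates inertia toward familiar brands and reluctance to try unfamiliar ones.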

The advertising cost function, $F(\psi)$, is assumed to satisfy $F_\psi(\psi) > 0$, and $F_\psi(\psi)$ is bounded above. The latter assumption ensures that, as brand quality levels increase, the incremental costs to raise quality do not become arbitrarily large. In the third stage, firms play a Bertrand pricing game conditional on the perceived product attributes and marginal costs $c(Q; \psi)$, where $Q$ is the total quantity sold by the firm. If we further assume that $c(Q; \psi) < \bar{y} < \max\{y^h\}$, where $\bar{y}$ is an upper bound on costs, then "the increase in unit variable cost is strictly less than the marginal valuation of the richest consumer" (Shaked and Sutton, 1987, p. 136). Accordingly, the model bounds from above how quickly unit variable costs can increase in the level of quality being supplied.[39] At the same time, on the demand side there will always be some consumers willing to pay for arbitrarily large brand quality levels. In other words, costs increase more slowly than the marginal valuation of the "highest-income" consumer.

Seminal work by Shaked and Sutton (1987) and Sutton (1991) derived the theoretical mechanisms through which the manner in which brands differentiate goods, $u_\psi$ and $u_d$, and the convexity of the advertising cost function, $F(\psi)$, ultimately determine the industrial market structure. The following propositions are proved in Shaked and Sutton (1987):

Proposition 1. If $u_\psi = 0$ (i.e., no vertical differentiation), then for any $\varepsilon > 0$, there exists a market size $S^*$ such that for any $S > S^*$, every firm has an equilibrium market share less than $\varepsilon$.

Essentially, in a purely horizontally-differentiated market, the limiting concentration is zero as market size increases. The intuition for this result is that, as the market size increases, we observe a proliferation of products along the horizontal dimension until, in the limit, the entire continuum is served and all firms earn arbitrarily small shares.

Proposition 2. When $u_\psi > 0$, there exists an $\varepsilon > 0$ such that, at equilibrium, at least one firm has a market share larger than $\varepsilon$, irrespective of the market size.

As market size increases for industries in which firms can make fixed and sunk investments in brand quality (i.e., vertical attributes), we do not see an escalation in entry. Instead, we see a competitive escalation in advertising spending to increase the perceived quality of products. The intuition is that a firm perceived to be higher-quality can undercut its "lower-quality" rivals. Hence, the firm perceived to be the highest-quality will always be able to garner market share and earn positive economic profits. At the same time, only a finite number of firms will be able to sustain such high levels of advertising profitably, which dampens entry even in the limit. These two results indicate that product differentiation per se is insufficient to explain concentration. Concentration arises from competitive investments in vertical product

[39] Note that we are abstracting away from the richer setting where a firm can invest in marketing over time to build and maintain a depreciating goodwill stock, as in Section 4.2.2 above.



CHAPTER 6 The economics of brands and branding

differentiation. When firms cannot build vertically-differentiated brands (by advertising) we expect markets to fragment as market size grows. In contrast, when firms can invest to build vertically-differentiated brands, we do not expect to see market fragmentation, but rather an escalation in the amount of advertising and the perseverance of a concentrated market structure. The crucial assumption is that the burden of advertising falls more on fixed than variable costs. This assumption ensures that costs do not become arbitrarily large (i.e. prohibitively large) as quality increases. Consequently, it is always possible to outspend rivals on advertising and still impact demand. This seems like a reasonable assumption for the CPG markets in which advertising decisions are made in advance of realized sales. It is unlikely that advertising spending would have a large influence on marginal (production) costs of a branded good. Extending the ESC theory to a setting in which firms make their entry and investment decisions sequentially, strengthens the barriers to entry by endowing the early entrant with a first-mover advantage. With vertical differentiation and endogenous sunk advertising costs, an early entrant can build an even larger brand that pre-empts future investment by later entrants (e.g., Lane, 1980; Moorthy, 1988; Sutton, 1991). Using cross-category variation, Sutton (1991) provides detailed and extensive, cross-country case studies that empirically confirm the central prediction of ESC theory in food industries, finding a lower bound on concentration in advertising-intense industries but not in industries where advertising is minimal or absent. Bronnenberg et al. (2011) find similar evidence for US CPG industries, looking across US cities of differing size, with a lower bound on concentration in advertising-intense CPG industries and a fragmentation of non-advertising-intense CPG industries in the larger US cities. 
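The fragmentation logic of Proposition 1 can be illustrated with a toy free-entry calculation. This is only a sketch under stated assumptions: the constant per-unit margin, the exogenous entry cost sigma, and all parameter values are illustrative, not part of Shaked and Sutton's model.

```python
# Toy free-entry illustration of the fragmentation result (Proposition 1).
# Symmetric firms split total variable profit margin*S equally; entry costs
# sigma are exogenous and fixed (no endogenous sunk costs). Firms enter
# while per-firm profit margin*S/n - sigma >= 0, so n* = floor(margin*S/sigma).
def equilibrium_share(S, margin=0.1, sigma=50.0):
    n_star = int(margin * S / sigma)  # free-entry number of firms
    return 1.0 / n_star               # each firm's equilibrium market share

# As market size S grows, each firm's equilibrium share falls toward zero.
shares = [equilibrium_share(S) for S in (10_000, 100_000, 1_000_000)]
assert shares[0] > shares[1] > shares[2]
```

With endogenous sunk advertising costs, by contrast, the effective entry cost itself escalates with S, which is what keeps the number of effective competitors, and hence concentration, bounded away from zero (Proposition 2).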
Order of entry is also found to play an important role in market structure, with early entrants sustaining higher market shares even a century after they launched their brands (e.g., Bronnenberg et al., 2009).

To understand the important role of the convexity of the marketing cost function, Berry and Waldfogel (2010) show that in the restaurant industry, where quality investments comprise mainly variable costs, markets fragment and the range of qualities offered widens in larger markets. In contrast, in the newspaper industry, where quality investments comprise mainly fixed costs, they observe average quality rising with market size without fragmentation.

Sutton (1991) even provides anecdotal evidence linking historical differences in advertising across countries to national market structure and demand for branded goods. For instance, relative to the United States, TV advertising used to be restricted in the United Kingdom. Consistent with advertising as an intangible attribute, Sutton (1991) notes that the market share of private labels is much higher in the UK than in the US.

Interestingly, the theory does not imply that the market be served exclusively by branded goods. When the consumer population consists of those who value the vertical brand attribute and those who do not, it is possible to sustain entry by both advertising and non-advertising firms. The latter will serve the consumers with no willingness to pay a brand premium. However, as the market size grows, the unbranded sub-segment

5 Branding and firm strategy

of the market fragments. In their cross-industry analysis, Bronnenberg et al. (2011) observe an escalation in the number of non-advertising CPG products in a given category in larger US cities, with concentration converging towards zero.

5.2 Brands and reputation

Traditionally, the term brand was associated with the identity of the firm manufacturing a specific commodity. As a brand developed a reputation for reliability or high quality, consumers would potentially pay a price premium for the branded good. A central idea in markets for experience goods and services is that a firm's reputation matters when consumers have incomplete information about product quality and fit prior to consuming the product (Nelson, 1970, 1974). In such markets, inexperienced consumers can be taken advantage of by firms selling low quality at high prices. The longer it takes for consumers to discover true product quality, the more beneficial it is for firms to sell low quality at a high price.

Perhaps the most straightforward role of brands in such a setting is that they allow consumers to connect one purchase to the next. This connection provides the basis for holding firms accountable for their actions even without a third party (e.g., government) enforcing contracts. In turn, it provides the basis for "reputations"—how they arise and why they are relevant to consumers.40 In common parlance, reputation is a firm's track record of delivering high quality; in theoretical models, it is consumers' beliefs about product quality.

There is a large literature that ties the provision of product quality to seller identification or the lack thereof. On the negative side, Akerlof (1970) shows that non-identifiability of firms in experience goods markets leads to a deterioration of quality because low-quality firms cannot be punished, and high-quality firms rewarded, for their actions.41 On the positive side, Klein and Leffler (1981, p. 616) observe that branding enables reputations to form and to be sustained:

"...economists also have long considered 'reputations' and brand names to be private devices which provide incentives that assure contract performance in the absence of any third-party enforcer (Hayek, 1948, p. 97; Marshall, 1949, vol. 4, p. xi). This private-contract enforcement mechanism relies upon the value to the firm of repeat sales to satisfied customers as a means of preventing nonperformance."

In a competitive market, the identification role of a brand can benefit a firm because "a firm which has a good reputation owns a valuable asset" (Shapiro, 1983, p. 659). For instance, Png and Reitman (1995) find that branded retail gasoline stations are

40 We focus herein on the role of brand reputations, referring readers looking for a broader discussion of the literature on reputation to the survey in Bar-Isaac and Tadelis (2008).

41 "The purchaser's problem, of course, is to identify quality. The presence of people in the market who are willing to offer inferior goods tends to drive the market out of existence – as in the case of our automobile 'lemons.' " (Akerlof, 1970, p. 495).




more likely to carry products and services with important experiential characteristics that can be verified through a reputation for quality, such as premium gasoline and repair services. Similarly, Ingram and Baum (1997) report that chain-affiliated hotels in Manhattan had a lower failure rate than independent hotels. In case studies of jeans and juices, Erdem and Swait (1998) find that consumer demand responds to self-reported survey measures of brand credibility.

Klein and Leffler (1981) and Shapiro (1983) examine the incentives firms need to maintain reputations. A simple model illustrates their arguments. Suppose a monopolist firm can offer high (H) or low (L) quality every period, with H costlier to produce than L: ch > cl. Suppose also that there are N consumers in the market, each looking to buy at most one unit of the product. Assume for simplicity that all consumers buy at the same time or, equivalently, that there is instantaneous word-of-mouth from one consumer to all. Consumers' reservation prices are vh and vl for products known to be high-quality and low-quality, respectively, with vh > vl. Assume vh − ch > vl − cl, i.e., the firm would prefer to offer high quality if consumers were perfectly informed about quality. The product, however, is an experience good: consumers observe only price before purchase, and observe quality after purchase.

We now analyze the firm's behavior in a "fulfilled expectations" equilibrium, i.e., an equilibrium in which consumers' expectations about firm behavior match the firm's actual behavior. In one such equilibrium, consumers expect low quality in every period and the firm delivers low quality every period. The more interesting equilibrium, however, is one in which the firm delivers a high-quality product in every period when consumers expect it to do so.
Such an equilibrium is sustained by consumer beliefs that punish the firm for reneging on its "reputation for high quality." Specifically, consumers believe that the firm offers H unless proven otherwise, in which case their future expectation is that the firm will deliver L.42 The existence of this equilibrium requires that the firm have no incentive to deviate from H in any period and, hence, never offer L. Given a discount factor ρ, the payoff along the equilibrium path is

πh = ρ(ph − ch)N / (1 − ρ),

whereas (assuming vl > cl) the payoff along the deviation path of making a low-quality product, but selling it as a high-quality product once, is

πl = ρ(ph − cl)N + ρ² max{0, (vl − cl)} N / (1 − ρ).

The reputation equilibrium is sustained, therefore, if and only if

ph ≥ ph∗ = (1/ρ)(ch − cl) + max{cl, vl}.

42 Such a punishment may appear draconian—and we will have more to say about whether real-world firms get punished this way—but for now the theoretical point is that it is the possibility of punishment that provides firms the incentive to maintain reputations.


Since we also need vh ≥ ph, this equilibrium is feasible if and only if vh ≥ ph∗.
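The feasibility and incentive conditions above are easy to check numerically. A minimal sketch follows; the parameter values are illustrative assumptions, not values from the text:

```python
# Numerical check of the simple reputation model.
# ph_star is the minimum price sustaining the "reputation for high quality";
# pi_h and pi_l are the equilibrium-path and one-shot-deviation payoffs.

def ph_star(rho, ch, cl, vl):
    return (ch - cl) / rho + max(cl, vl)

def payoffs(rho, ph, ch, cl, vl, N=1.0):
    pi_h = rho * (ph - ch) * N / (1.0 - rho)               # stay on path forever
    pi_l = (rho * (ph - cl) * N
            + rho**2 * max(0.0, vl - cl) * N / (1.0 - rho))  # cheat once
    return pi_h, pi_l

# Patient firm (rho = 0.9): the threshold lies below vh, so any price
# ph in (ph_star, vh] sustains the reputation equilibrium.
rho, ch, cl, vh, vl = 0.9, 5.0, 3.0, 12.0, 6.0
threshold = ph_star(rho, ch, cl, vl)        # (5 - 3)/0.9 + 6, roughly 8.22
pi_h, pi_l = payoffs(rho, 10.0, ch, cl, vl)
assert threshold <= vh and pi_h >= pi_l

# Impatient firm (rho = 0.2): the threshold exceeds vh, so no feasible
# price makes maintaining the reputation worthwhile.
assert ph_star(0.2, ch, cl, vl) > vh
```

The two assertions mirror, respectively, the incentive condition πh ≥ πl together with feasibility (vh ≥ ph∗), and the failure of feasibility when the firm discounts the future heavily.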


Obviously, if ρ, the discount factor, is small enough, this condition cannot hold. On the other hand, if ρ is large enough and ch − cl small enough, it is possible to find a ph ∈ (ph∗, vh]. In short, if the firm has a sufficient stake in the future, and consumers are willing to pay a sufficient premium for high quality, then the firm is willing to maintain its reputation for high quality by offering high quality, forgoing the short-term incentive to "harvest" its reputation.

Shapiro (1982) extends a model like this to continuous time, and assumes that both quality and price are continuously adjustable. In addition, he allows for general reputation functions—i.e., reputation functions that do not instantaneously adjust to the last quality provided (as in the model above). The main result is that, with reputation adjustment lags, the firm can sustain only a quality level below the perfect-information benchmark in a fulfilled-expectations equilibrium. Essentially, the firm has to pay a price for consumers' imperfect monitoring technology when it is coupled with consumer rationality.

Board and Meyer-ter Vehn (2013) extend this framework to long-lived investment decisions that affect quality and consider a variety of consumer learning processes. They find that when signals about quality are more likely to convey "good news" than "bad news," a high-reputation firm has an incentive to run down its quality and reputation, while a low-reputation firm keeps investing to increase the chance of good news. Conversely, when signals about quality are more likely to convey bad news, a low-reputation firm has weak incentives to invest, while a high-reputation firm keeps investing to protect its reputation.

In practice, the extent to which reputation incentives discipline firms to deliver persistently high quality is an interesting direction for future empirical research.
The recent experience of brands like Samsung, Tylenol, and Toyota, which rebounded quickly from product crises, suggests that consumers might be forgiving of occasional lapses in quality, even major ones.43

A limitation of the reputation literature is the assumption that firms are consumer-facing and can be held accountable for their actions. For this reason, Tadelis (1999) notes that brands are natural candidates for reputations because they are observable, even when the firms that own them are not. For example, a consumer can hold a restaurant accountable for its performance across changes in the establishment's ownership that the consumer does not observe, as long as the restaurant's name remains the same. Luca and Zervas (2016) find that restaurants that are part of an established branded chain are considerably less likely to commit review fraud on Yelp. They also find that independent restaurants are more likely to post fake reviews on Yelp when

43 See, for example, "Samsung rebounds with strong Galaxy S8 pre-orders after Note 7 disaster," New York Post, April 13, 2017. https://nypost.com/2017/04/13/samsung-rebounds-with-strong-galaxy-s8-preorders-after-note-7-disaster/.




they are young and have weak ratings.

By the same token, a new brand from an existing firm starts with a clean slate; thus a firm can undo the bad reputation of an existing brand by creating a new one. An illustration of this point appears in a case study of residential plumbers: McDevitt (2011) finds that firms with a record of frequent customer complaints typically changed their names. Changing names, however, is not costless: as we have noted above, besides the direct costs of doing so—researching names, registering the new name, etc.—the more significant expense is the cost of developing awareness of the new brand. Perhaps for this reason, companies whose corporate brands permeate their entire, large product lines—companies such as Ford, Sony, and Samsung—inevitably create sub-brands (a) to establish independent identities in multiple product categories, and (b) to insulate the corporate brand at least partially from the transgressions of any one product. Examples include Mustang for Ford, Bravia for Sony, and Galaxy for Samsung.

The idea that brands serve as repositories for reputation, and provide the right incentives for firms to maintain quality, is perhaps the most fundamental of all the ideas that the economics literature contributes to branding. Its power and empirical relevance are illustrated in a field experiment run on the Chinese retail watermelon market by Bai (2017). She randomly assigns markets either to a control condition (a traditional sticker on each watermelon identifying the seller, which is frequently counterfeited) or to a laser-cut label (more expensive to implement and, hence, less likely to be counterfeited). Over time, Bai finds that sellers assigned to the laser-cut label start selling higher-quality melons (based on sweetness) and earn a 30-40% profit increment due to higher prices and higher sales.44 These findings are consistent with the predictions of the reputational models above.
In the domain of consumer goods, retailers have created brand images for their stores and chains through the assortment of manufacturer brands they carry:

"Retailers use manufacturer brands to generate consumer interest, patronage, and loyalty in a store. Manufacturer brands operate almost as 'ingredient brands' that wield significant consumer pull, often more than the retailer brand does." (Ailawadi and Keller, 2004, p. 2)

In some cases, retailers also use exclusive store brands or private labels to enhance the reputation of their stores and chains by differentiating themselves through these exclusive offerings (Ailawadi and Keller, 2004). Dhar and Hoch (1997) report that a chain's reputation correlates positively with the breadth and extent of its private label program. More recently, to shift their reputation beyond merely providing value, retailers have expanded their private label offerings into a full line of quality tiers, including premium private labels that compete head-on with national brands (see, e.g., Geyskens et al., 2010). Recent work suggests that private labels have closed the quality gap and have become vehicles for reputation themselves. For instance, Steenkamp

44 One year after the experiment, once the laser branding was removed, the market reverted to its original baseline outcome, which was indistinguishable from a market with no labels at all.


et al. (2010) report that as private label programs mature and close the perceived quality gap, consumers' willingness to pay premia for national brands decreases.

The reputation literature underscores the role of time in establishing reputations. It is over time that a reputation develops, as the firm provides repeated evidence of fulfilling consumers' expectations. A new brand coming into a market may therefore face a "start-up problem": how to get going on the reputation journey when consumers are reluctant to try it even once. Possible solutions range from "introductory low prices," to offering money-back guarantees, to "renting the reputation of an intermediary" (Chu and Chu, 1994; Choi and Peitz, 2018; Dawar and Sarvary, 1997; Moorthy and Srinivasan, 1995).

Brand name reputation has value if and only if quality is not directly observable (Bronnenberg et al., 2015). With the advent of the Internet, independent websites providing direct information about quality have proliferated. As consumers increasingly rely on such websites for product information, the value of brand name reputation is bound to decline. Waldfogel and Chen (2006) provided early evidence of this effect: consumers using information intermediaries such as BizRate.com substantially increased their shopping at "unbranded" retailers such as "Brands For Less," at the expense of branded retailers such as Amazon.com. More recently, Hollenbeck (2017, 2018) has examined the revenue premium enjoyed by chain hotels over independent hotels and observed that it shrank over the period 2000-2015, just as online review sites such as TripAdvisor increased in popularity.

5.3 Branding as a signal

Much of what the industry refers to as "branding" activities would appear to the economist as "uninformative advertising," i.e., advertising that is devoid of credible product quality information. In a series of seminal papers, Nelson (1970, 1974) argued that seemingly uninformative advertising for experience goods may nevertheless convey information if there exists a correlation between quality and advertising spending. Assuming consumers can perceive this correlation, it is rational for them to respond to such advertising. The "money-burning" aspect of advertising then has signaling value, and a brand value can be established through the mere act of spending money on advertising associated with the brand.

A small literature has emerged that attempts to formalize Nelson's ideas. Among these efforts are Kihlstrom and Riordan (1984), Milgrom and Roberts (1986), Hertzendorf (1993), and Horstmann and MacDonald (2003). In all of these papers, a key necessary condition for Nelson's argument to work is the existence of a positive correlation between quality and advertising spending in equilibrium. This condition requires that the returns to advertising be greater for a high-quality manufacturer than for a low-quality manufacturer even after accounting for the latter's potential incentive to copy the former's advertising strategy (and thus fool consumers into thinking that its quality is higher than it actually is). In general, this condition is difficult to establish, as illustrated by both Kihlstrom and Riordan (1984) and Milgrom and




Roberts (1986). The former works in a free-entry framework, with firms behaving as price-takers and living for two periods. Firms decide whether or not to advertise at the beginning of the first period. In doing so, they trade off the benefit of being perceived by consumers as a high-quality firm, which fetches higher prices, against the financial cost of advertising spending. As Kihlstrom and Riordan's analysis demonstrates, it is possible to sustain an advertising equilibrium of the kind Nelson envisaged only under unrealistic cost assumptions or unrealistic information-transmission assumptions. For instance, if consumers learn true quality in the long run—the second period, in Kihlstrom and Riordan's framework—then marginal costs cannot be lower for the lower-quality product (for if they were, the lower-quality firm might also be tempted to advertise). On the other hand, if marginal costs are assumed to be lower for the lower-quality product, then it must be assumed that high-quality manufacturers will never be discovered to be high quality (if they do not advertise). This condition rules out, for example, consumers spreading the word about "bargains"—high-quality products sold at low prices in the first period because they were mistakenly identified as low-quality products due to their lack of advertising.

Milgrom and Roberts's (1986) monopoly model shows that additional issues arise when prices are chosen by the firm. If advertising signals quality, then it is likely that the higher-quality firm would also want to choose a higher price. But if prices also vary with quality, consumers can infer quality from the price rather than the advertising. It is unclear why, in a static model, a firm would need to burn money on advertising if it can signal quality through its prices.
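The separating condition at stake in these papers can be made concrete with a toy money-burning calculation. The two-type, repeat-purchase structure and all numbers below are illustrative assumptions, not the mechanics of any of the cited models:

```python
# Toy money-burning signal (illustrative numbers, not from the cited models).
# Consumers believe an advertiser is high quality. A high-quality seller keeps
# the customer for `repeat` further periods; a low-quality mimic is found out
# after one purchase. An advertising burn A separates the types iff it exceeds
# the low type's one-shot mimicry gain while the high type still profits.
def separating_range(margin_h, margin_l, repeat):
    lo = margin_l                 # low type gains margin_l once, then exits
    hi = margin_h * (1 + repeat)  # high type's total gain from repeat sales
    return lo, hi                 # credible burn levels: lo < A <= hi

lo, hi = separating_range(margin_h=4.0, margin_l=5.0, repeat=3)
# Separation is possible here only because repeat business pushes hi above lo;
# with repeat = 0 the range would be empty (hi = 4 < lo = 5), and the positive
# quality-advertising correlation could not arise in equilibrium.
assert lo < hi
```

The sketch shows why the condition is fragile: whether returns to advertising are higher for the high type hinges entirely on assumptions about repeat business, cost differences, and what consumers eventually learn.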
However, in a dynamic model with complete information about quality in the second period—akin to the first set of information-transmission assumptions in Kihlstrom and Riordan's framework above—advertising may be needed to signal quality, but only if marginal costs increase in quality (in contrast with Kihlstrom and Riordan's conditions). But even if marginal costs do increase in quality, the necessity of advertising to signal quality is not guaranteed. In Milgrom and Roberts's own words: "advertising may signal quality, but price signaling will also typically occur, and the extent to which each is used depends in a rather complicated way, inter alia, on the difference in costs across qualities."

Given the theoretical difficulties in establishing a signaling role for uninformative advertising, it is perhaps not surprising that empirical attempts to find a correlation between advertising and quality have turned out to be inconclusive. Several empirical studies have relied on the correlation between advertising spending and consumers' perception of product quality, using laboratory studies (e.g., Kirmani and Wright, 1989) and transaction histories (e.g., Erdem et al., 2008).45 However, the correlation

45 Ackerberg (2001) is able to identify an informative role of advertising, separately from its consumption

role—what he calls “the prestige effects of advertising”—by contrasting the purchase behavior of “new consumers” and “experienced consumers.” However, as he notes, he can’t identify how advertising is informative: “There are a number of different types of information advertising can provide: explicit information on product existence or observable characteristics, or signaling information on experience characteristics.


one seeks is between "objective quality"—the quality actually built into the product, the sort of quality that might affect production costs—and advertising spending, not between "perceived quality" and advertising spending. As Moorthy and Hawkins (2005) have noted, a correlation between consumers' perceptions of quality and advertising spending can arise through a variety of mechanisms, not necessarily Nelson's.

Turning now to the studies examining correlations between objective quality and advertising spending: Rotfeld and Rotzoll (1976) find a positive correlation between advertising and quality (as reported in Consumer Reports and Consumers Bulletin) across all brands in their study of 12 product categories, but not within the subset of advertised brands. In a more comprehensive study, using a sample frame of 196 product categories evaluated by Consumer Reports, Caves and Greene (1996b) find generally low correlations between advertising spending and objective quality. They conclude: "These results suggest that quality-signalling is not the function of most advertising of consumers goods."

More recently, in a case study of the residential plumbing industry, McDevitt (2014) documents a novel use of branding as a signal of product quality. He finds that plumbers with names beginning with an A or a number, placing them at the top of alphabetical directories, "receive more than five times as many complaints with the Better Business Bureau, on average, and more than three times as many complaints per employee" (McDevitt, 2014, p. 910). This result is shown to be consistent with a signaling model featuring heterogeneous consumer types in addition to firms of heterogeneous quality. In equilibrium, low-quality firms use easy-to-find names that cater to low-frequency customers, who have little potential for repeat business and will not find it beneficial to engage in costly search to locate the best firms.
High-quality firms are less interested in such customers, focusing instead on customers with extensive needs, who will devote more effort to searching for a good firm with which they can continue to engage in the future. These results corroborate Bagwell and Ramey's (1993) observation that cheap talk in advertising can serve to match sellers to buyers.

With the rise of online marketplaces with well-established customer feedback mechanisms, it may be interesting to study whether the informational role of brands in consumer choice begins to erode in online markets. For instance, Li et al. (2016) discuss how Taobao's "Rebate-for-Feedback" feature46 creates an equilibrium quality signal similar to that in Nelson's (1970) money-burning theory of advertising.

Of course, signaling is not the only framework in which to interpret uninformative advertising. For instance, in the marketing literature, it is widely believed that such

It would be optimal to write down and estimate a consumer model including all these possible informative effects. Unfortunately, such a model would likely be computationally intractable, and more importantly, these separate informative effects would be hard, if not impossible, to empirically distinguish given my data set."

46 Sellers have the option to pay consumers to leave feedback about the seller, where the payment is based on a Taobao algorithm that determines whether feedback is objectively informative.




advertising is useful for creating brand associations that help differentiate the brand in the consumer's mind (see Keller, 2012, Chapter 2, for a survey). More recently, the economics literature has also recognized such a role for advertising via Becker and Murphy's (1993) notion of "advertising as a good."

5.4 Umbrella branding

5.4.1 Empirical evidence

Many new products are brand extensions that leverage the reputation and/or goodwill associated with an established brand, a practice often termed "umbrella branding" or "brand stretching." Examples abound, including Arm & Hammer, originally a baking soda, which has been extended to toothpaste, detergent, and cat litter; and Sony, a brand name created for a transistor radio in 1955, which has been extended to televisions, computers, cameras, and many other categories. According to Aaker (1990), forty percent of the new products launched in US supermarkets between 1977 and 1984 were brand ext