

Developing Cross-Cultural Measurement in Social Work Research and Evaluation


POCKET GUIDES TO SOCIAL WORK RESEARCH METHODS

Series Editor: Tony Tripodi, DSW, Professor Emeritus, Ohio State University

Determining Sample Size: Balancing Power, Precision, and Practicality, by Patrick Dattalo
Preparing Research Articles, by Bruce A. Thyer
Systematic Reviews and Meta-Analysis, by Julia H. Littell, Jacqueline Corcoran, and Vijayan Pillai
Historical Research, by Elizabeth Ann Danto
Confirmatory Factor Analysis, by Donna Harrington
Randomized Controlled Trials: Design and Implementation for Community-Based Psychosocial Interventions, by Phyllis Solomon, Mary M. Cavanaugh, and Jeffrey Draine
Needs Assessment, by David Royse, Michele Staton-Tindall, Karen Badger, and J. Matthew Webster
Multiple Regression with Discrete Dependent Variables, by John G. Orme and Terri Combs-Orme
Developing Cross-Cultural Measurement, by Thanh V. Tran
Intervention Research: Developing Social Programs, by Mark W. Fraser, Jack M. Richman, Maeda J. Galinsky, and Steven H. Day
Developing and Validating Rapid Assessment Instruments, by Neil Abell, David W. Springer, and Akihito Kamata
Clinical Data-Mining: Integrating Practice and Research, by Irwin Epstein
Strategies to Approximate Random Sampling and Assignment, by Patrick Dattalo
Analyzing Single System Design Data, by William R. Nugent
Survival Analysis, by Shenyang Guo
The Dissertation: From Beginning to End, by Peter Lyons and Howard J. Doueck
Cross-Cultural Research, by Jorge Delva, Paula Allen-Meares, and Sandra L. Momper
Secondary Data Analysis, by Thomas P. Vartanian
Narrative Inquiry, by Kathleen Wells
Structural Equation Modeling, by Natasha K. Bowen and Shenyang Guo
Finding and Evaluating Evidence: Systematic Reviews and Evidence-Based Practice, by Denise E. Bronson and Tamara S. Davis
Policy Creation and Evaluation: Understanding Welfare Reform in the United States, by Richard Hoefer
Grounded Theory, by Julianne S. Oktay
Systematic Synthesis of Qualitative Research, by Michael Saini and Aron Shlonsky
Quasi-Experimental Research Designs, by Bruce A. Thyer
Conducting Research in Juvenile and Criminal Justice Settings, by Michael G. Vaughn, Carrie Pettus-Davis, and Jeffrey J. Shook
Qualitative Methods for Practice Research, by Jeffrey Longhofer, Jerry Floersch, and Janet Hoy
Analysis of Multiple Dependent Variables, by Patrick Dattalo
Culturally Competent Research: Using Ethnography as a Meta-Framework, by Mo Yee Lee and Amy Zaharlick
Using Complexity Theory for Research and Program Evaluation, by Michael Wolf-Branigin
Basic Statistics in Multivariate Analysis, by Karen A. Randolph and Laura L. Myers
Research with Diverse Groups: Diversity and Research-Design and Measurement Equivalence, by Antoinette Y. Farmer and G. Lawrence Farmer
Conducting Substance Use Research, by Audrey L. Begun and Thomas K. Gregoire
A Social Justice Approach to Survey Design and Analysis, by Llewellyn J. Cornelius and Donna Harrington
Participatory Action Research, by Hal A. Lawson, James Caringi, Loretta Pyles, Janine Jurkowski, and Christine Bolzak
Developing, Selecting, and Using Measures, by David F. Gillespie and Brian E. Perron
Mixed Methods Research, by Daphne C. Watkins and Deborah Gioia
Content Analysis, by James W. Drisko and Tina Maschi
Group Work Research, by Charles D. Garvin, Richard M. Tolman, and Mark J. Macgowan


THANH V. TRAN
TAM H. NGUYEN
KEITH T. CHAN

Developing Cross-Cultural Measurement in Social Work Research and Evaluation

SECOND EDITION


Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and certain other countries.

Published in the United States of America by Oxford University Press, 198 Madison Avenue, New York, NY 10016, United States of America.

© Oxford University Press 2017

First Edition published in 2009
Second Edition published in 2017

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by license, or under terms agreed with the appropriate reproduction rights organization. Inquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above. You must not circulate this work in any other form and you must impose this same condition on any acquirer.

Library of Congress Cataloging-in-Publication Data
Names: Tran, Thanh V., author. | Nguyen, Tam H., author. | Chan, Keith T., author.
Title: Developing cross-cultural measurement in social work research and evaluation / Thanh V. Tran, Tam H. Nguyen, Keith T. Chan.
Other titles: Developing cross-cultural measurement
Description: Second edition. | New York, NY : Oxford University Press, [2017] | Series: Pocket guides to social work research methods | Earlier edition published as: Developing cross-cultural measurement. | Includes bibliographical references and index.
Identifiers: LCCN 2016027901 | ISBN 9780190496470 (alk. paper)
Subjects: LCSH: Social work with minorities—United States—Methodology. | Cultural awareness—United States—Methodology.
Classification: LCC HV3176 .T714 2017 | DDC 361.3072—dc23
LC record available at https://lccn.loc.gov/2016027901

Printed by WebCom, Inc., Canada


Contents

Preface
Acknowledgments
1 Overview of Culture and Cross-Cultural Research
2 Process and Critical Tasks of Cross-Cultural Research
3 Adopting or Adapting Existing Instruments
4 Assessing and Testing Cross-Cultural Measurement Equivalence
Appendix 4.1
5 Item Response Theory in Cross-Cultural Measurement
6 Developing New Instruments
7 Concluding Comments
Appendix 7.1
References
Index


Preface

The United States historically has been a haven for immigrants and refugees from almost every corner of the globe. As a result, social workers have played a major role in helping newcomers settle into and adjust to their new communities. Cross-cultural issues are not new to those of us who are trained as social workers or identify ourselves within the discipline. The profession of social work has a long tradition of advocating for and serving clients from different cultural backgrounds (Lubove, 1965; Green, 1982). The impact of cultural differences on the implementation of services and the assessment of service outcomes across different social, economic, racial, and national groups is part of the stated mission of social work practice (NASW, 2015) and a core competency in education (CSWE, 2015). Yet the extent to which social workers attend to this in research, education, and practice remains insufficient. This short guide articulates the process of cross-cultural research instrument development in order to address this gap in social work research and evaluation.

MIGRATION AND CULTURAL DIFFERENCES

People migrate from place to place for different reasons, including but not limited to economic downturns, political turmoil, religious persecution, war, and calamity. Immigration researchers often classify migrants into two groups: the pulled and the pushed. Pulled immigrants migrate out of their country of origin by choice; pushed immigrants migrate because of factors beyond their control. In addition, modern transportation technologies and the global economy have opened the borders of nations and continents, allowing more people to migrate easily across them. The United Nations reported in 2016 that "the number of international migrants—persons living in a country other than where they were born—reached 244 million in 2015 for the world as a whole, a 41% increase compared to 2000" (http://www.un.org/sustainabledevelopment/blog/2016/01/244-million-international-migrants-living-abroad-worldwide-new-un-statistics-reveal/). This figure includes almost 20 million refugees. By the end of the twentieth century, with the explosion of information science and technology, people around the globe had been exposed to other cultures and could connect virtually and instantaneously with foreigners and strangers from every corner of the earth.

Changes in US immigration laws from 1965 to 1990 created opportunities for immigrants from diverse ethnic/racial backgrounds, particularly from Asian and Latin American countries, to arrive as green card holders and become eligible for citizenship. Although the United States modified its policies to make immigration more difficult after the historic and tragic events of 9/11 (Martin, 2003), the US immigrant population has continued to grow despite these changes. The US foreign-born population has evolved and grown throughout the history of the nation. In 1850, the foreign-born made up about 2.2% of the US population; by 2010, that share had increased to 12.9%. As reported by Camarota and Zeigler (2014), "the nation's immigrant population (legal and illegal) hit a record 41.3 million in July 2013, an increase of 1.4 million since July 2010. Since 2000 the immigrant population is up 10.2 million" (p. 1). According to the Pew Research Center, the 2013 US immigrant population was more than four times its size in 1960 and 1970. Today, immigrants account for 29% of US population growth since the beginning of the twenty-first century. With rapid growth since 1970, the share of the foreign-born in the overall US population rose from 4.7% in 1970 to 13.1% in 2013 and could rise to 18% by 2065 (Pew Research Center, 2015).

Recent data from the Bureau of the Census revealed that 350 languages are spoken at home in the United States. Language diversity can be found in all large US metropolitan areas; the number of languages spoken at home varies from 126 in the Detroit metro area to 192 in the New York metro area (U.S. Census Bureau, 2015). If English language ability is used as an indicator of acculturation and adaptation, only about 50% of non-English-speaking foreign-born immigrants had high English-speaking ability; the remaining 50% spoke some or no English at all (Gambino, Acosta, & Grieco, 2014). Roughly speaking, 20 million immigrants in the United States could potentially face barriers to obtaining health and social services because of limited English language ability. Indeed, US Census data also revealed that 11.9 million individuals were considered "linguistically isolated" (Shin & Bruno, 2003). This is a challenge for health and social service providers and researchers.

With demographic changes and the reality of cultural diversity in the United States and other parts of the world today, social work researchers are increasingly aware of the need to conduct cross-cultural research and evaluation, whether for hypothesis testing or outcome evaluation. This book's aims are twofold: to provide an overview of issues and techniques relevant to the development of cross-cultural measures, and to provide readers with a step-by-step approach to the assessment of cross-cultural equivalence of measurement properties. There is no discussion of the statistical theory and principles underlying the techniques presented in this book. Rather, the book is concerned with applied theories and principles of cross-cultural research, and it draws on existing work in the social sciences, public-domain secondary data, and primary data from the authors' research.

In this second edition we have made changes throughout the book and added a new chapter on item response theory. We have also expanded the chapter on developing new cross-cultural instruments with a concrete example. Chapter 1 provides readers with an overview of definitions of culture, a brief discussion of cross-cultural research traditions in anthropology, psychology, sociology, and political science, and the influences of these fields on social work. Chapter 2 describes the process of cross-cultural instrument development and discusses the preliminary tasks of that process; the chapter offers guides and recommendations for building a research support team for various critical tasks. Chapter 3 addresses the issues of adopting and adapting existing research instruments; the processes and issues of cross-cultural translation and assessment are presented and discussed in detail. Chapter 4 focuses on the analytical techniques used to evaluate cross-cultural measurement equivalence, demonstrating the application of item distribution analysis, internal consistency analysis, and exploratory factor analysis, and the application of confirmatory factor analysis and multisample confirmatory factor analysis to evaluate factor structure and test cross-cultural measurement invariance. Students learn how to generate data for confirmatory factor analysis, present the results, and explain the statistical findings concerning measurement invariance. Chapter 5 explains the basic concepts of item response theory (IRT) and its application in cross-cultural instrument development; readers and students will learn how to apply IRT to evaluate the psychometric properties of a scale, especially when selecting the right items for a newly developed scale. Chapter 6 addresses the process of cross-cultural instrument construction and demonstrates it through the development of the Boston College Social Work Student Empathy (BCSWE) scale, a process that involved all major steps of instrument construction, including pilot data collection and analysis of the validity and reliability of the empathy scale. Chapter 7 emphasizes the importance of cross-cultural issues in social work research, evaluation, and practice, and provides a list of online databases that can be useful for cross-cultural research. As the world has become a global community through migration, travel, social media, and trade, the issues of cross-cultural measurement equivalence and assessment techniques are applicable beyond any specific geographical location and can be used in research across different disciplines in the applied professions.

DATA SOURCES

Several datasets are used to provide examples throughout this book. The Chinese (n = 177), Russian (n = 300), and Vietnamese (n = 339) data were collected in the Greater Boston area at various social service agencies and social and religious institutions. These self-administered surveys were conducted to study various aspects of health, mental health, and service utilization among these immigrant communities (Tran, Khatutsky, Aroian, Balsam & Convey, 2000; Wu, Tran & Amjad, 2004). The survey instrument was translated from English into Chinese, Russian, and Vietnamese by bilingual and bicultural social gerontologists, social service providers, and health and mental health professionals. The translations were also reviewed and evaluated by experts and prospective respondents to ensure cultural equivalence.

The Americans' Changing Lives Survey: Waves I, II, and III offers rich data for cross-cultural comparisons between African Americans and whites on important variables concerning physical health, psychological well-being, and cognitive functioning. This longitudinal survey contains information from 3,617 respondents ages 25 years and older in Wave I; 2,867 in Wave II; and 2,562 in Wave III (House, 2006). The National Survey of Japanese Elderly (NSJE), 1987, has research variables similar to those used in the Americans' Changing Lives Survey; its purpose is to support cross-cultural analyses of aging in the United States and Japan. The 1987 NSJE contains data from 2,180 respondents ages 60 years and older (Liang, 1990). The 1988 National Survey of Hispanic Elderly, a survey of people ages 65 years and older (Davis, 1997), was conducted to investigate specific problems including economic, health, and social status. This telephone survey was conducted in both Spanish and English; the sample included 937 Mexican Americans, 368 Puerto Rican Americans, 714 Cuban Americans, and 280 other Hispanics.

In this edition we use two new datasets: the Children of Immigrants Longitudinal Study, 1991–2006 (Portes & Rumbaut, 2012), and the Boston College Social Work Student Empathy scale pilot data (2014). These datasets are used because they provide both micro- (within-nation) and macro- (between-nation) levels of cultural comparison. The statistical results presented throughout the book are only for illustration; readers should not interpret the findings beyond this purpose.


Acknowledgments

We appreciate the support from the Boston College School of Social Work, the William F. Connell School of Nursing, and the School of Social Welfare, University at Albany, SUNY, and from our families and significant others during the preparation of this publication.


1

Overview of Culture and Cross-Cultural Research

This chapter discusses the concept of culture and reviews the basic principles of multidisciplinary cross-cultural research. Readers are introduced to cross-cultural research in anthropology, psychology, political science, and sociology. These cross-cultural research fields offer both theoretical and methodological resources to social work. Readers will find that all cross-cultural research fields share the same concern: the equivalence of research instruments. One cannot draw meaningful comparisons of behavioral problems, social values, or psychological status between or across different cultural groups in the absence of cross-culturally equivalent research instruments.

DEFINITIONS OF CULTURE

Different academic disciplines and schools of thought often have different definitions and categorizations of culture. In fact, no agreement has ever been reached in defining culture, which has led to confusion and frustration for many researchers (Minkov & Hofstede, 2013). Scholarly interest in cross-cultural studies has roots in Greece beginning in the Middle Ages (see Jahoda & Krewer, 1997; Marsella, Dubanoski, Hamada, & Morse, 2000). However, the systematic study of cultures originated in the field of anthropology. Edward Burnett Tylor (1832–1917) has been honored as the father of anthropology, and his well-known definition of culture has been quoted in almost every major book and paper on the study of cultures. He views culture as a "complex whole which includes knowledge, belief, art, morals, law, custom, and any other capabilities and habits acquired by man as a member of society" (Tylor, Radin, & Tylor, 1958). Since Tylor's definition, there have been hundreds of other definitions of culture by writers and scholars from different disciplines and fields. As noted by Chao and Moon (2005), culture is considered one of the more difficult and complex terms in the English language; this is probably true in other languages as well. Kluckhohn (1954) defined culture as "the memory of a society."

There have also been attempts to define culture by categorizing it into different types. As suggested by Barkow, Cosmides, and Tooby (1992), there are three types of culture: metaculture, evoked culture, and epidemiological culture. Metaculture can be viewed as what makes the human species different from other species. Evoked culture refers to the ways people live under different ecological conditions, and these ecology-based living conditions lead to differences within and between cultures, known as epidemiological culture. This cultural typology suggests reciprocal relationships between psychology and biology in the development of culture and society. Wedeen (2002) suggested a useful way to conceptualize culture as semiotic practices, or the processes of meaning-making. Cultural symbols are inscribed in practices among societal members, and they influence how people behave in various social situations. For example, elder caregiving may have different symbolic meanings in different cultures, and how members of a specific culture practice their caregiving behaviors may have different consequences for the quality of life of the recipients.

Generally, culture can be viewed as a combination of values, norms, institutions, and artifacts. Social values are desirable behaviors, manners, and attitudes that all members of a group or society are expected to follow or share. Norms are social controls that regulate group members' behaviors. Institutions provide the structures through which a society or community functions. Artifacts include all material products of human societies or groups. Minkov and Hofstede (2013) noted that some cultural artifacts are culturally specific or unique and thus not feasible or meaningful subjects for comparative study across societies or cultures. However, universal elements of culture, such as social values, beliefs, behavioral intentions, and social attitudes, as well as concrete and observable phenomena such as birth rates and crime rates, can be documented and compared across cultures, groups, and societies.

The United Nations Educational, Scientific, and Cultural Organization (UNESCO) has defined culture as the "set of distinctive spiritual, material, intellectual, and emotional features of society or a social group, and that it encompasses, in addition to art and literature, lifestyles, ways of living together, value systems, traditions and beliefs." Moreover, Article 5 of its Universal Declaration of Cultural Diversity declares that "cultural rights are an integral part of human rights, which are universal, indivisible and interdependent" (UNESCO). This implies that in cross-cultural research and in health, psychological, and social interventions, respect for linguistic heritage is crucially important: clients or research participants must have the opportunity to communicate in their own languages. Hence cross-cultural translation is undoubtedly an important element in the conduct of cross-cultural research.

All things considered, culture can be viewed as a set of social markers that make people unique and distinct from each other based on their country of origin, race, or mother tongue. Although people differ in their cultures, languages, races, religions, and other respects, there are universal values and norms across human societies. The challenge of social work research is to investigate the similarities in the midst of obvious differences and diversity.

MULTIDISCIPLINARY PERSPECTIVES OF CROSS-CULTURAL RESEARCH

Cultural Anthropology

Cultural anthropologists are pioneers in cross-cultural research and have influenced the other cross-cultural research disciplines in the social sciences. Burton and White (1987) suggested that cross-cultural research offers a fundamental component of meaningful generalizations about human societies. Cross-cultural anthropological research has encompassed several key variables or focuses across cultures or societies, such as the roles of markets and labor, division of labor and production, warfare and conflicts, socialization and gender identity, reproductive rituals, households and polygyny, gender beliefs and behaviors, expressive behavior, technology, settlement patterns and demography, social and kinship organization, and spirits and shamanism (Burton & White, 1987; Jorgensen, 1979). The field of social anthropology can be divided into two schools: social anthropology, involved in the comparative study of social structures, and ethnology, or comparative and historical cultural anthropology, which studies cultures (Singer, 1968, p. 527). Two classic theories dominated the field of anthropology in the first half of the twentieth century: process-pattern theory, which emphasizes the analysis of cultural patterns, and structural-functional theory, which focuses on the study of cultural structure. Salzman (2001) reviewed and discussed the major theories that have guided cultural anthropology research. Functionalism emphasizes interconnection and mutual dependence among societal institutions or customs. Processualism focuses on the assumption that members of a society have the power to change the structures and institutions with which they live, that is, that people are agents of their own actions and behaviors. Materialism suggests that economic conditions are the determining factors of cultural transformation. Cultural patterns theory emphasizes that different cultures have different principles that provide the framework for their unique values, and it assumes that one can understand a culture only through its own values and perspectives. Cultural evolution theory assumes that people, society, and culture change over time. Today, anthropology has become a discipline with several specializations. As Nader (2000, p. 609) noted, "For most of the twentieth century, anthropology was marked by increased specialization." She emphasizes a new direction for anthropology in the twenty-first century, "an anthropology that is inclusive of all humankind, reconnecting the particular with the universal, the local and the global, nature and culture," and she stresses that culture must be viewed as "part of nature and the changing nature of nature is a subject for all of us" (p. 615).

Cross-Cultural Psychology

The fundamental focus of cross-cultural psychology is the understanding of human diversity and of how cultural factors or conditions affect human behavior (Berry, Poortinga, & Pandey, 1997). Similarly, Triandis (2000) suggested that "one of the purposes of cross-cultural psychology is to establish the generality of psychological findings" and that "the theoretical framework is universalistic, and assumes the psychic unity of humankind" (Kazdin, 2000, p. 361). A more comprehensive definition is offered by Berry and associates: "the study of similarities and differences in individual psychological functioning in various cultural and ethnic groups; of the relationships between psychological variables and sociocultural, ecological, and biological variables; and of current changes in these variables" (Berry, Poortinga, Segall, & Dasen, 1992, p. 2). Cross-cultural psychology can also be viewed as "the systematic study of behavior and experience as it occurs in different cultures, is influenced by culture, or results in changes in existing culture" (Berry & Triandis, 2004, p. 527). The field has at least two traditions: quantitative methods that rely on statistical comparisons across different cultures, and qualitative methods that employ the field study methods used by cultural anthropologists (see Berry & Triandis, 2004). Cross-cultural psychology is also defined as the study of overt (behavioral) and covert (cognitive, affective) differences between cultures (Corsini, 1999, p. 238). Cross-cultural psychological research therefore generally encompasses systematic analysis, description, and comparison of various cultural groups or societies in order to develop general principles that may account for their similarities and differences (Corsini, 1999, p. 238).

Cross-cultural psychology has paid close attention to emic and etic approaches in psychological inquiry. An emic approach emphasizes variation within cultural phenomena, whereas an etic approach focuses on differences across cultural groups. The emic approach is concerned with the unique issues and problems found within one culture or group; more specifically, it studies behaviors, attitudes, and social values inside a culture using instruments developed within it. The etic approach, on the other hand, studies behaviors, attitudes, and social values on the assumption that they are universal, and it can employ instruments developed outside a target population or society (Berry, 1969). Both approaches, however, suffer from conceptual and methodological dilemmas: a purely emic approach can hinder true cross-cultural comparison, and a purely etic approach may not reveal true cultural differences because it can impose external concepts and measures on a unique culture or society. Malpass (1977, p. 1069) articulated the purpose of cross-cultural psychology as "a methodological strategy and a means of bringing into focus methodological and conceptual issues that are frequently encountered in unicultural research." Consequently, the development of sound measurement instruments that can capture true cultural differences and articulate cultural uniqueness is a fundamental challenge of cross-cultural psychology research.

Cross-Cultural Political Science Research

The foundation of political science is built on fundamental concepts such as freedom, democracy, and political equality, and political research focuses on the causal relationships among these theoretical variables (King, Murray, Salomon, & Tandon, 2004). There have been conflicting views within political science concerning the relevance of culture to political inquiry: some view culture as having no function in the understanding of politics; others see it as a possible cause of political outcomes, such as the development and transformation of democracy across societies (Wedeen, 2002). Cross-cultural political scientists have studied cultural changes and their consequences for social and political structures, and they have found strong relationships between the values and beliefs of mass publics and the existence of democratic institutions. These researchers identified 12 areas of human values and beliefs that can be found in different cultures of the world: ecology, economy, education, family, gender and sexuality, government and politics, health, individual, leisure and friends, morality, religion, society and nation, and work. Cultural changes that occur as a result of economic development are best understood as an interaction between economic development and the cultural heritage of a society; that is, each society's unique culture can mitigate or ignite the cultural changes that accompany economic development (Inglehart, Basanez, & Moreno, 1998; Inglehart, 1999, 2000). Studies of political cultures are important for understanding ethnic politics and the transformation of democracy (Pye, 1997). The well-being of individuals and societies is no doubt shaped by different political systems: political systems and structures can either foster or hinder the development of economic well-being and the protection of human rights.

Comparative Sociology

European fathers of sociology such as Durkheim and Weber believed that general social laws can be derived from observed similarities among societies, while observed differences among societies provide an understanding of social change in different areas of the world (Steinhoff, 2001). Functionalism and modernization theory have served as two key theoretical frameworks for cross-cultural sociology. Functionalists view society as a combination of different components that can function smoothly only if those internal components work together; modernization theorists see social change as the consequence of a transition from traditional values to modern ones (Steinhoff, 2001). Although sociologists often argue that all sociological research involves comparison (Grimshaw, 1973), comparative sociology is recognized as "a method that is deemed to be quite different from other sociological methods of inquiry. When the concept of comparative sociology is used, this phrase then refers to the method of comparing different societies, nation-states or culture in order to show whether and why they are similar or different in certain respects" (Arts & Halman, 1999, p. 1). Kohn (1987, p. 714) categorized comparative sociology into four types: "Those in which nation is object of study; those in which nation is context of study; those in which nation is unit of analysis; and those that are transnational in character." Kohn emphasized the importance of cross-national research that treats nation as the context of study; this type of research focuses on "testing the generality of findings and interpretations about how certain social institutions impinge on personality" (p. 714). Kohn also noted that the term cross-national research is more straightforward than cross-cultural research, because cross-cultural research can involve comparisons of different subgroups within a nation. Thus, sociological research comparing nations can be referred to as cross-national research, and research comparing subcultures within a nation as cross-cultural research.

Cross-cultural researchers from these disciplines have employed similar methodologies for data collection and analysis; the differences lie in their research orientations and interests. Cross-cultural psychology, with its emphasis on the effects of culture on social behavior and social cognition, provides social work with practical guidance in designing social and psychological interventions that can confront both unique and common problems among individuals and groups. Cross-cultural anthropological research can enhance social work research because it provides social work researchers with both theoretical directions and methodological frameworks for understanding how people from different cultures cope with their daily life situations. Cross-cultural political science research is useful for global social work practice and research: understanding cultural factors in political development is instrumental for social workers involved in practice in various countries or with various governments. Although this book is not about global social work, globalization is a real process permeating every society and every culture (Tomlinson, 1999). How each society or political system reacts to and accommodates globalization will affect the well-being of its citizens, and social work will have to determine how to confront the negative impacts of globalization on members of different societies. Comparative sociology, from both the functionalist and the modernization perspectives, can likewise inform cross-cultural research. For example, social work researchers can draw on these two major theoretical frameworks to explain the breakdown of family systems as a breakdown in communication among family members or in family members' ability to move from traditional values to modern ones. The study of the causes of family dysfunction across societies benefits social work in the development of culturally sensitive and appropriate family interventions. Inquiry into the influences of culture on human behavior is also fundamental to social work as we attempt to devise and provide effective interventions for both individual and societal problems in different cultural contexts. The four dimensions of cultural values established by Hofstede (1980) and the dimensions of personality established by McCrae and John (1992) could be useful for social work in developing cultural competence guidelines and training. It is fair to say that cross-cultural social work research differs from the other cross-cultural disciplines in its context and implications. It can borrow both theories and methodologies from those disciplines and refine them to meet the new challenges of our profession in this global and diverse society.

ISSUES IN CROSS-CULTURAL SOCIAL RESEARCH AND EVALUATION

The common aim of cross-cultural or cross-national research in the social sciences is the systematic comparison of human behaviors, social values, and social structures, and of how these variables influence individuals or social systems between or among different cultural groups, including nations, societies, or subcultural groups within a larger national or social system. The social work profession has a long history of recognizing the importance of cultural influences on human behaviors and social work practices. In the Blackwell Encyclopedia of Social Work, Robinson (2000, p. 222) suggests that "To meet the needs of culturally diverse populations, social workers must have an understanding of culturally consistent assessment, evaluation and treatment skills, as well as theoretical content." This prescription of multicultural social work embraces the ideas of cross-cultural comparability of assessment, outcomes, treatment implementation, and causal explanation.

The National Association of Social Workers (NASW) has its own standards for cultural competency (NASW, 2001). In that document, NASW endorses 10 cultural competence standards for its members, encompassing ethics and values, self-awareness, cross-cultural knowledge, cross-cultural skills, service delivery, empowerment and advocacy, diverse workforce, professional education, language diversity, and cross-cultural leadership. Among these 10 standards, standard 4 describes cross-cultural skills: "Social workers shall use appropriate methodological approaches, skills, and techniques that reflect the workers' understanding of the role of culture in the helping process." This standard is relevant and important to social work researchers and evaluators: it implies that they should employ appropriate research instruments, methodologies, and statistical methods in conducting cross-cultural social work research and evaluation. The Council on Social Work Education (CSWE) also emphasizes the importance of cultural diversity in its educational policy and accreditation standards (CSWE, 2001). It requires that accredited social work programs "educate students to recognize diversity within and between groups that may influence assessment, planning, intervention, and research. Students learn how to define, design, and implement strategies for effective practice with persons from diverse backgrounds" (CSWE, 2001, p. 9). Both NASW's standard of cultural competence and CSWE's educational policy and accreditation standards emphasize the importance of recognizing the cultural dimension of social work practice. This is also the foundation of cross-cultural social work research and evaluation. A few examples of common goals of cross-cultural social work research and evaluation are:

1. To understand how people from different cultures cope with their life situations, including economic, physical, psychological, and social situations;
2. To identify the risk factors of psychological and social pathology across cultures; and
3. To evaluate the appropriateness, effectiveness, and impacts of social policies, programs, and interventions on the well-being of people from different cultures.

From the intervention perspective, we can define cross-cultural social work as the implementation of evidence-based practices (interventions) across different cultural groups, communities, societies, and nations. Cross-cultural social work research and evaluation therefore involves the study of the appropriateness and efficacy of evidence-based social work interventions across different cultural groups. The fundamental issue is whether a treatment or intervention is culturally appropriate and produces similar outcomes among clients of different cultures. We can also frame this issue as whether social workers can successfully implement a treatment or intervention proved effective for one cultural group with other groups that have different cultural backgrounds. The answer is contingent on several issues, the most important of which is the cross-cultural equivalence of treatment and outcome measures. One cannot draw a valid conclusion about the efficacy of a treatment across different cultural populations if the treatment was operationalized and implemented in a different manner for each cultural group. Furthermore, if the outcome measure bears no similarity in psychometric properties across groups, then the comparison is not warranted.

A meaningful, appropriate, and practical research instrument or questionnaire is a prerequisite for quality in cross-cultural social work research and evaluation; it allows researchers and evaluators to collect the correct data for either cross-cultural hypothesis testing or outcome evaluation. There are two issues concerning measurement in cross-cultural social work research and evaluation: conceptual equivalence and statistical equivalence. Conceptual equivalence is the first step in designing and planning a cross-cultural research or evaluation project: key research variables and outcome measures must bear conceptual equivalence across the cultural groups of clients or participants. More specifically, both independent and dependent variables must have similar meanings between two, or among several, comparative groups. Conceptual equivalence encompasses both linguistic equivalence and cultural equivalence. Linguistic equivalence refers to the equivalence of the translation of a concept or an instrument between the selected languages of clients or participants. Cultural equivalence requires that a research variable or treatment be understood and accepted by clients or participants who come from different social structures with unique social orientations and value systems, such as an individualistic versus a collective social orientation. Researchers and evaluators can achieve linguistic and cultural equivalence for the selected research variables or treatments through cultural translation and cultural evaluation of the content, format, and structure of the questionnaire items or research instruments. For example, when a question is developed to collect information on a symptom of depression, we need to ensure that the symptom exists across the cultural groups, that the language used in each group reflects the meaning of the depressive symptom, and that the format of the question or item and the overall structure of the questionnaire are culturally relevant and appropriate for all groups. Both linguistic and cultural equivalence may be achieved by employing appropriate cross-cultural translation procedures and expert evaluations.

How does a social work researcher know that a scale or an outcome measure has conceptual equivalence in two or more comparative cultural groups? The answer relies on information collected from both professionals and laypersons representing the cultural groups under investigation. This can be done through literature review, focus group meetings, town hall meetings, in-depth interviews with professionals and laypersons or prospective clients, and statistical analyses. When a scale is translated from one language to another, there are cross-cultural translation procedures that researchers and evaluators should follow. This book discusses and illustrates the use of both descriptive statistics and confirmatory factor analysis to evaluate the cross-cultural equivalence of research instruments. It should be noted that although this book emphasizes the importance of measurement equivalence in cross-cultural social work research and evaluation, cultural sensitivity and cultural appropriateness are the foundation of all types of social work research and intervention. Researchers may employ sound cross-culturally equivalent instruments and still encounter resistance or non-cooperation from the target population if they are culturally insensitive or inappropriate in their relationship with the community. Similarly, evidence-based interventions must be implemented in different ethnic communities with cultural sensitivity and appropriateness.

In chapter 2 we will address the process and critical tasks of cross-cultural research. The process can be applied in local (national) and international settings. Most of the practical issues are discussed from a national perspective; researchers may have to modify some aspects of the process in global or international settings.


2

Process and Critical Tasks of Cross-Cultural Research

A research instrument is defined as a systematic and standardized tool for data collection; the term includes all types of research questionnaires and standardized scales. There are three approaches to cross-cultural research instrument development: adopting an existing instrument, adapting or modifying an existing instrument, and developing a new instrument. To develop a cross-culturally valid questionnaire or instrument, the concepts or constructs selected for investigation must be clearly defined and must bear the same meanings across the selected cultural groups. A clear definition of the measurement constructs is of utmost importance at every level of comparative cultural research and evaluation, whether the comparison is by sex or race/ethnicity within one society or across nations. As Smith (2004) noted, "An essential goal of cross-national survey research is to construct questionnaires that are functionally equivalent across populations" (p. 3).

Adoption of an existing instrument uses direct translation of the research instrument or questionnaire from the source language to the target language without considering cultural differences. This is an efficient and cost-effective method, but it suffers from potential cultural inequivalence because researchers often attend only to linguistic equivalence and ignore conceptual equivalence across cultures. The problem with the adoption approach is its assumption of cultural equivalence between the source language and the target language: researchers impose their own cultural values or biases on the target population by assuming that the instruments they adopt carry the same meanings in both cultures.

Adaptation of an existing instrument calls for careful translation and modification to achieve cultural equivalence between the source language and the target language (Pan & De La Puente, 2005). This approach requires the researchers or the research team to be competent in both cultures. There are different levels of adaptation, including eliminating part of the original instrument, replacing some items of the original scale with new items, or applying a multilevel translation procedure that avoids verbatim translation and thus emphasizes the equivalence of concepts rather than the equivalence of language between cultures.

Developing a new instrument often requires the researchers to start from a vague idea and go through numerous iterations of identifying and defining the research concepts and variables, transforming concepts into variables, and developing questions or items that capture the meanings of the research variables across the different research populations.

The flowchart in Figure 2.1 outlines the process of cross-cultural instrument development. It depicts a self-reflective process: one must always evaluate the current phase and return to previous phases for revision or modification until all members of the research team can accept the final instrument. The first four steps of the process are the same whether one is developing a new instrument or adopting or adapting an existing one. The process illustrated in Figure 2.1 is elaborated throughout the book.

The flowchart delineates the necessary steps in the process of cross-cultural instrument development. It begins with the preliminary tasks, including formulating the research aims in the context of cross-cultural settings.

[Figure 2.1 (flowchart): Research Aims → Setting Boundary and Target Populations → Identifying & Defining Research Variables / Literature Review → Developing New Instruments or Adopting/Adapting Existing Instruments → Developing New Indicators / Translation → Cross-Cultural Evaluation → Pilot Testing → Revision → Instrument Assessment → Final Instrument]

Figure 2.1  The Process of Cross-Cultural Instrument Development.

The sources and motivation of research aims are generally as follows:

1. Funding agencies or institutions. Researchers are expected to develop research aims that reflect the interests of a funding agency or institution. In this situation, researchers often do not have the flexibility to do what they want to do.
2. Community or national needs. Researchers have to develop research aims that address the needs of communities or nations.
3. Individual and team interests. Researchers are free to pursue what interests them, with fewer external constraints.

Once the research aims, questions, or hypotheses are identified, formulated, and agreed upon by key research personnel, the scope of the cross-cultural comparisons and the target populations must be thoroughly discussed. Some of the key elements of this task are:

1. Research locations. Choosing the geographical areas in which to conduct the research project depends on the available resources, the representativeness of the locations, and the availability of and accessibility to the research participants.
2. Number of comparison groups and sample sizes. The research team must agree on the number of comparison groups and the number of participants representing each comparison group.
3. Key characteristics of participants. It is also important to decide on selection criteria such as age, sex, education, income, and the nature of the social or psychological illness.

Representatives from the selected target populations, communities, or groups must be included in the evaluation of the research aims, questions, and hypotheses before other research tasks can be developed and implemented. The selected research aims must have similar levels of importance and meaningfulness among the participating cultural groups or populations. The key research variables are derived from the research aims, questions, and hypotheses. The task of identifying and defining the selected research variables is as important as identifying and formulating the cross-cultural research aims: the variables must reflect the aims from a cross-cultural perspective.

After the variables are identified and defined, researchers can adopt or adapt existing instruments or develop new ones. Conducting a comprehensive and systematic literature review can help researchers find existing instruments suited to their research aims. If there is a need to develop new instruments, researchers should involve both experts and prospective participants in their attempts to define and measure the research variables. Adopting and adapting existing instruments requires the application of multilevel translation procedures to successfully translate the existing instruments from a source language to the target languages. Developing new instruments that can be used across different cultural or linguistic groups is complicated and expensive. In either case, systematic assessments of conceptual equivalence and measurement equivalence are required before data collection and analysis can begin.

The process illustrated in the flowchart suggests two general levels of analysis, qualitative and quantitative. The qualitative level involves the following:

1. Formulating the common research aims. Researchers can use experts, in-depth interviews, and focus groups during the process of formulating the cross-cultural research aims for their projects.
2. Setting the research boundaries and populations. It is important for researchers to have current demographic information about the prospective research populations.
3. Generating common hypotheses and variables across cultural groups. The involvement of experts, community leaders, practitioners, and prospective research subjects through in-depth interviews and focus groups is needed.
4. Building the research team and support staff. The research team and support staff must be both methodologically and culturally competent. All team members must be familiar with the cultures of the research populations. If possible, the research team and support staff should also be members of the research groups, communities, or societies.
5. Establishing rapport with the communities and prospective participants. Well-designed research methods and instruments are necessary but not sufficient for the success of the research project. Researchers cannot implement their projects without the participation of the research populations. Establishing rapport with community leaders and providing relevant information about the project to the communities are essential. The research team may use existing social media technology to recruit participants and invite the community to engage in the research project.

The above tasks require in-depth discussion, exchange of ideas, and negotiation among all key stakeholders. It is always challenging to find consensus among a diverse group of stakeholders.

The quantitative level involves the following:

1. The collection of data. Careful preparation and planning for data collection is crucial for the success of data analyses.
2. Assessing the cross-cultural validity and reliability of the instruments. This involves establishing cross-cultural psychometric equivalence, or similarity in the reliability and factorial structure of the measurement of key variables across cultural groups.

Using the internal consistency coefficient and the correlation of each item with the overall scale is one way to check the statistical equivalence of a composite scale or index across cultural target groups (Van de Vijver & Leung, 1997). If a scale has five items, then the correlation of each item with the overall scale should be similar across the comparative groups. When a scale has a similar α-coefficient and similar item-total correlations across groups, it displays one indicator of statistical equivalence. Subsequently, researchers and evaluators can employ exploratory and confirmatory factor analyses to evaluate the factorial equivalence of a scale or an index across the selected cultural groups. Similar reliability alone, however, does not constitute cross-cultural equivalence; one needs to look further into the factorial equivalence of the scale. For a scale to be considered to have conceptual and statistical equivalence between two cultural groups under investigation, its items (designed to capture the selected behavior, attitude, or psychosocial problem) must bear similar meanings in both cultural groups and should exhibit similar statistical evidence in terms of reliability and factorial structure.
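As a concrete illustration of the reliability check just described, the following minimal sketch computes the α-coefficient and the corrected item-total correlations separately for each group. It is not drawn from this book, which demonstrates its analyses in SPSS, Stata, and LISREL; the file name, the five item columns, and the "culture" grouping column are hypothetical stand-ins for a real translated scale.

```python
# Illustrative sketch (not from this book): comparing Cronbach's alpha and
# corrected item-total correlations across cultural groups with pandas.
import pandas as pd

ITEMS = ["item1", "item2", "item3", "item4", "item5"]  # hypothetical item names

def cronbach_alpha(items: pd.DataFrame) -> float:
    # alpha = (k / (k - 1)) * (1 - sum(item variances) / variance(total score))
    k = items.shape[1]
    item_vars = items.var(ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def item_total_correlations(items: pd.DataFrame) -> pd.Series:
    # Correlate each item with the sum of the remaining items
    # (the corrected item-total correlation).
    return pd.Series(
        {col: items[col].corr(items.drop(columns=col).sum(axis=1)) for col in items}
    )

df = pd.read_csv("survey.csv")  # hypothetical pooled dataset with a 'culture' column
for culture, rows in df.groupby("culture"):
    scale = rows[ITEMS].dropna()
    print(f"\n{culture} (n = {len(scale)})")
    print(f"Cronbach's alpha: {cronbach_alpha(scale):.3f}")
    print(item_total_correlations(scale).round(3).to_string())
```

Roughly equal alphas and item-total patterns across the groups are consistent with, but do not establish, measurement equivalence; the factor-analytic checks described above are still required.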


The results of descriptive reliability analysis, factor analysis, and multisample confirmatory factor analyses will help the researchers decide whether an instrument can be used across different research populations. This book will illustrate the application of SPSS, STATA, and LISREL for the assessment of cross-cultural equivalence of the research instruments.
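To make the reliability check concrete, the comparison can be run separately within each group in STATA. The following is a minimal sketch, not the book's worked example; the variable names (item1-item5 for a five-item scale, ethnic as the group identifier) are hypothetical:

    * Internal consistency and item-rest correlations, run separately per group
    * (hypothetical variables: item1-item5 for a five-item scale, ethnic = group)
    alpha item1-item5 if ethnic == 1, item
    alpha item1-item5 if ethnic == 2, item

Similar α-coefficients and item–rest correlations across the two runs provide one indicator of statistical equivalence; clearly different values flag items for further review.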

PRACTICAL RECOMMENDATIONS FOR CROSS-CULTURAL INSTRUMENT DEVELOPMENT

As mentioned earlier, the quality of the results of a cross-cultural research project depends on the quality of its measures. Following are a few practical recommendations for cross-cultural research instrument development.

Research Aims. A cross-cultural research project without meaningful and clear research aims will only lead to confusion. Funding sources often determine the scope and aims of a cross-cultural research project. In this situation, the researchers respond to specific instructions of the funding institution, and the research aims must articulate the vision and goals of that institution. Regardless of funding sources, cross-cultural social work researchers should always involve different stakeholders (e.g., clients, service providers, community leaders) in defining and formulating the research aims. The researchers should ensure that their research aims bear similar meanings across cultural groups and that the expected outcomes will improve or enhance the well-being of the target populations in areas such as health, mental health, quality of life, or living standards.

Setting Scope and Target Populations. Cross-cultural research involves systematic comparisons of selected variables between two or more cultural groups. The researchers need to identify the key variables and the populations among which they will make such comparisons. Obviously, the scope of a cross-cultural research project is defined by the available resources, the feasibility of the project, and the participation of the target populations. The research team should seek community support from the various cultural groups at the local level. If the research involves cross-national comparisons, then seeking support from governments, institutions, and political leaders becomes crucial for the success of the project.


Cross-Cultural Evaluations of Research Aims. Once the preliminary research aims, variables, and target populations are identified, the research team should convene at least one focus group of stakeholders to evaluate the importance and relevance of the research aims. With current Internet communication technology such as Skype (http://www.skype.com/en/features/), team members can communicate over long distances not only by voice; they can also join group video conferences from anywhere in the world to discuss the research project and exchange text messages and documents. There are three basic questions that require clear answers before the research project can move to the next phases:

1. Are the proposed research aims meaningful among the cultural groups?
2. Do the proposed aims have cross-cultural equivalence among the cultural groups?
3. Will the potential results have any meaningful impact on the quality of life of the comparative populations?

If these questions are not satisfactorily answered, then the research team should revise the research aims until they are acceptable.

Identifying and Defining Variables and Literature Review. Once the research aims are evaluated by the research team and stakeholders, the research team must identify and define the relevant variables that reflect the research aims. The variables must bear similar meanings across the selected comparative groups. Although Internet chat rooms and video conferences can be useful, the dynamic of face-to-face interaction among the research team members is important for the process of identifying and defining the research variables. A focus group involving the research team and stakeholders is recommended to generate meaningful variables for the project. The goals of the focus group should include the following:

1. Develop a comprehensive list of variables representing the research aims.
2. Select the variables that appear to carry some level of cultural equivalence among the comparative groups.
3. Define the selected variables so that they capture cultural differences both within and between groups.


Once the variables are selected and defined, the researchers should review the existing literature for the operationalization of the variables, that is, to identify any existing instruments that can be used to measure the selected variables. If the team members and the expert consultants agree that an existing instrument is adequate and appropriate for the project, the next step is to translate the instrument into the target languages. The translation must be subjected to in-depth evaluation and pilot testing before it can be used; Chapter 3 is devoted to the procedure of cross-cultural translation. If there are no existing instruments, the research team needs to develop a valid and reliable instrument that can be used across the selected cultural groups.

Cross-Cultural Research Support Team. Building a strong research support team is one of the most crucial tasks in securing the successful implementation of the research project. The support team includes cultural experts, community advisors, interviewers for cognitive interviews, moderators for focus groups, and translators for the translation process.

CULTURAL EXPERT RECRUITMENT CRITERIA

The research team will have to rely on cultural experts to assist the process of instrument development, from the evaluation of existing instruments to the construction of new ones. The following criteria can be used as a guide to select cultural experts to participate in the development and evaluation of indicators.

Language Proficiency. It is difficult for a researcher who has no or limited language skills to conduct a research project on an ethnic or cultural group whose language differs from the researcher's. In such a situation, the researcher should bring to the team individuals who have the language skills to serve as cultural experts for the research project.

Relevant Professional Credentials. A good cultural expert must have appropriate training in the field related to the problem under investigation. Having the right language skills relative to a cultural group is necessary for a cultural expert, but it is not sufficient. This expert must also be trained in the field. A cultural expert who is fluent in the language of the research population and highly educated but who lacks formal training related to the research project would not be able to
help the research team evaluate the research instrument effectively. When it is difficult to find ideal cultural experts, the researcher should offer adequate training to the prospective experts before allowing them to engage in the instrument development and evaluation process.

Knowledge of History, Geography, Arts, Literature, and Customs. The cultural expert should have some knowledge of the history, geography, arts, literature, and customs of the research population; such knowledge gives the researchers insight into how members of that group behave. A cultural expert who possesses these qualities will better help the research team throughout the process of instrument development.

Bilingual and Bicultural Ability. When the researchers borrow a research concept that has been well defined in one language (e.g., English), it is useful to have cultural experts who are bilingual and bicultural to help the research team translate the selected concept into the target language.

INTERVIEWER RECRUITMENT CRITERIA

Interviewers will be needed to conduct both cognitive interviews and pilot survey interviews. The interviewers should have the following characteristics:

1. Knowledge of the culture of the target research group or population.
2. A sense of respect and tolerance for cultural differences.
3. Ability to communicate with interviewees.
4. Flexibility to adapt to interviewees' situations.

PARTICIPANT RECRUITMENT CRITERIA

Participants from different target populations will be needed to participate in cognitive interviews, focus groups, and pilot surveys. They should be selected based on the following criteria.

Having an ability to communicate openly. Selected participants should have an ability to speak for themselves and on behalf of others.
Because the researchers want to learn how members of a cultural or ethnic group communicate their feelings and attitudes toward a particular social or psychological phenomenon, it is important to recruit prospective participants who can help the researcher define such a phenomenon in a common language that can be understood by other group members. Participants' observations of friends and relatives can also give the researcher valuable insight into the concerns of interest.

Having previous experience relevant to the research concepts or problems. If the researchers want to develop a measure of coping with disasters, then only those prospective participants who have had direct or indirect experiences with disasters should be invited to participate in in-depth interviews.

Having the willingness to share experiences with others. Participants who are willing to discuss their feelings or experiences with others will help the researchers collect relevant information for defining and articulating the research concept.

FOCUS GROUP MODERATOR RECRUITMENT CRITERIA

The quality of the data collected from the focus groups also depends on the quality of the focus group moderator. An effective focus group moderator should have the following qualifications.

Cultural Competency. The moderator has to have an in-depth knowledge of the culture of the target population to appreciate the nuances of participants' behaviors and attitudes in light of their own cultural meanings and context. This will help generate participants' engagement in the group discussion.

Language Competency. The moderator is expected to be fluent in at least two languages to fully facilitate the focus group in the evaluation of the linguistic and cultural comparability of the research instrument.

Creativity. The moderator has to be able to think on his or her feet and to create group dynamics that generate meaningful discussion and evaluation of the research instrument.

Analytical Skills. Good analytical skills help the moderator raise meaningful questions and guide participants to the right discussion.

Verbal Skills. The moderator should be able to communicate and articulate the research aims, as well as the meanings of the research
variables and instruments, to the participants. The moderator's verbal skills will stimulate participants to stay focused on their tasks during the focus group meetings.

Detail-Oriented. The moderator should be able to follow through on the details of the instruments, the research procedures, and the goals and objectives of the research instruments. This ability will help the moderator stay on course and probe for subtle information from the participants.

Listening Skills. The success of a focus group largely depends on the ability of the moderator to listen to the participants. Listening and showing respect for participants' ideas will help them reveal information they would not reveal if they thought the moderator was not interested in them.

Empathy. Being able to share participants' feelings and appreciate their attitudes in a group setting will hold the attention of participants and enhance their willingness to share their views and thoughts about the purpose of the focus group meeting.

COMMUNITY ADVISOR RECRUITMENT CRITERIA

Having knowledgeable and committed community advisors will strengthen the research project. Following is a list of people the researchers may want to invite to serve as advisors for a cross-cultural research project.

Prospective Consumers of Research. Professionals who will use the results of the research to improve their practices and interventions affecting the lives of their clients.

Prospective Participants or Subjects. Individuals who are members of the population among whom the research will be conducted. These are the people who will be directly affected by the outcomes of the research.

Community or Civic Leaders. Individuals who hold notable leadership positions, such as religious leaders, educators, politicians, and leaders of various social organizations in the community in which the research will be conducted.


Trained Professionals in the Field of Interest. Researchers who have conducted research similar to the current project or who have special training in areas related to the current research project.

This chapter has provided a list of essential tasks and recommendations for researchers to prepare before engaging in the construction and assessment of cross-cultural research instruments. Without clear and meaningful research aims, it is difficult for the researchers to move ahead with the project. Similarly, without a strong research support staff and the right participants, the research team will not be able to successfully develop meaningful research instruments.

Chapter 3 is about the translation of existing research instruments from a source language to different target languages. The potential pitfalls of research instrument translation are numerous, including both technical and conceptual aspects. Readers will find examples of hands-on experience and practical guides to carry out the translation of a research instrument successfully.


3

Adopting or Adapting Existing Instruments

CROSS-CULTURAL TRANSLATION AND RELATED ISSUES

Cross-cultural researchers in health, psychology, the social sciences, and social work have often used existing measures or scales developed in English and translated them into the target languages. Both adopting (using the original scale as it is) and adapting (modifying the original scale) existing research instruments often require translating the selected instrument from a source language to a target language. Cross-cultural translation is one of the major tasks in cross-cultural research. The task becomes more challenging when an instrument is translated into two or more target languages simultaneously, especially with the translation of special constructs (Symon, Nagpal, Maniecka-Bryla, Nowakowska, Rahidian, Khabiri & Wu, 2013; Price, Klaassen, Bolton-Maggs, Grainger, Curtis, Wakefield & Young, 2009).

This chapter will (1) review existing cross-cultural translation approaches and offer the reader practical guidelines; (2) present a multilevel translation process encompassing back-translation, expert evaluation, cognitive interviews, focus group evaluation, and field evaluation; and (3) offer a guide for best practices in selecting translators to perform
cross-cultural translation. Several examples will be presented to illustrate potential biases in cross-cultural translation and cross-cultural data analysis.

Researchers have employed a variety of translation approaches, and it appears that no single approach has become universal. In cross-cultural psychology, Brislin, Lonner, and Thorndike (1973) recommended that cross-cultural translation involve back-translation, bilingual techniques, a committee approach, and a pretest. More specifically, back-translation is a process wherein an instrument is translated from its original language to a different language, and the translated version is then translated back to the original language to assure conceptual equivalence. Maneesriwongul and Dixon (2004) reviewed 47 studies that involved cross-cultural translation and concluded that there is a lack of consensus on standards for cross-cultural translation. Three common approaches to cross-cultural translation have been used:

1. A forward-only translation approach is a one-way translation from the source language to a target language. This approach is not recommended because it includes no evaluation of reliability and validity.
2. A forward translation with testing approach is stronger, but it still does not address the issue of cross-cultural validity and reliability.
3. A back-translation approach appears to be stronger than the previous two approaches, but it tends to emphasize literal translation, or linguistic equivalence, which does not by itself constitute cross-cultural equivalence.

Harkness (2003) has offered a more desirable approach to the survey questionnaire translation process called Translation, Review, Adjudication, Pretesting, and Documentation (TRAPD). This is a committee-based approach that involves translators, reviewers, and an adjudicator. Committee or team approaches to translation have been recognized as more effective than other approaches (Guillemin, Bombardier, & Beaton, 1993; Harkness, Pennell & Schoua-Glusberg, 2004). The translation can be performed via parallel translations or split translations. In parallel translations, translators work
independently before submitting their work for committee review and evaluation. In split translations, used when two or more nations or groups share the same language, each group translates a part of the whole research instrument. Neither the committee approach nor the TRAPD process addresses the issue of sex representation in translation or evaluation. In this book, sex representation is required in the questionnaire construction and translation process.

Harkness (2003) has suggested that translators should be "skilled practitioners who have received training on translating questionnaires" (p. 36). Skilled translators should have adequate knowledge of the research topic and target population to perform a valid translation. It would be difficult for a translator whose formal training is in engineering to perform a translation of a research questionnaire for a study of depression. This same principle should also be applied to the reviewers of the questionnaire translation and to those who adjudicate the final translation of the questionnaire. Harkness also recommends the use of "team approaches" for the cross-cultural translation of research instruments. The key advantage of this approach is that team members with diverse backgrounds will help the team determine the best translation outcomes (Guillemin, Bombardier, & Beaton, 1993).

Figure 3.1 illustrates the cross-cultural translation process from the source language (e.g., English) to the target language (e.g., Vietnamese). Because there is no gold standard for cross-cultural translation procedures, this book combines the previously used cross-cultural translation procedures into the schematic description shown in the figure. This translation process is time consuming and requires the researchers to carefully select advisory committees and translators and to use different methods of translation evaluation to achieve a quality cross-cultural translation of an instrument. The five-step flowchart depicted in Figure 3.1 illustrates a comprehensive procedure for cross-cultural translation involving one target group.

[Figure 3.1  Cross-Cultural Translation Process. Flowchart: Advisory Committee → parallel translations from English to Vietnamese by one female and one male translator, each back-translated from Vietnamese to English by one female and one male translator → Evaluation → Pilot Testing → Final Instrument.]

ADVISORY COMMITTEE

The first step in the cross-cultural translation procedure is to form an advisory committee that will work with the research team to select an appropriate instrument or a survey questionnaire that can be used to collect data for the research project. This advisory committee should be composed of professionals who are trained in the area of the research interest and who understand the nature and scope of the research problems or questions concerning the research population. The rationale for these selection criteria is straightforward: only individuals who are trained in the area of research interest can help the research team select the appropriate instruments or measures for the research purposes. These individuals also need to be culturally and linguistically
competent. If the research project involves more than two groups, then each group should have at least one female and one male expert representative serving on the advisory committee. This sex representation principle should apply to all translation activities. More than one expert in each group is called for to assure a diversity of voices and perspectives. Throughout the book, the authors emphasize the representation of both sexes in all key instrument development processes because sex is a complex variable encompassing socialization, identity, and communication (Cameron, 1988).

FORWARD AND BACKWARD TRANSLATION

The second step is to recruit and hire competent bilingual translators. The research team should work with the advisory committee to screen, hire, and provide adequate training to the selected translators on key aspects of the research project. For each linguistic group there should be at least four translators, two females and two males, who work independently to perform translation and back-translation from the source language (e.g., English) to the target language (e.g., Vietnamese) and vice versa. The translators must have an adequate understanding of the research aims and key research concepts to perform a valid translation. Translators must be instructed to avoid verbatim translation and, rather, to emphasize the comparability of the concepts and ideas between the source and target languages.

EVALUATION

The third step is to evaluate the translation. The research team should employ multiple evaluation methods, including expert appraisal and review (evaluation committee), cognitive interviews, focus groups, and pilot testing. An independent group of experts or the same advisory committee should evaluate the translation and back-translation versions of the translated instrument. This group should meet with the translators to clarify and verify their translations. The evaluation matrix presented in Table 3.1 can be used for the expert group evaluation. The evaluation matrix allows the research team to collect both quantitative and qualitative evaluation information. Each evaluator is asked to respond "yes" or "no" to each evaluation criterion and to provide a brief explanation for the rating. Simultaneously, the team should conduct cognitive interviews and focus groups as integral parts of the evaluation process.

Table 3.1  Cross-Cultural Translation Evaluation Matrix

                      Member 1               Member 2               Member 3
Evaluation            Yes/No  Explanation    Yes/No  Explanation    Yes/No  Explanation
Item 1
  Language Clarity
  Appropriateness
  Difficulty
  Relevance
Item 2
  Language Clarity
  Appropriateness
  Difficulty
  Relevance
Item 3
  Language Clarity
  Appropriateness
  Difficulty
  Relevance
Item K
  Language Clarity
  Appropriateness
  Difficulty
  Relevance

The four criteria are defined as follows.

Language Clarity. Committee members evaluate the use of words and syntax of the translated item and its back-translation.

Appropriateness. Committee members determine whether the translated items are culturally appropriate, in both language and meaning, for the target population.

Difficulty. Committee members determine whether the translated items are difficult for prospective respondents or participants to understand and respond to.

Relevance. Committee members determine whether the translated items are culturally relevant to the participants' experiences in real-life situations.
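If the committee's ratings are also entered as data, the quantitative side of the matrix can be summarized quickly. A minimal STATA sketch, assuming a hypothetical layout of one record per item, criterion, and evaluator, with rating coded 1 = yes and 0 = no (this layout is illustrative, not prescribed by the book):

    * Mean of the 0/1 ratings = proportion of "yes" per item and criterion
    * (hypothetical variables: item, criterion, rating coded 1 = yes, 0 = no)
    tabulate item criterion, summarize(rating) means

The cell means are the proportions of "yes" ratings, which makes items with low committee agreement easy to spot.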

COGNITIVE INTERVIEWS

As mentioned earlier, cognitive interviews should be used to evaluate the research questionnaire (Rothgeb, Willis & Forsyth, 2005). The cognitive interview is a form of in-depth interview that allows prospective research subjects to express their feelings toward the research instrument in terms of its appropriateness, usefulness, and meanings, and to explain how they understand the questionnaire and how they respond to it. The research team works with the evaluation committee to recruit a representative sample of 6 to 12 prospective research subjects for each target group to participate in the cognitive interviews. The sample size recommended here is similar to the sample sizes recommended for other types of qualitative research (Morgan, 1988). Think-aloud and verbal probing approaches to cognitive interviewing should be used concurrently. The interviewers begin by reading the translated item to the participants and asking them to say whatever comes to their minds as they listen to the item. Subsequently, the interviewers should use the criteria in the evaluation matrix (see Table 3.1) to probe for more information. Digital recording should be used as the means of data collection.


Interviewers must be trained to have a clear understanding of the purpose of the research project and the meaning of the research questionnaire and its items. Interviewers must possess skills to probe for in-​ depth answers and encourage the participants to reveal their thoughts and feelings concerning the translated items of the questionnaire. Interviewers should probe the participants regarding four aspects of the quality of the translation: language clarity, cultural appropriateness, difficulty, and relevance. Clarity refers to the use of words and syntax of the items. Appropriateness refers to the suitability of the translated items to the research participants’ culture and values. Difficulty refers to the participants’ cognitive ability to respond or react to each translated item. Relevance refers to the connection of the translated items to respondents’ real-​life experiences within their cultural context.

FOCUS GROUPS

Focus groups can be conducted simultaneously with cognitive interviews as part of the evaluation process. The research team should conduct at least three focus groups: one female group, one male group, and one mixed-sex group. The single-sex groups provide a more comfortable atmosphere for participants to bring up sex-related issues before those issues are discussed in the mixed-sex group. Each linguistic or cultural group should have a minimum of six participants (Fayers & Machin, 2007). Having within-sex, between-sex, and culture-group meetings sounds complicated, but these meetings are desirable because they generate the rich information necessary for a comprehensive assessment of cross-cultural equivalence.

Group members should receive the complete translation of the questionnaire at least one week before the focus group meeting. Focus group participants are asked to review and discuss the quality of the translation using the four quality criteria presented in Table 3.1. All members must have an equal opportunity to discuss their ideas concerning the translation, and the moderator must ensure the participation of each member during the focus group meetings. Discussions should be recorded on audio or video. Each meeting should last no more than 2 hours, because participants are likely to lose focus and interest in longer meetings.


DATA ANALYSIS AND SYNTHESIS

Evaluation data compiled from committee evaluations, cognitive interviews, and focus groups are transcribed and edited for an overall evaluation. The translators, the advisory committee, and the researchers then work together to revise and refine the research instrument or survey questionnaire for field pilot testing.

PILOT TESTING

The purpose of the pilot testing is to establish the feasibility of the research instrument, evaluate the quality of the interviewing methods, and assess sources of missing data and other aspects of the implementation of the research instrument. Structured pilot interviews via survey are conducted for reliability and validity evaluations. If possible, telephone interviews, face-to-face interviews, and mail surveys can be used simultaneously to improve the validity of the translation. A reliable and valid translated questionnaire should produce similar data regardless of the data collection method.

The sample size of the pilot survey test is determined by the size of the instrument and the available resources. From the factor analysis perspective, each item of a scale requires a minimum of 5 to 10 subjects (Fayers & Machin, 2007). If the questionnaire consists of several standardized scales, then the team can use the scale with the largest number of items or questions as the guide to determine sample size; for example, if the longest scale has 20 items, the pilot sample should include roughly 100 to 200 respondents. Random sampling is always a desirable method to draw a sample, but it is often not feasible. Therefore, the research team should make every effort to recruit individuals from diverse backgrounds within the target population to participate in the pilot test.

Finally, once data from the pilot tests are compiled and analyzed, the evaluation committee will review the results and work with the research team to finalize the translation. The research team will combine data collected from expert groups, cognitive interviews, focus groups, and pilot survey testing to finalize the research instrument for data collection. The process of translation is time-consuming and can be expensive. However, the benefits of having a valid and reliable translation outweigh the cost and time that the researchers have to spend to achieve the most desirable translated instrument. Researchers should continue
to update and refine the translations of cross-cultural instruments to keep pace with changes in language, culture, and communication.

CROSS-CULTURAL TRANSLATION ISSUES AND BIASES

There are several issues and biases that affect cross-cultural translation, such as the quality of translated instruments and potential problems in using secondary data analysis for cross-cultural comparisons. These issues and biases often are the results of attempts to replicate existing measurements that were developed in a source language such as English for studies among people whose primary languages are not English. Biases also arise out of pooling data from various sources for global comparisons. Some of the potential biases in cross-cultural translation and analysis can be avoided with careful planning and implementation of translation procedures and analysis.

MULTILINGUAL TRANSLATION

When two or more linguistic groups are involved in a comparative study, the cross-cultural translation process becomes more complicated. The procedure illustrated in Figure 3.1 must be modified to achieve optimum translation outcomes. The composition of the advisory and evaluation committees must include individuals from all participating groups. Members of these committees should be able to communicate through a common language. For example, if a researcher plans to study depression among Vietnamese and Russian immigrants in the United States, then the evaluation committee members are expected to be bilingual (English–Vietnamese and English–Russian).

RECRUIT AND TRAIN TRANSLATORS

In addition to language proficiency, prospective translators are expected to have knowledge of the research field and the culture of the target population. They should be trained to have a good understanding of the research aims and the meanings of the research variables. Translators
must also have good communication and listening skills to work with other translators and the research team. When there are no available trained and experienced translators, the team must design a special training program for prospective translators. This training program should include cultural characteristics, linguistic requirements such as terminologies related to the research topics, and substantive knowledge of the research area.

RECRUIT SUBJECTS FOR EVALUATION

Prospective research subjects or participants from the comparative cultural groups or communities are recruited to participate in the process of evaluating the translation. These individuals need to have good language skills in their own language and the critical thinking ability to evaluate the quality and accuracy of the translation. They are people who can represent their own ethnic or cultural groups. It is expected that these individuals have a good understanding of the research aims, the underlying purposes of the research instruments, and an ability to speak for themselves and on behalf of their communities.

RECRUIT AND TRAIN INTERVIEWERS FOR TRANSLATION EVALUATION

Interviewers are recruited and trained to conduct cognitive interviews for the evaluation of the translation. They must go through rigorous training, including instruction on how to read the questionnaire clearly and probe for appropriate information, record and maintain quality data, and respect and protect participants' confidentiality.

VERBATIM TRANSLATION

When a researcher uses an instrument developed in a language that is different from the language of the target population, verbatim translation of the selected research instrument is not recommended. Literal equivalence of language often confuses respondents from a different culture. For example, the CESD scale includes the item "I felt that I was just as good as other people." This item appears to be straightforward and should be easily translated into other languages, but it turns out that when it is translated into Russian and Vietnamese, the item has very poor reliability. Looking back on our translation process, we did not pay as much attention to this item as we did to more obviously difficult items such as "I felt that I could not shake off the blues even with help from my family or friends." The translation team put more effort into translating this latter item than into the item "I felt that I was just as good as other people." To avoid verbatim translation of items that appear to be straightforward, the translators should give equal attention to all items.

In the case of the item "I felt that I could not shake off the blues even with help from my family or friends," two groups of translators (three females and three males) worked independently to translate the difficult language of the questionnaire. The two groups then met with the principal investigator (P.I.) for an open discussion of the similarities and differences between the group translations in order to arrive at a consensus. The groups discussed the meanings of the difficult items in English and their translated meanings in Vietnamese. The final translated items did not have language equivalence, but they did have cultural and conceptual equivalence. Table 3.2 shows the reliability of these two items in the Russian and Vietnamese samples. Differences in corrected item–total correlations between cultural groups are a sign of poor cross-cultural comparability (van de Vijver & Leung, 1997).

Table 3.2  Corrected Item–Total Correlation of "Shake off the blues" and "As good as other people"

Item                        Vietnamese   Russian
Shake off the blues         .634         .675
As good as other people     .013         .044

SINGLE TRANSLATOR

The use of a single translator is not recommended because of the lack of validity and reliability checks for the quality and accuracy of the translation (Maneesriwongul & Dixon, 2004). This bias becomes even more
serious when the researcher has no language skills and is foreign to the target population.

TRANSLATORS WITHOUT APPROPRIATE TRAINING BACKGROUND

Translators who have no training background relevant to the research questions will not be able to perform valid and reliable translation of the research instrument. For example, a translator who is fluent in the language of the target research population, but who has no training in public health or social work, would not be able to produce a quality translation of a research instrument designed to collect data on health and mental health problems.

FAILURE TO EVALUATE THE VALIDITY AND RELIABILITY AFTER TRANSLATION

One should never assume that if an instrument has been well developed for one particular group, it is sufficient to translate it and replicate it in another group without a reaffirmation of its validity and reliability. It is recommended that researchers always evaluate the factor structure or configuration of the scale and its internal consistency once it is translated to the target language.
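Such a reaffirmation can begin with a quick exploratory check. A minimal STATA sketch, with hypothetical names (titem1-titem10) for the translated scale items:

    * Re-examine factor configuration and internal consistency after translation
    * (hypothetical variables: titem1-titem10 = translated scale items)
    factor titem1-titem10, pcf    // principal-component factoring
    alpha titem1-titem10, item    // Cronbach's alpha with item-rest correlations

If the translated scale reproduces the factor configuration and internal consistency reported for the source-language version, it passes this first screen; the following chapters present fuller equivalence tests.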

INCONSISTENCY OF TRANSLATION PROCEDURES

When cross-cultural comparison is performed on a variable that has been translated from one original language into different target languages, one should be mindful of the consistency of the translation procedures. Inconsistency in the translation procedures will affect the equivalence of the reliability and validity of the measurement of the research variables, and the results could be severely biased. Table 3.3 illustrates differences in internal consistency as a result of inconsistent translation procedures. Each of the studies presented in Table 3.3 employed a different procedure of cross-cultural translation.

Table 3.3  Corrected Item–Total Correlation of Three CESD Items

CESD Items          Chinese     Japanese    Russian     Vietnamese
                    (N = 175)   (N = 2119)  (N = 299)   (N = 339)
I felt depressed    .587        .483        .591        .790
I felt lonely       .472        .565        .637        .674
I felt sad          .507        .524        .712        .710
Cronbach's Alpha    .697        .705        .798        .853

The statistics presented in Table 3.3 suggest that the three items of the CESD Scale have somewhat weak cross-cultural equivalence. The "I felt depressed" item has cross-cultural equivalence between the Chinese and Russian samples. The "I felt lonely" item has a similar corrected item–total correlation between the Russian and Vietnamese samples, as does the "I felt sad" item. When we examined the equivalence of Cronbach's α-coefficients across the four samples, the only two samples that exhibited similar Cronbach's α-coefficients were the Chinese and the Japanese. Nevertheless, the corrected item–total correlation of the "I felt sad" item appears to be different between these two samples.

The different reliability statistics presented in Table 3.3 could be cultural, methodological, or both. Cultural variations are more difficult to recognize than methodological variations. From the methodological perspective, these studies employed different translation procedures. The Japanese survey used "an extensive translation, back-translation, retranslation, and pilot testing process in order to ensure a very high degree of comparability" (Sugisawa et al., 2002, p. 791; Liang et al., 2005). The Russian and Chinese surveys used one professional bilingual translator, committee evaluation, and pilot testing to ensure comparability (Tran, Khatustsky, Aroian, Balsam, & Conway, 2000; Wu, Tran & Amjad, 2004). Overall, the translation procedure of the Japanese survey was more desirable than those of the Russian and Chinese surveys. The Vietnamese study used an approach to translation similar to the Japanese survey; however, it emphasized the use of equal numbers of female and male translators to avoid sex biases in translation (Tran, Ngo & Conway, 2003). These statistics suggest that one should be cautious in using the composite scores of these items for cross-cultural comparisons, especially when one compares the meaning of the items across groups.


INCONSISTENCY IN RESEARCH DESIGNS AND DATA COLLECTION

When pooling data from different studies for statistical comparisons, researchers must account for the variations of design and data collection methods. The results in Table 3.3 also suggest that the variation of study designs can affect the reliability and validity of the research instrument.

CULTURAL VARIATION WITHIN THE SAME LINGUISTIC POPULATION

Researchers could be mistaken in assuming that language is the only marker of cultural similarities or differences. Sharing the same language does not imply an equivalence of measurement. Table 3.4 presents the corrected item–total correlations of five items that were designed to measure "negative" feelings, or negative well-being, among three groups of Hispanic elderly individuals (Davis, 1997).

Table 3.4  Corrected Item–Total Correlation of Five Bradburn's Negative Items

Negative Items      Survey       Cuban       Mexican     Puerto Rican
                    (N = 2235)   (N = 692)   (N = 757)   (N = 358)
Restless            .480         .432        .459        .570
Lonely              .526         .451        .540        .528
Bored               .532         .430        .523        .598
Depressed           .574         .536        .566        .648
Upset               .294         .243        .292        .261
Cronbach's Alpha    .722         .665        .718        .755

Although the respondents are members of three ethnic groups that share similar languages (Spanish and English), the statistics presented in Table 3.4 indicate that the reliability of the selected negative well-being items varies across the three groups. These three major groups of Hispanics do not share the same culture. They have different immigration histories and are very diverse in terms of family values, religion, and socioeconomic backgrounds (Bean & Stevens, 2003; U.S. Census Bureau, 2001). This suggests that researchers should not assume that similarity of languages is equivalent to similarity of measurement. It is always a good practice to verify the equivalence of the measurement of the variables used among the comparative groups, even if they share a language.

DIFFERENT LANGUAGE OF INTERVIEW WITHIN A CULTURAL GROUP

The language of interview can influence the reliability of an instrument even when it is used within the same cultural group. Respondents of The National Survey of Hispanic Elderly could be interviewed in English or in Spanish. A small number of respondents chose to be interviewed in English (n = 308), and the majority were interviewed in Spanish. It should be noted that the survey instrument was developed in English and translated into Spanish. Table 3.5 contains the reliability analysis of the five negative items.

Table 3.5  Corrected Item–Total Correlation of Five Bradburn's Negative Items Between English and Spanish

Negative Items      English Interviews   Spanish Interviews
                    (N = 308)            (N = 1991)
Restless            .369                 .496
Lonely              .580                 .518
Bored               .524                 .533
Depressed           .523                 .581
Upset               .370                 .284
Cronbach's Alpha    .722                 .724

Although three of the five items had similar corrected item–total correlations in the two samples, the item "Restless" had a poorer correlation in the sample interviewed in English, whereas the item "Upset" had a poorer corrected item–total correlation in the sample interviewed in Spanish. Researchers can control for this problem by making sure that interviewers in the different languages follow the same protocols, or by statistical control: language of interview can be included as a covariate in multivariable analysis, such as multiple regression analysis or other multivariate statistical procedures.

This chapter provides an overview of existing translation procedures and illustrates a multilevel translation process and procedure. The authors used existing cross-cultural data to demonstrate potential biases of cross-cultural translation. The quality of the translation of adopted or adapted instruments is determined by its reliability and validity. The researchers can use the pilot evaluation data to decide whether an instrument is ready to be implemented. The techniques to assess the cross-cultural reliability and validity of research instruments will be explained and illustrated in Chapters 4 and 5. Chapter 6 is devoted to the process of developing and constructing new cross-cultural research instruments.


4

Assessing and Testing Cross-​Cultural Measurement Equivalence

This chapter illustrates multiple statistical approaches to evaluating the cross-cultural equivalence of research instruments: the data distribution of the items of the research instrument, the pattern of responses to each item, the corrected item–total correlation, exploratory factor analysis (EFA), confirmatory factor analysis (CFA), and reliability analysis using the parallel test and the tau-equivalence test.

Equivalence is the fundamental issue in cross-cultural research and evaluation. A cross-cultural comparison can be misleading for two reasons: (1) the comparison is made using different attributes, or (2) the comparison is made using different scale units. But even when the problems of equivalence in attributes and scale units are resolved, a valid cross-cultural comparison is not guaranteed. In every step of the research process, the researcher must ensure that the same attention is paid to equivalence in concepts, operationalization, methods, analysis, and interpretation.


There has been a general consensus that multigroup (multisample) confirmatory factor analysis offers a strong approach to evaluating the cross-cultural equivalence of measurement properties. Researchers have proposed different procedural steps in the testing of measurement equivalence hypotheses, including the equivalence of the covariance matrices of the observed indicators and the equivalence of factor means among groups. Joreskog and Sorbom (2001) recommend the testing of five general hypotheses:

1. equivalence of covariance matrices of observed indicators of a scale or an instrument,
2. equivalence of factor patterns of observed indicators,
3. equivalence of factor loadings of observed indicators on their respective factors,
4. equivalence of measurement errors of observed indicators, and
5. equivalence of factor variances and covariances across groups.

Generally speaking, invariance of the factor pattern and factor loadings is sufficient to determine whether a construct can be measured across different cultural, national, or racial groups.

In comparative research (whether cross-cultural, cross-national, or multigroup), the assumption of measurement equivalence is crucially important. If nonequivalent measures were used, the outcomes would be seriously biased (Hui & Triandis, 1985; Poortinga, 1989). This lack of equivalence tends to disproportionately affect vulnerable populations such as ethnic minorities, the elderly, immigrants, sex non-conforming persons, and other health disparities populations, because the most often-used measurements were first developed and validated for largely mainstream white populations. The result can be inaccurate assessments of the evidence of disparity for the most vulnerable, leading to poor resource allocation to these populations. Additionally, in intervention research, biased measurements can fail to detect real changes in clinical trials for health disparities populations, leading to Type 2 errors in interpretation. Conversely, inaccurate measurements can also suggest that a tested treatment works when in fact it does not, leading to Type 1 errors in interpretation.
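The loading-invariance hypotheses in this list can be tested by fitting the same factor model with and without cross-group constraints and comparing the fits. The following is a minimal STATA sem sketch of the idea, not the LISREL setup illustrated later in the book; the factor name (SelfEsteem), item names (item1-item10), and group variable (ethnic) are hypothetical:

    * Multigroup CFA: loadings free across groups vs. constrained equal
    * (hypothetical variables: item1-item10, group identifier ethnic)
    sem (SelfEsteem -> item1-item10), group(ethnic) ginvariant(none)
    estimates store free_loadings
    sem (SelfEsteem -> item1-item10), group(ethnic) ginvariant(mcoef)
    estimates store equal_loadings
    lrtest free_loadings equal_loadings  // nonsignificant result supports loading invariance

A nonsignificant likelihood-ratio test is consistent with equivalent factor loadings across the groups; a significant test points to noninvariant items that need closer inspection.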

GENERAL ISSUES IN CROSS-CULTURAL EQUIVALENCE COMPARISON

From methodological perspectives, there are at least four general issues of cross-cultural equivalence: concept, measurement, data, and analysis.


Conceptual equivalence requires that a concept or variable be defined similarly across cultural groups. Measurement equivalence requires that the research instruments used to collect data for the defined variables bear the same meanings and psychometric properties. Data equivalence requires that data be collected in the same manner across cultural groups, such as the use of the same sampling designs and data collection techniques (i.e., telephone interviews, face-to-face interviews, or mail survey). Equivalence of analysis requires that the same statistical methods be used for the data across groups.

For example, if one wants to compare sex differences in self-esteem among three ethnic groups, then it is assumed that self-esteem is defined similarly and understood by subjects in all three comparative ethnic groups, that a self-esteem scale with similar psychometric properties among the three groups is used, that data on self-esteem are collected using the same sampling designs and data collection techniques, and that one statistical test, such as a t-test, is used to compare mean differences in self-esteem between females and males in each of the three groups.

This chapter focuses on the use of descriptive statistics, reliability analysis, and factor analysis as techniques to ascertain cross-cultural equivalence of selected measurements, for researchers who are not familiar with more advanced methods of cross-cultural measurement equivalence evaluation. Concrete examples from existing research will be used to elucidate the process and procedure of cross-cultural equivalence evaluation. Table 4.1 highlights the major tasks that are needed to deal with potential problems in establishing equivalence of research instruments in cross-cultural research.
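To make the equivalence-of-analysis requirement concrete, the same test must be applied identically in every group. A minimal STATA sketch, with hypothetical variables selfesteem, female (coded 0/1), and ethnic (coded 1-3):

    * The same t-test applied identically within each ethnic group
    * (hypothetical variables: selfesteem, female coded 0/1, ethnic coded 1-3)
    ttest selfesteem if ethnic == 1, by(female)
    ttest selfesteem if ethnic == 2, by(female)
    ttest selfesteem if ethnic == 3, by(female)

Running a different test, or a differently coded grouping variable, in any one group would break the equivalence of analysis.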

RESEARCH DESIGNS

All comparative studies should employ similar sampling designs, data collection techniques, and data analysis procedures. Details of the research designs should be listed side by side for all research populations or groups, including the procedures used to develop sampling frames, sample size estimates, and power analysis.

Table 4.1  Cross-Cultural Equivalence Issues in Research and Comparison: Solution Tasks When Using Data from Different Sources or Different Cultural Groups

Research designs
  Examine the comparability of research design issues between studies: Content Analysis
Descriptive analysis
  Examine the comparability of data distribution, response pattern, missing data, etc.: Descriptive Statistics
Reliability analysis
  Examine the comparability of measurement reliability of measures used in selected studies: Internal Consistency, Parallel Test, and Tau-Equivalence Test
Validity: factorial structure of latent dimensions of a scale or measure
  Examine the comparability of the factorial structure of a scale or measure across groups
Exploratory factor analysis (EFA)
  Examine the preliminary factor pattern of a scale between groups
Confirmatory factor analysis (CFA)
  Test the goodness of fit of the factorial structure of a scale within each group
Multigroup CFA
  Test the comparability of measurement properties of a scale between groups
Item Response Theory
  Compare group comparability of item difficulties, item characteristic curves (ICC), and differential item functioning (DIF)

DESCRIPTIVE ANALYSIS

Descriptive analysis requires the researcher to evaluate the shape of the data distribution and the pattern of responses for each item of a scale or an instrument between the comparative groups. Differences in the data distribution and in the response patterns of the instrument items are early signs of differences in the reliability and validity of the research instrument between the comparative groups.

In the following example, 10 items of the Rosenberg Self-Esteem Scale were evaluated across two samples: Vietnamese youth and Cuban youth living in the United States. These samples represent multiple levels of cross-cultural research: both are refugee groups living in the United States, coming from two different countries of origin with different cultures, values, practices, and languages. The data come from the Children of Immigrants Longitudinal Study (CILS), which sampled children with at least one foreign-born parent or children born abroad but brought to the United States at an early age (Portes & Rumbaut, 2012). Table 4.2 lists the items, or questions asked of respondents, in the Rosenberg Self-Esteem Scale.

Table 4.2  10-Item Rosenberg Self-Esteem Scale
1. I am a person of worth.
2. I have a number of good qualities.
3. I am inclined to feel I am a failure.
4. I do things as well as other people.
5. I do not have much to be proud of.
6. I take a positive attitude toward myself.
7. I am satisfied with myself.
8. I wish I had more respect for myself.
9. I certainly feel useless at times.
10. At times I think I am no good.

DATA DISTRIBUTION: EQUIVALENCE IN DATA DISTRIBUTION OF THE ITEMS

This matter is verified by an analysis of the shape of the item distributions. We recommend that all data cleaning and descriptive analyses be conducted in SPSS or STATA. The SPSS procedures DESCRIPTIVES and FREQUENCIES or the STATA commands TABULATE and SUMMARIZE will provide the statistics in Tables 4.3 and 4.4. The scores of each item ranged from 1 to 4 (1 = Agrees a lot, 2 = Agrees a little, 3 = Disagrees a little, 4 = Disagrees a lot) in all samples. This is important because different scoring patterns produce different means and shapes for the data.

Table 4.3  Skewness and Kurtosis of 10 Self-Esteem Items Across Two Cultural Groups

                        Vietnamese (n = 215)    Cuban (n = 566)
Item                    Skew      Kurtosis      Skew      Kurtosis
Person of worth         1.29      4.05          2.05      7.29
Have good qualities     1.11      3.81          1.83      6.53
Feel I am a failure     −0.83     2.49          −1.93     5.97
Do things as well       1.66      5.49          1.52      5.07
Not much to be proud    −0.74     2.33          −2.07     6.30
Positive attitude self  1.05      3.78          1.35      4.33
Satisfied with self     0.88      2.98          1.33      4.18
Wish respect self       0.01      1.85          −0.28     1.59
Feel useless            −0.07     2.13          −0.44     1.89
Think no good           −0.24     1.81          −0.99     2.61

Two important statistics in Table 4.3 that need to be examined and compared across groups are the skewness and kurtosis. Skewness is a measure of symmetry and can be understood in terms of "tails." If the data on a particular item are normally distributed, the scores spread out equally on both sides of the mean; if there is a longer "tail" on one side of the mean, then the distribution of the data is skewed. Kurtosis indicates whether the distribution of an item is peaked or flat relative to a normal distribution (NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/, 2008). A high level of kurtosis may indicate that the data cluster around certain responses and that there is a lack of variation in the distribution. In the context of measurement equivalence assessment, we expect the skewness and kurtosis of each item of a scale or an index to have similar values, and thus similar distribution shapes, across the comparative groups. Differences in the values of skewness and kurtosis suggest nonequivalence in psychometric properties, such as reliability and factor loadings.

It is important to note that if analysts wish to perform transformations to help reduce skew and kurtosis, the same procedures must be performed for all populations in order to make comparisons. In some cases where responses cluster on one score (for example, if almost every respondent gives the same response), transformations will not substantially improve skew or kurtosis: transformations can even out a distribution, but they cannot introduce variation into the data.

Between the two samples, the Vietnamese sample has more items (6 of 10) with skewness values that approach 0; only three items in the Cuban sample have skewness values that approach 0. Overall, all of the skewness values in the two samples are fairly modest (less than 2 in absolute value). With respect to the kurtosis values, the items "do things as well" and "person of worth" have greater values compared to other items in the
samples. However, they are more extreme in the Cuban (5.97 and 7.29) samples than in the Vietnamese sample (5.49 and 4.05). We will see how that difference in data distribution of the items can affect the reliability of the scale among these comparative samples.
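If such a transformation is used, it can be created once on the pooled dataset so that both groups receive the identical procedure. A hypothetical one-line sketch in STATA:

* Hypothetical example: one square-root transformation created on the pooled
* data, so Vietnamese and Cuban cases are transformed identically.
generate notproud_sqrt = sqrt(notproud)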

PATTERN OF RESPONSES The next step is the examination of the pattern of responses for each item across the samples. This descriptive analysis can provide some preliminary information on how individuals from different cultural groups respond to a specific question or item of a scale or an instrument. Table 4.4 presents the patterns of responses of 10 self-​esteem items across the two samples. The results reveal significant cross-​cultural differences in patterns of item responses, especially in the items “Person of worth,” “Have good qualities,” and “Not much to be proud.” Overall, Cubans seem to report much higher self-​esteem on these items, which may be attributable to actual differences between the two groups. On the other hand, the presence of such differences across the two groups in these items and not others suggests that there may be biases in these particular items.
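A brief sketch of how such response-pattern differences can be screened in STATA, again assuming the hypothetical group indicator described earlier:

* Response percentages for one item by sample, with a chi-square test of
* independence between item response and group membership.
tabulate worth group, column chi2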

INTERNAL CONSISTENCY ANALYSIS This analysis is based on the average correlation of all items on the scale. Higher average correlation among the items suggests higher internal consistency. The Cronbach’s Alpha, or α-​coefficient, is the summary statistic that indicates how well the items of a scale “tie together” in measuring an overall construct or variable. If a researcher develops a measure of self-​esteem that is composed of 10 items, these scale items, by construction, should correlate highly among themselves. This is because the items, collectively, should measure the same construct of self-​esteem. Examining the pattern of “corrected item–​ total correlation” or “item–​rest correlation” can shed light on potential cross-​cultural biases in the reliability of the selected measure. In this example, a STATA-​based reliability analysis was used to illustrate this procedure. The analysis includes the specification of a reliability model for the α, item statistics, and standardized values for the scale and all items.
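A minimal STATA sketch of this reliability analysis, run separately for each sample (the if conditions assume the hypothetical group codes used above):

* Standardized Cronbach's alpha with item-rest correlations and
* alpha-if-item-dropped, first for the Vietnamese sample . . .
alpha worth goodqual failure wellothe notproud possatt ///
    satissel moreresp useless nogood if group == 1, item std
* . . . and then for the Cuban sample.
alpha worth goodqual failure wellothe notproud possatt ///
    satissel moreresp useless nogood if group == 2, item std

The item option produces the item-rest correlations and the alpha-if-dropped values reported later in Table 4.5; std computes the standardized coefficient.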


Table 4.4  Pattern of Item Responses: Vietnamese and Cuban Youth Responses (%)

Item                      Agrees    Agrees    Disagrees  Disagrees
                          a lot     a little  a little   a lot
Person of worth
  Vietnamese              54.42     32.56      7.91       5.12
  Cuban                   71.73     22.44      3.53       2.30
Have good qualities
  Vietnamese              55.35     35.35      7.91       1.40
  Cuban                   73.67     23.32      2.47       0.53
Feel I am a failure
  Vietnamese               6.98     18.14     25.12      49.77
  Cuban                    1.41      7.60     15.90      75.09
Do things as well
  Vietnamese              63.72     27.91      5.12       3.26
  Cuban                   67.31     27.56      4.42       0.71
Not much to be proud
  Vietnamese               9.77     17.67     26.98      45.58
  Cuban                    4.24      6.71     12.54      76.50
Positive attitude self
  Vietnamese              43.26     43.26      8.37       5.12
  Cuban                   60.25     29.68      7.95       2.12
Satisfied with self
  Vietnamese              46.51     34.88     14.42       4.19
  Cuban                   61.31     28.80      8.30       1.59
Wish respect self
  Vietnamese              19.53     30.70     28.37      21.40
  Cuban                   18.20     25.09     17.49      39.22
Feel useless
  Vietnamese              11.16     33.95     34.88      20.00
  Cuban                    9.19     26.68     22.61      41.52
Think no good
  Vietnamese              15.35     26.51     27.44      30.70
  Cuban                    7.77     17.31     16.61      58.30

The most important outcomes that need to be interpreted are the item–rest correlations in STATA, which are labeled corrected item–total correlations in SPSS. This statistic allows the researcher to identify which item of a scale has good or poor correlation with the sum of


all items on the scale. It tells the researcher how well a particular item correlates with all other items in the scale. The results in Table 4.5 reveal that all but two items in the Cuban sample have greater correlations with the overall scale than their counterparts in the Vietnamese sample. Six of the 10 items in the Cuban sample have a correlation greater than or equal to 0.50, compared with three in the Vietnamese sample. As a result, the Cuban sample has a larger Cronbach's α-coefficient (0.84) than the Vietnamese sample (0.77). These results are related to the findings in Tables 4.3 and 4.4. So far, the results tend to indicate that these items have weak cross-cultural equivalence in their psychometric properties: the different values of the item–rest correlation coefficients across the two groups indicate that this measure of self-esteem lacks cross-cultural equivalence in reliability.

Table 4.5  Standardized Item–Rest Correlation, Cronbach's α if Dropped, and Overall Scale Cronbach's α

                           Vietnamese (n = 215)          Cuban (n = 566)
Item                       Item–rest    α if dropped     Item–rest    α if dropped
Person of worth            0.31         0.77             0.40         0.84
Have good qualities        0.55         0.74             0.53         0.82
Feel I am a failure        0.45         0.75             0.59         0.82
Do things as well          0.34         0.77             0.49         0.83
Not much to be proud       0.53         0.74             0.49         0.83
Positive attitude self     0.50         0.74             0.67         0.81
Satisfied with self        0.46         0.75             0.59         0.82
Wish respect self          0.39         0.76             0.41         0.84
Feel useless               0.36         0.76             0.58         0.82
Think no good              0.49         0.75             0.59         0.82
Overall Scale Cronbach's α              0.77                          0.84


EXPLORATORY FACTOR ANALYSIS
Factor analysis has been used to evaluate the validity of research instruments. This statistical method allows researchers to identify the dimensions of a latent construct via the covariance structure of the items used to measure the construct. Factor analysis can be used as a tool to identify the underlying dimensions of a set of variables or items designed to measure a construct such as self-esteem or depression. The method provides information on the relationship between an item and its respective latent dimension (factor) in a unidimensional or multidimensional construct (Child, 1990). There are some basic requirements for the use of factor analysis:
1. Data should be collected randomly.
2. Items should be measured at the interval or ratio level of measurement.
3. Items should have univariate, bivariate, and multivariate normal distributions.
In reality, these assumptions are often ignored. In a subsequent section of this chapter, we discuss the use of a polychoric matrix for confirmatory factor analysis in SEM, which makes it possible to evaluate the validity of a scale with binary items, or with items that are heavily skewed or kurtotic. However, there are some conditions that must be observed carefully, including adequate sample size and a sufficient number of items for each factor. The rule of thumb is that each item should have at least 10 cases or observations: if a scale has 10 items, researchers need a minimum of 100 cases to arrive at meaningful results (Nunnally, 1978; Pedhazur & Schmelkin, 1991). In addition, each factor should have from three to five items, and cannot have fewer than two items (Bentler, 1976; Kim & Mueller, 1978). There is also confusion about the use of principal component analysis (PCA) as a form of EFA. As Costello and Osborne (2005) noted, "PCA is not a true method of factor analysis and there is disagreement among statistical theorists about when it should be used" (pp. 1–2). They suggest that "factor analysis is preferable to principal component analysis" because PCA "is only a data reduction method," whereas "the aim of


factor analysis is to reveal any latent variables that cause the manifest variables to covary" (p. 2). PCA is a commonly used procedure in data reduction, and some version of the technique is generally used to compress file structures for photographs, music files, and other large data files. There are many useful applications of PCA, but the goal of factor analysis in social science research is not data reduction. Rather, it is to examine how the measured items can be understood in their most essential meaning. Through the use of factor analysis, the analyst can generate empirical information indicating how certain items in a scale contribute to the components, or factors, of the construct. For example, in a depression scale, factor analysis might reveal that certain items capture the affective factor of depression, while other items capture the somatic expression of depression. Although many social work researchers are familiar with SPSS for their data analysis and use principal component analysis as the default method of extraction in SPSS, one should consider other methods of factor extraction, including unweighted least squares, generalized least squares, maximum likelihood, principal axis factoring, α-factoring, or image factoring, when evaluating the cross-cultural equivalence of a selected scale or instrument. Fabrigar, Wegener, MacCallum, and Strahan (1999) suggest that maximum likelihood be used when the data approach a normal distribution. This method offers several measures of goodness of fit for the factor model as well as tests of statistical significance. When data are not normally distributed, principal axis factoring can be used. Although there are many factor extraction methods, maximum likelihood and principal axis factoring are considered the best choices because they provide the best solutions (Costello & Osborne, 2005). STATA offers some 14 rotation methods, and principal factoring is its default extraction option when conducting exploratory factor analysis. The data presented in Tables 4.3 and 4.4 reveal that the 10 Rosenberg Self-Esteem Scale items are not normally distributed in either sample. Therefore, principal axis factoring is the best choice for the exploratory phase of the evaluation of cross-cultural equivalence in the factor structure of the 10 items. These 10 items can be considered to have cross-cultural equivalence across the two selected samples


representing cultural, racial, ethnic, and national differences when they meet the following conditions:
1. They have the same number of factors or latent variables across the selected comparative groups.
2. They have the same factor loadings for each factor across the selected comparative groups.
Other assumptions that should be considered are:
1. Use of similar sampling methods.
2. Use of similar data collection methods.
3. Measurement of the items on the same numerical scale.
The results of the EFAs for the 10 Rosenberg Self-Esteem items across the samples of Vietnamese and Cubans are based on the following factor analysis procedures:
1. Factor extraction: principal axis factoring.
2. Rotation method: varimax with Kaiser normalization.
The role of the rotation procedure in factor analysis is to produce a clearer structure of the items. Factor rotation is necessary to improve the interpretability of the factor structure. Once the factors are extracted from the data, one needs to determine the relationships among the factors. Orthogonal rotation produces a set of factors that are independent of each other, whereas oblique rotation produces correlated factors (Thompson, 2004). For example, if a factor analysis were performed on 10 items, with five items designed to measure self-esteem and the other five to measure life satisfaction, an orthogonal rotation would be the best procedure: a rotation that produces a factor model with two independent factors would provide the best conceptual fit, with one factor for self-esteem and the other for life satisfaction. The varimax method is an orthogonal rotation method that minimizes the number of variables that have high loadings on each factor and simplifies the interpretation of the factors. Oblique minimization, or the oblimin method, produces intercorrelated factors. The quartimax method minimizes the number of factors needed to explain each variable.


The equamax method is a combination of the varimax method, which simplifies the factors, and the quartimax method, which simplifies the variables. Promax rotation allows factors to be correlated and is useful for large datasets. Among the five rotation methods listed, the varimax method is considered the most useful here because it produces factors that are uncorrelated (Costello & Osborne, 2005). The results of a cross-cultural equivalence analysis of the factor structure of the 10 Rosenberg Self-Esteem items are presented in Table 4.6. Note that in this table only factor loadings greater than 0.35 are retained; Tabachnick and Fidell (2001) suggested a value of 0.32 as the minimum for a factor loading. In Table 4.6, the Difference column delineates the absolute difference between the corresponding factor loadings of the two groups. For example, the difference in the factor loading of the item "Do things as well" between the Vietnamese and the Cubans is 0.07 (0.57 − 0.50 = 0.07). A difference of 0.05 or greater could be a sign of a potential lack of cross-cultural equivalence.

Table 4.6  Exploratory Factor Analysis for Cross-Cultural Equivalence of Factor Structures

                         Vietnamese (n = 215)          Cuban (n = 566)
                         Positive     Negative         Positive     Negative
Item                     Self-Esteem  Self-Esteem      Self-Esteem  Self-Esteem   Difference
Person of worth          0.48                          0.50                       0.02
Have good qualities      0.65                          0.63                       0.02
Feel I am a failure                   0.48             0.42         0.53          0.05
Do things as well        0.57                          0.50                       0.07
Not much to be proud                  0.55             0.38         0.38          0.17
Positive attitude self   0.61                          0.58         0.47          0.03
Satisfied with self      0.56                          0.53         0.39          0.03
Wish respect self                     0.48                          0.44          0.04
Feel useless                          0.64                          0.74          0.10
Think no good                         0.69                          0.72          0.03


Note that for the Cuban sample, four items (Feel I am a failure, Not much to be proud, Positive attitude self, Satisfied with self) had loadings higher than 0.35 on both factors. This can be problematic for analysts because these items may not be conceptually distinct; in this example, it may mean that these items can be understood as belonging either to the factor for positive self-esteem or, when reversed, to the factor for negative self-esteem. This illustrates an inherent issue in using EFA as a technique: the exploratory factor analytic process is driven by data rather than by theory. Theoretically, the Rosenberg Self-Esteem Scale was designed to capture positive self-esteem with a set of five items and negative self-esteem with another set of five items. In the Vietnamese sample, this works out as intended; in the Cuban sample, however, it is not as clear-cut. In fact, the initial eigenvalues before the varimax rotation supported a two-factor solution for the Vietnamese (Positive Self-Esteem: 2.68; Negative Self-Esteem: 1.01), while for the Cuban sample a one-factor solution was indicated, based on the cutoff of 1 for the eigenvalue (Positive Self-Esteem: 0.53; Negative Self-Esteem: 3.54). However, since theory indicated that the items should line up into two conceptual dimensions, two factors were examined for both samples by stipulating this condition in the analysis, which can be requested in SPSS or STATA.

FACTOR PATTERN COMPARISON
The first level of analysis and interpretation is factor pattern comparison. For a scale or instrument to have cross-cultural equivalence among the comparative groups, it must exhibit the same pattern of factor structure among the groups. Initially, the exploratory factor analysis of the Rosenberg Self-Esteem Scale in Table 4.6 produced a one-factor solution in the Cuban group and a two-factor solution in the Vietnamese group, based on the cutoff of 1.0 as the required eigenvalue. A two-factor solution in the Cuban sample was forced in order to demonstrate the differences in the values of the factor loadings for the two factors across the Vietnamese and Cuban samples. Additionally, as stated above, previous literature on the Rosenberg Self-Esteem Scale indicated that, by design, there were two conceptual dimensions or factors that the scale was meant to capture: positive self-esteem and negative self-esteem. This theoretical frame for the scale also provided a good rationale for


examining the scale as a two-​factor solution for both samples. Results from the comparison of factor patterns suggest that there is theoretical but not statistical equivalence in the factor pattern of the 10 self-​esteem items between Vietnamese and Cubans.

FACTOR LOADING COMPARISON
Once researchers have examined equivalence in the factor pattern of the items, the next step is to compare the equivalence of the factor loadings. The Difference column in Table 4.6 suggests that two items, "Not much to be proud" and "Feel useless," have differences in factor loadings greater than 0.05 between the samples. The item "Not much to be proud" has a substantially larger factor loading in the Vietnamese sample (λ = 0.55) than in the Cuban sample (λ = 0.38), which suggests that this item contributes more to the measurement of negative self-esteem for Vietnamese than for Cuban youth. The item "Feel useless" has a larger factor loading for Cubans (λ = 0.74) than for Vietnamese (λ = 0.64), which suggests that this item reflects negative self-esteem more for Cuban youth.

CONFIRMATORY FACTOR ANALYSIS
Both EFA and CFA can be used to evaluate the factor structure of a set of observed items in terms of their relationships with their respective factors or latent constructs. More specifically, both EFA and CFA focus on the relationship between the observed items of a scale or index and their underlying latent factors or constructs. This relationship indicates how well an abstract concept is measured by its observed items or indicators; therefore, the strength and magnitude of the factor loadings are of primary importance. In EFA, all observed items are assumed to load on all possible factors or latent constructs. In CFA, the researchers specify which observed items should load onto a predetermined number of factors or latent constructs. Additionally, in EFA, factor loadings and structures are not based on any tests of significance. At best, researchers can only eyeball the values of the factor loadings and the factor patterns between the comparative groups and speculate about the degree of equivalence based on results from the data.


Since the writing of the first edition of this book, there has been a proliferation of statistical software packages that allow researchers to perform confirmatory factor analysis in structural equation modeling (SEM), such as AMOS, EQS, LISREL, STATA, and MPLUS. These software packages can be used to test different types of hypotheses concerning the measurement properties of a scale or index, along with hypotheses concerning the cross-cultural equivalence of a research instrument. There has been a general consensus that multigroup or multisample CFA provides a strong approach for evaluating the cross-cultural equivalence of measurement properties between different population groups (He, Merz, & Alden, 2008; Steenkamp & Baumgartner, 1998). Researchers have proposed different procedural steps in the testing of measurement equivalence hypotheses, including the equivalence of the covariance matrices of the observed indicators and the equivalence of factor means among groups (Vandenberg & Lance, 2000). Joreskog and Sorbom (2001) suggested that measurement equivalence can be assessed across groups by testing the invariance of factor structures with five general hypotheses:
Ho. A. Testing the hypothesis of equivalence of covariance matrices of the observed indicators of a scale or instrument between comparative groups.
Ho. B. Testing the hypothesis of equivalence of factor patterns of the observed indicators across groups.
Ho. C. Testing the hypothesis of equivalence of factor loadings of the observed indicators on their respective factors across groups.
Ho. D. Testing the hypothesis of equivalence of measurement errors of the observed indicators across groups.
Ho. E. Testing the hypothesis of equivalence of factor variances and covariances across groups.
It is rare that one can find absolute equivalence of all measurement properties across comparative groups. Steenkamp and Baumgartner (1998) suggested that invariance, or statistical similarity, in factor patterns and factor loadings is sufficient to determine whether a construct can be measured across different cultural, national, or racial groups. However, if the purpose is to compare the latent means of a construct across groups, then one needs to establish


both metric equivalence (same factor loadings) and scalar invariance (same item intercepts). This chapter emphasizes the importance of equivalence of factor patterns and factor loadings as the fundamental steps in the construction of a cross-cultural research instrument (Hoelter, 1983). Under some theoretical assumptions, equivalence in the factor pattern and factor loadings may be sufficient for comparing relationships among the variables. For example, if a researcher plans to investigate whether social support is associated with life satisfaction in two elderly populations of different races, it is important for the researcher to use a social support scale and a life satisfaction scale that have similar factor patterns and factor loadings between the populations of interest.

BASIC COMPONENTS OF A CONFIRMATORY FACTOR ANALYSIS (CFA) MODEL
There are some basic symbols and notations that are useful for understanding how to perform and interpret a CFA. Explanations and interpretations of these symbols and notations are discussed throughout the examples. There are two measurement models, as illustrated in the following equations.
Measurement model for observed X variables: X = Λxξ + δ
Measurement model for observed Y variables: Y = Λyη + ε
In these equations, X and Y are observed variables of selected latent constructs. Λx (Lambda X) and Λy (Lambda Y) are the factor loadings of the X and Y variables on their respective latent variables, which represent how strongly they are explained by the latent factor. ξ (Ksi) and η (Eta) are the latent variables of X and Y, and are the independent variables in the equations. δ (Delta) and ε (Epsilon) are the measurement errors of the X and Y variables, which represent how much inaccuracy there may be in how well the observed items are explained by the latent variables or factors.


ASSUMPTIONS
There are three basic assumptions for a measurement model. These assumptions are essentially identical to those found in ordinary least squares (OLS) regression analysis.
1. The measurement errors (δ) of the observed variables X are uncorrelated with the latent variable (ξ).
2. The measurement errors (ε) of the observed variables Y are uncorrelated with the latent variable (η).
3. The measurement errors (δ) of the observed variables X are uncorrelated with the measurement errors (ε) of the observed variables Y.
Let's assume that a researcher hypothesizes that self-esteem is a risk factor for depression among immigrant youth. Self-esteem is measured by a scale of four items and depression by a scale of three items. The following illustrates the causal relationships between self-esteem and depression and the respective notations expressed in the aforementioned equations.
ξ1 is the latent variable of self-esteem.
X1 to X4 are the four observed items of self-esteem.
λ1 to λ4 are the relationships (factor loadings) between self-esteem and its four observed indicators.
δ1 to δ4 are the measurement errors of the observed items X.
η1 is the latent variable of depression.
Y1 to Y3 are the three observed items of depression.
λ1 to λ3 are the relationships (factor loadings) between depression and its three observed indicators.
ε1 to ε3 are the measurement errors of the observed items Y.
γ11 (gamma) is the path coefficient of depression on self-esteem.
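For readers who want to see this hypothetical model in software form, the following sketch uses STATA's sem command rather than LISREL matrix notation; the item names x1 to x4 and y1 to y3 and the factor labels are all illustrative and do not correspond to the CILS data:

* Hypothetical model: self-esteem (four indicators) as a predictor of
* depression (three indicators).
sem (SelfEsteem -> x1 x2 x3 x4)   /// measurement model for ksi (X items)
    (Depression -> y1 y2 y3)      /// measurement model for eta (Y items)
    (Depression <- SelfEsteem)    //  structural path, gamma(1,1)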

PREPARING INPUT DATA FOR LISREL ANALYSIS Correlations and covariances of the observed variables are commonly used as input data for LISREL. There are different ways to generate correlation and covariance matrices of observed variables for LISREL


analysis. LISREL's companion software PRELIS can be used to generate different types of correlations and covariances. Specifically, one can use PRELIS to generate the following matrices for analysis:
CM (covariance matrix)
MM (moment matrix)
AM (augmented moment matrix)
KM (correlation matrix of Pearson correlations)
PM (polychoric correlation matrix)
OM (canonical correlation matrix of optimal scores)

Researchers often use either a Pearson correlation matrix or a covariance matrix as input data for LISREL analysis. The polychoric matrix works well for data with dichotomous items, or in cases where the data are highly skewed or kurtotic (Joreskog & Moustaki, 2001).

PREPARING DATA VIA SPSS OR STATA
There are a few rules that data analysts should always follow in preparing data for a cross-cultural analysis.
1. Reviewing and Verifying Data. Data analysts should begin the process of data analysis with a careful review and verification of the coding of the measurement items across groups. All items of a scale or index should be coded the same across the comparative groups. Using the DESCRIPTIVES procedure in SPSS, or SUMMARIZE and TABULATE in STATA, one can generate descriptive statistics for each item of a scale or instrument.
2. Missing Data. Analysts should check for missing data in each item. Missing data in any item of a scale will reduce the sample size. If missing data are a concern, analysts can use the methods of missing data treatment suggested in the literature (Kim & Curry, 1977; Little & Rubin, 1987; McDonald, Thurton, & Nelson, 2000). Due to the complexities of this issue, the treatment of missing data is beyond the scope of this book.

61

62

62

Developing Cross-Cultural Measurement in Social Work Research and Evaluation

3. Direction of Correlations. Correlations among the items of a scale should be positive; all items should correlate in the same direction. If there is a negative correlation between any items, analysts should verify the coding of those items to correct the problem.
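A short STATA sketch of this screening step; the reverse-coded variable is hypothetical and simply illustrates aligning item direction on a 4-point response scale:

* Inter-item correlations with significance levels.
pwcorr worth goodqual failure wellothe notproud possatt ///
    satissel moreresp useless nogood, sig
* Hypothetical reverse coding so a negatively keyed item runs in the same
* direction as the other items (responses coded 1 to 4).
generate failure_r = 5 - failure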

PERFORMING EXPLORATORY FACTOR ANALYSIS (EFA)
Once the data are checked, cleaned, and verified, analysts can perform an EFA to identify the underlying dimensions or factors of the observed items. Examples of EFA were provided in the previous section. In using SPSS or STATA for factor analysis, analysts should generate coefficients and significance levels for the correlation matrix. This information is useful for the initial screening of the quality of the scale items. Items that have poor correlation should be excluded from the analysis. Analysts should select PRINCIPAL AXIS FACTORING as the preferred method of extraction, or MAXIMUM LIKELIHOOD if the sample is large. For purposes of rotating the extracted factors, VARIMAX is likely the best choice because this procedure generates factors underlying the observed items that are orthogonal, that is, that represent independent factors or dimensions. This is important because the dimensions of a scale are most easily interpreted when they are conceptually distinct. The key result that analysts should pay attention to is the ROTATED FACTOR MATRIX. Factor loadings should range between 0.30 and 0.95. If a loading is too low, the extracted factor or dimension explains too little of the item. If a loading is too high, the extracted factor accounts for almost all of the item, which means there is effectively no difference between the factor and the observed item, making the factor analytic process redundant. Naming and defining the factors generated from EFA can sometimes be difficult and confusing. Reviewing the related literature can provide researchers useful insight for naming and defining the factors underlying the observed items. For example, scales are sometimes designed to be congeneric (single factor) or to have two or more dimensions or factors. In addition, consulting with experts in the field will also help researchers interpret the meanings of the factors.
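The corresponding STATA commands, sketched under the same assumptions about item names and the hypothetical group indicator; note that STATA's pf option requests its principal-factor method, the usual route to principal axis factoring:

* Principal-factor extraction for the Vietnamese sample, retaining two factors
* (substitute the ml option for maximum likelihood extraction).
factor worth goodqual failure wellothe notproud possatt ///
    satissel moreresp useless nogood if group == 1, pf factors(2)
* Orthogonal varimax rotation with Kaiser normalization.
rotate, varimax normalize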


GENERATING CORRELATION AND COVARIANCE MATRICES
Once the dimensions or factor structures are identified in each comparative group, analysts can begin to generate data for CFA via LISREL. We recommend that analysts generate the appropriate matrix or matrices in PRELIS and save each one as a separate file, which can then be called up using syntax in LISREL for use in the appropriate analysis. To aid analysts in generating the appropriate matrix of correlations, the following is a step-by-step guide to this process. Open LISREL 9.0.
1. Go to File ➔ Import Data.
2. Browse and locate the SPSS file (.sav) or STATA file (.dta) on your drive. The program defaults to the LISREL folder.
3. Locate the file, and make sure you click on "Files of type" and change it to "SPSS Data File (*.sav)" or "STATA Data File (*.dta)". Open your data.
4. Go to Data ➔ Define Variables. You can clean up many of your variables this way.
5. Go to Statistics ➔ Output Options. Click on "Save to file:" and enter a name for the covariance matrix you would like to generate.
6. Navigate to the folder where you saved your data to locate the PRELIS data file you previously saved.
7. In this same folder, you will find your saved covariance matrix. You can call this file by name directly from your LISREL syntax script, using the following syntax: matrix from file filename

Generating a Polychoric and Asymptotic Covariance Matrix
1. To generate a polychoric correlation matrix, go to Statistics ➔ Output Options.
2. Choose "Correlations" in the Moment Matrix, click on "Save to file:", and enter a name for the polychoric correlation matrix you would like to generate. Click on "Save to file" for the Asymptotic Covariance Matrix to save this matrix as well, and to inform PRELIS that you are generating a polychoric matrix.


3. The polychoric correlation matrix can then be called up in LISREL syntax, as follows:
Correlation matrix from file polychoric
Asymptotic covariance matrix from file asymptotic
In the above example, polychoric represents the name of the file you used for your polychoric matrix, and asymptotic represents the name of the file you used for your asymptotic covariance matrix.

LISREL CONFIRMATORY FACTOR ANALYSIS
As mentioned earlier, 10 items of the Rosenberg Self-Esteem Scale were selected from the Children of Immigrants Longitudinal Study (CILS) dataset for examples of LISREL CFA and cross-cultural equivalence analyses of research instruments. In the previous section, EFA analyses were performed on these 10 self-esteem items for both Vietnamese and Cuban youths using the CILS data. Polychoric matrices are used as input data for the following LISREL CFAs. In this example, the polychoric matrices are generated in PRELIS and then used in the LISREL syntax-only file. LISREL 9 (Joreskog & Sorbom, 2004) is used for this analysis. The CFA model has the following information:
1. The measurement model has two latent constructs (factors): positive self-esteem (PosSE) and negative self-esteem (NegSE).
2. The negative self-esteem construct has five items, and the positive self-esteem construct has five items, as indicated in the previous EFAs. The observed indicators of the two latent constructs are presented in the squared boxes. The arrows from the latent constructs to their respective observed items indicate that they are "reflective indicators" (Edwards & Bagozzi, 2000; Bollen, 1989). Specifically, the latent constructs are the causes of the observed indicators. The factor loadings indicate the causal relationships between a latent construct and its respective observed indicators or items.


3. Each observed indicator in the model also has an error term. The error terms are causes of the observed indicators. If the latent constructs had perfect causal relationships with their respective observed items, the error terms would be zero.
The following LISREL 9 Command File 1 is a simple CFA using a polychoric matrix.
Command File 1.
Title SEViet 2F
Observed Variables worth goodqual failure wellothe notproud possatt satissel moreresp useless nogood
Correlation matrix from file vietpoly
Asymptotic covariance matrix from file vietasym
Sample size 215
Latent Variables PosSE NegSE
Relationships
worth goodqual wellothe possatt satissel = PosSE
failure notproud moreresp useless nogood = NegSE
Number of Decimals = 3
Print Residuals
Path Diagram
End of Problem

In Command File 1, the first line is the title, “SEViet 2F.” Analysts are free to use titles that are easy to remember and related to the goal of an analysis. The title of this syntax is named this way to show that this is an analysis of Self-​Esteem for the Vietnamese sample, specifying two factors. The next section of Command File 1 is the DATA specifications of the analysis. The syntax identifies the Observed Variables as the 10 items in the scale, “worth goodqual failure wellothe notproud possatt satissel moreresp useless nogood.” Note also that in this example, the file “vietpoly” was identified as the correlation matrix, in this case the polychoric matrix, and the file “vietasym” was identified as the


asymptotic covariance matrix. The sample size was identified as 215 in this analysis. The next section is the specification of the latent variables and the relationships of the observed variables to these latent constructs. Again, the labels should be meaningful and related to the contents of the items. In this example, the latent variables are named PosSE, to mean positive self-esteem, and NegSE, to mean negative self-esteem. The items "worth goodqual wellothe possatt satissel" are assigned to positive self-esteem, while the items "failure notproud moreresp useless nogood" are assigned to negative self-esteem. The next section is the output section. Using this SIMPLIS syntax, LISREL will produce the standard errors for the estimates (SEs), the completely standardized solutions for the estimates (SCs), and the t-values as tests of statistical significance for the estimates (TVs). Analysts can change the number of decimals for the results, and the command "Path Diagram" requests that LISREL produce a path diagram graphic, with parameter estimates and significances on all the paths. The LISREL CFA command file for the sample of Cuban youths is the same as the command file for the Vietnamese sample, with the exception of the title line and the sample size (NO).
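As a point of comparison only, a broadly similar two-factor CFA can be sketched with STATA's sem command; unlike the LISREL run above, this sketch treats the items as continuous rather than using a polychoric matrix, so its estimates would not match exactly:

* Two-factor CFA of the 10 self-esteem items for the Vietnamese sample.
sem (PosSE -> worth goodqual wellothe possatt satissel) ///
    (NegSE -> failure notproud moreresp useless nogood) ///
    if group == 1, standardized
* Overall goodness-of-fit statistics (chi-square, RMSEA, CFI, and others).
estat gof, stats(all)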

RUNNING LISREL WINDOW APPLICATION Command File 1 can be entered directly to the LISREL Window Application. Once the LISREL program is activated, analysts can select the FILE menu and choose the NEW command that will appear on the screen, and begin to enter the command file. Once the typing is completed, analysts should save it, then go to the RUN menu to run the analysis. There are two RUN icons, one for LISREL and one for PRELIS. For all examples in this book, Run LISREL is the choice. Unless the specified model has very poor fit, the program will execute the input file quickly, and the results are presented on the computer screen in a matter of seconds. By including the “Path Diagram” syntax, LISREL will produce a path diagram with parameter estimates for all relationships. The program reports the original input program, the sample size, and the correlation matrix used for the estimation model (See

Figure 4.1  Path Diagram and Results of a Confirmatory Factor Analysis of Two Dimensions of Self-Esteem. (Chi-Square = 67.73, df = 34, P-value = 0.00051, RMSEA = 0.068)

Appendix 4.1). The following is a sample of the results from the LISREL analyses from the path diagram in Figure 4.1. These full results (see Appendix 4.1) include three important types of information: the factor loadings, or causal relationships between the latent constructs and their respective observed items; the variances and covariances of the latent constructs; and the measurement errors of the observed items. The command file produces separate "Measurement Equations," which report the unstandardized factor loadings, the SEs enclosed in parentheses, and the t-values below them as tests of statistical significance for the estimates. One can test the hypothesis that a particular observed item has a statistically significant relationship with its respective latent construct. Let's look at the factor loading of the item labeled worth ("I am a person of worth") on its latent construct, PosSE, or positive self-esteem (see Table 4.7). This item has an unstandardized factor loading of 0.587, an SE of 0.0768, and a t-value of 7.639. The t-value is the ratio of the factor loading to its SE (0.587/0.0768 ≈ 7.64); the hand-computed value differs slightly from the output because of rounding. The


Table 4.7  Measurement Equations and Estimates for the Vietnamese Sample

worth    = 0.587*PosSE,  Errorvar. = 0.655,  R² = 0.345
           (0.0768)                  (0.113)
            7.639                     5.791
goodqual = 0.817*PosSE,  Errorvar. = 0.333,  R² = 0.667
           (0.0530)                  (0.110)
           15.407                     3.013
failure  = 0.674*NegSE,  Errorvar. = 0.546,  R² = 0.454
           (0.0691)                  (0.116)
            9.749                     4.724
wellothe = 0.652*PosSE,  Errorvar. = 0.575,  R² = 0.425
           (0.0665)                  (0.110)
            9.806                     5.204
notproud = 0.722*NegSE,  Errorvar. = 0.478,  R² = 0.522
           (0.0597)                  (0.110)
           12.097                     4.345
possatt  = 0.707*PosSE,  Errorvar. = 0.500,  R² = 0.500
           (0.0613)                  (0.110)
           11.544                     4.531
satissel = 0.677*PosSE,  Errorvar. = 0.541,  R² = 0.459
           (0.0642)                  (0.111)
           10.545                     4.892
moreresp = 0.582*NegSE,  Errorvar. = 0.662,  R² = 0.338
           (0.0656)                  (0.102)
            8.860                     6.457
useless  = 0.614*NegSE,  Errorvar. = 0.624,  R² = 0.376
           (0.0708)                  (0.111)
            8.664                     5.641
nogood   = 0.748*NegSE,  Errorvar. = 0.440,  R² = 0.560
           (0.0542)                  (0.106)
           13.814                     4.150

LISREL output does not produce an exact level of significance for the estimates. However, by convention, a t-value of 2.00 or greater is significant at the 0.05 level or smaller. The results for Errorvar., the measurement errors of the observed items, are interpreted similarly. It is worth noting that there is now a plethora of online sources available for generating p-values from t-values and degrees of freedom, if that is preferable.
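For example, a two-tailed p-value for a reported t-value can be computed directly in STATA, treating the large-sample t-value as approximately standard normal:

* Two-tailed p-value for the t-value of 7.639 reported above.
display 2*(1 - normal(7.639))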


GOODNESS-OF-FIT STATISTICS
Although LISREL offers more than three dozen measures of fit, only a few have meaningful application for interpreting how well a model fits the data. Table 4.8 presents a few useful goodness-of-fit measures that are frequently reported in the literature. The chi-square fit function test is probably the most frequently reported statistic in structural equation modeling. This collection of tests examines the overall fit of the proposed model against the saturated model. All chi-square function tests used in SEM analysis should have a probability value greater than 0.05 for the model to fit the data well. In this example, three chi-square fit function tests are presented: (1) the Minimum Fit Function Chi-Square, (2) the Normal Theory Weighted Least Squares Chi-Square, and (3) the Satorra-Bentler Scaled Chi-Square. The Minimum Fit Function Chi-Square is best for data that have normally distributed errors, while the Normal Theory Weighted Least Squares Chi-Square and the Satorra-Bentler Scaled Chi-Square are better statistical tests for non-normal data.

Table 4.8  Select Goodness-of-Fit Statistics for the Vietnamese Sample
Degrees of Freedom = 34
Minimum Fit Function Chi-Square = 148.965 (P = 0.00)
Normal Theory Weighted Least Squares Chi-Square = 148.242 (P = 0.00)
Satorra-Bentler Scaled Chi-Square = 67.735 (P = 0.000510)
Root Mean Square Error of Approximation (RMSEA) = 0.0681
90% Confidence Interval for RMSEA = (0.0440; 0.0917)
P-Value for Test of Close Fit (RMSEA < 0.05) = […]

[…] (p > .05) for the analysts to draw a conclusion that there is cross-cultural equivalence in all elements of the factor structures of a scale (p. 102). All other measures of goodness-of-fit statistics are for information only (Joreskog & Sorbom, 2001). The results confirm that Ho. A, "Four selected items of the Rosenberg Self-Esteem Scale exhibit an equivalence of covariance structures between Vietnamese and Cuban youth," is supported, because the probability value associated with the chi-square is greater than 0.05 (Satorra-Bentler Scaled Chi-Square = 12.27, P = 0.27). No further tests of equivalence are needed in this case, so in this example there was no need to test the subsequent equivalence hypotheses. However, the following are sample syntax files that can be used to test them; Tran (2009) provided several examples testing all five hypotheses of measurement equivalence.

Ho. B. The factor patterns of the observed items are statistically equivalent between groups.

EQUIVALENCE OF FACTOR PATTERN: H0 B
Group 1: Cuban
DA NI = 4 NO = 583 NG = 2 MA = PM
LA
wellothe notproud moreresp nogood
PM FI = cuban4poly
AC FI = cuban4asym
MO NX = 4 NK = 1 TD = SY
LK
Selfesteem
FR LX (2, 1) LX (3, 1) LX (4, 1)
VA 1 LX 1 1
Group 2: Vietnamese
DA NI = 4 NO = 215 MA = PM
LA
wellothe notproud moreresp nogood
PM FI = viet4poly
AC FI = viet4asym
MO LX = PS
PD
OUT

Ho. C. The factor loadings of the observed items are statistically equivalent between groups.

EQUIVALENCE OF FACTOR LOADINGS: H0 C
Group 1: Cuban
DA NI = 4 NO = 583 NG = 2 MA = PM
LA
wellothe notproud moreresp nogood
PM FI = cuban4poly
AC FI = cuban4asym
MO NX = 4 NK = 1 TD = SY
LK
Selfesteem
FR LX (2, 1) LX (3, 1) LX (4, 1)
VA 1 LX 1 1
Group 2: Vietnamese
DA NI = 4 NO = 215 MA = PM
LA
wellothe notproud moreresp nogood
PM FI = viet4poly
AC FI = viet4asym
MO LX = IN
PD
OUT

Ho. D. The measurement errors of the observed items are statistically equivalent between groups.

EQUIVALENCE OF MEASUREMENT ERRORS: H0 D
Group 1: Cuban
DA NI = 4 NO = 583 NG = 2 MA = PM
LA
wellothe notproud moreresp nogood
PM FI = cuban4poly
AC FI = cuban4asym
MO NX = 4 NK = 1 TD = SY
LK
Selfesteem
FR LX (2, 1) LX (3, 1) LX (4, 1)
VA 1 LX 1 1
Group 2: Vietnamese
DA NI = 4 NO = 215 MA = PM
LA
wellothe notproud moreresp nogood
PM FI = viet4poly
AC FI = viet4asym
MO LX = IN TD = IN
PD
OUT

Ho. E. The variances and covariances of the latent constructs or factors are statistically equivalent between groups.

EQUIVALENCE OF FACTOR VARIANCES AND COVARIANCES: H0 E
Group 1: Cuban
DA NI = 4 NO = 583 NG = 2 MA = PM
LA
wellothe notproud moreresp nogood
PM FI = cuban4poly
AC FI = cuban4asym
MO NX = 4 NK = 1 TD = SY
LK
Selfesteem
FR LX (2, 1) LX (3, 1) LX (4, 1)
VA 1 LX 1 1
Group 2: Vietnamese
DA NI = 4 NO = 215 MA = PM
LA
wellothe notproud moreresp nogood
PM FI = viet4poly
AC FI = viet4asym
MO LX = IN TD = IN PH = IN
PD
OUT
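The same family of invariance hypotheses can also be sketched in STATA's multigroup sem, shown here only as a rough analogue of H0 C under the chapter's variable names, a continuous-item treatment, and the hypothetical group indicator:

* One-factor model for the four selected items, estimated in both groups with
* the factor loadings (measurement coefficients) constrained to be equal.
sem (Selfesteem -> wellothe notproud moreresp nogood), ///
    group(group) ginvariant(mcoef)
* Score tests of group invariance for each class of parameters.
estat ginvariance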

REVISING THE SPECIFICATIONS OF THE CONFIRMATORY FACTOR ANALYSIS MODEL
It should be noted that any set of observed items can have more than one measurement model, as long as each latent factor has a minimum of two observed items (Steenkamp & Baumgartner, 1998). These 10 items of the Rosenberg Self-Esteem Scale can therefore have no more than five latent factors. In revising the measurement specifications of the observed items, researchers should be guided by theory, clinical observations, and the pattern of the inter-item correlations among the observed variables. Because the scale was theorized to capture positive self-esteem and negative self-esteem, generating more than two factors does not make theoretical sense. This chapter offers several hands-on examples of using SEM CFA and multisample analyses of factor structures as tools to evaluate cross-cultural equivalence in the measurement properties of research instruments, whether they are newly developed or translated from a source language. These tools should be used to examine all scale instruments to validate their use when comparing different populations. In Chapter 5, we address the basic concepts of Item Response Theory (IRT) approaches and their applications in cross-cultural scale development and psychometric evaluation. We present our results through a demonstration of the IRTPRO® software for IRT (Cai, Thissen, & du Toit, 2011).


APPENDIX 4.1: DETAILED RESULTS FROM THE PATH DIAGRAM PRESENTED IN FIGURE 4.1
Sample Size = 215

Self-Esteem Correlation Matrix

            Worth   Goodqual Failure Wellothe Notproud Possatt
worth       1.000
goodqual    0.570   1.000
failure     0.169   0.293   1.000
wellothe    0.356   0.587   0.275   1.000
notproud    0.341   0.424   0.530   0.241   1.000
possatt     0.335   0.512   0.311   0.484   0.365   1.000
satissel    0.364   0.506   0.297   0.378   0.321   0.618
moreresp    0.187   0.259   0.351   0.140   0.508   0.175
useless     0.053   0.270   0.396  –0.050   0.340   0.201
nogood      0.087   0.284   0.487   0.157   0.486   0.356

Correlation Matrix

            Satissel Moreresp Useless Nogood
satissel    1.000
moreresp    0.215    1.000
useless     0.175    0.332   1.000
nogood      0.311    0.409   0.613   1.000

SEViet 2F
Number of Iterations = 10
Note that the program will state how many iterations it took for the model to converge. This is important because, generally speaking, a better model will converge more quickly.


LISREL Estimates (Robust Maximum Likelihood)
Measurement Equations

worth    = 0.587*PosSE,  Errorvar. = 0.655,  R² = 0.345
           (0.0768)                  (0.113)
            7.639                     5.791
goodqual = 0.817*PosSE,  Errorvar. = 0.333,  R² = 0.667
           (0.0530)                  (0.110)
           15.407                     3.013
failure  = 0.674*NegSE,  Errorvar. = 0.546,  R² = 0.454
           (0.0691)                  (0.116)
            9.749                     4.724
wellothe = 0.652*PosSE,  Errorvar. = 0.575,  R² = 0.425
           (0.0665)                  (0.110)
            9.806                     5.204
notproud = 0.722*NegSE,  Errorvar. = 0.478,  R² = 0.522
           (0.0597)                  (0.110)
           12.097                     4.345
possatt  = 0.707*PosSE,  Errorvar. = 0.500,  R² = 0.500
           (0.0613)                  (0.110)
           11.544                     4.531
satissel = 0.677*PosSE,  Errorvar. = 0.541,  R² = 0.459
           (0.0642)                  (0.111)
           10.545                     4.892
moreresp = 0.582*NegSE,  Errorvar. = 0.662,  R² = 0.338
           (0.0656)                  (0.102)
            8.860                     6.457
useless  = 0.614*NegSE,  Errorvar. = 0.624,  R² = 0.376
           (0.0708)                  (0.111)
            8.664                     5.641
nogood   = 0.748*NegSE,  Errorvar. = 0.440,  R² = 0.560
           (0.0542)                  (0.106)
           13.814                     4.150

Correlation Matrix of Independent Variables

         PosSE    NegSE
PosSE    1.000
NegSE    0.538    1.000
        (0.083)
         6.489


SEViet 2F
Fitted Covariance Matrix

            Worth   Goodqual Failure Wellothe Notproud Possatt
worth       1.000
goodqual    0.480   1.000
failure     0.213   0.296   1.000
wellothe    0.383   0.533   0.237   1.000
notproud    0.228   0.318   0.487   0.254   1.000
possatt     0.415   0.578   0.256   0.461   0.275   1.000
satissel    0.398   0.553   0.246   0.442   0.263   0.479
moreresp    0.184   0.256   0.392   0.204   0.420   0.221
useless     0.194   0.270   0.413   0.215   0.443   0.233
nogood      0.236   0.329   0.504   0.263   0.541   0.285

Fitted Covariance Matrix

            Satissel Moreresp Useless Nogood
satissel    1.000
moreresp    0.212    1.000
useless     0.224    0.357   1.000
nogood      0.273    0.435   0.459   1.000

Fitted Residuals

            Worth   Goodqual Failure Wellothe Notproud Possatt
worth       0.000
goodqual    0.090   0.000
failure    –0.044  –0.003   0.000
wellothe   –0.026   0.054   0.039   0.000
notproud    0.113   0.107   0.044  –0.012   0.000
possatt    –0.080  –0.066   0.055   0.023   0.090   0.000
satissel   –0.034  –0.048   0.051  –0.063   0.058   0.139
moreresp    0.004   0.003  –0.041  –0.064   0.088  –0.047
useless    –0.140   0.000  –0.018  –0.265  –0.103  –0.033
nogood     –0.149  –0.045  –0.017  –0.105  –0.055   0.071

Fitted Residuals

            Satissel Moreresp Useless Nogood
satissel    0.000
moreresp    0.003    0.000
useless    –0.049   –0.025   0.000
nogood      0.038   –0.026   0.153   0.000

Summary Statistics for Fitted Residuals
Smallest Fitted Residual = –0.265
Median Fitted Residual = 0.000
Largest Fitted Residual = 0.153

Stemleaf Plot
–2|7
–2|
–1|5
–1|410
–0|87665555
–0|44433332221000000000000000
 0|2444
 0|55567999
 1|114
 1|5

Standardized Residuals

            Worth   Goodqual Failure Wellothe Notproud Possatt
worth        —
goodqual    4.535    —
failure    –0.606  –0.049    —
wellothe   –0.433    —      0.520    —
notproud    1.814   2.373   3.808  –0.164    —
possatt    –1.503  –5.548   0.807   0.489   1.449    —
satissel   –0.807    —      0.860  –1.237   0.958    —
moreresp    0.051   0.059  –1.018  –0.901   7.214  –0.624
useless    –1.882   0.003  –0.459  –3.017  –3.023  –0.432
nogood     –2.153  –0.849    —     –1.401    —      1.111


Standardized Residuals

            Satissel Moreresp Useless Nogood
satissel     —
moreresp    0.041    —
useless    –0.694  –0.645    —
nogood      0.639  –1.085    —       —

Summary Statistics for Standardized Residuals
Smallest Standardized Residual = –5.548
Median Standardized Residual = 0.000
Largest Standardized Residual = 7.214

Stemleaf Plot
–4|5
–2|002
–0|954210988766654420000000000000000000
 0|11556890148
 2|48
 4|5
 6|2

Largest Negative Standardized Residuals
Residual for possatt and goodqual   –5.548
Residual for useless and wellothe   –3.017
Residual for useless and notproud   –3.023
Largest Positive Standardized Residuals
Residual for goodqual and worth      4.535
Residual for notproud and failure    3.808
Residual for moreresp and notproud   7.214

The Modification Indices Suggest to Add the Path
To          From    Decrease in Chi-Square   New Estimate
notproud    PosSE   8.8                       0.23

The Modification Indices Suggest to Add an Error Covariance
Between     And       Decrease in Chi-Square   New Estimate
notproud    worth     9.0                       0.13
possatt     worth     9.2                      –0.18
satissel    wellothe  8.2                      –0.19
satissel    possatt   84.3                      0.77
moreresp    notproud  15.9                      0.27
useless     wellothe  20.0                     –0.21
useless     notproud  32.2                     –0.43
nogood      worth     9.4                      –0.13
nogood      useless   168.6                     1.41

Time used: 0.031 Seconds


5

Item Response Theory in Cross-​Cultural Measurement

The aims of this chapter are to provide a brief overview of IRT, demonstrate basic features of IRT using existing data, and walk the reader through key steps in conducting IRT analysis using IRTPRO®. The reader should walk away with a broad understanding of the central concepts of IRT, the ability to interpret data generated from IRT analysis to inform tool development and refinement, and the skills to run a basic IRT analysis. For students, researchers, and clinicians who wish to gain further technical details, several existing resources are available (Hambleton, Swaminathan, & Rogers, 1991; Holland & Wainer, 1993; van der Linden & Hambleton, 1997).

INTRODUCTION
Since the publication of classic measurement textbooks by Gulliksen (1950) and Nunnally (1967), Classical Test Theory (CTT) has enjoyed widespread use and is the basis for most of the measures used in


healthcare and social work research. However, a quiet movement occurred when Lord and Novick (1968) introduced an alternative model-based theory called Item Response Theory (IRT) that was not only more plausible, but also had the potential to solve practical measurement problems not possible through CTT (Embretson, 1996; Gilles et al., 2005; Hambleton, Swaminathan, & Rogers, 1991). From this movement, considerable advancements in measurement occurred, such as the use of computer adaptive testing (CAT), mainly in the field of education by highly trained psychometricians. IRT has also increasingly been used to develop, shorten, and refine psychosocial instruments (Hibbard, Stockard, Mahoney, & Tusler, 2004; Nguyen, Paasche-Orlow, Kim, Han, & Chan, 2015). The use of IRT applications among a wider audience, including healthcare researchers, has been hampered by the complex mathematics used in its modeling as well as by the "clunky" software programs used to run the analysis (Nunnally & Bernstein, 1994). However, several recent developments have led to a reinvigorated interest in IRT. First, in response to lengthy measures that contribute to participant burden and to the challenge of comparing results across studies that use different measures, the National Institutes of Health (NIH) has engaged in an ambitious multi-institute roadmap initiative to re-engineer the clinical infrastructure (Cella et al., 2007). The Patient-Reported Outcomes Measurement Information System (PROMIS) initiative aims to build and validate item banks that measure key symptoms and health concepts applicable to a range of health conditions. These item banks have been used in CAT applications, and the results are promising (Cella et al., 2010; Matseliukh, Polishchuk, Lutchenko, Bambura, & Kopiyko, 2005). A second critical factor that will likely lead to increased use of IRT is the improvement in software programs and technology. Statistical software programs with a Windows®-based interface, such as IRTPRO®, are now available and have made this kind of data analysis accessible to a wider audience. Lastly, the ubiquity of mobile devices such as smartphones and tablet PCs will likely facilitate data collection, given the need for computers in CAT applications. As a result, a new generation of patient-reported health outcome measures has been, and will continue to be, developed based on the principles of IRT. The time for IRT has therefore come, and with it the need to learn its vocabulary.


BROAD OVERVIEW OF IRT IRT is a modern measurement theory that, as its name implies, focuses mainly at the item level as opposed to the test level. The underlying principle of IRT is that a relationship exists between an individual’s ability and how they respond to items on a test. Frederic Lord, the father of modern measurement, noted that an individual’s ability score is not equivalent to the concept of observed scores used in CTT (Lord, 1953). In particular, he asserts that an individual’s ability score is test independent, as opposed to observed scores which are test dependent. His main premise is that when individuals answer items on a test, they have an underlying ability level in relation to the concept being measured, and this ability is test independent. However, an individual’s observed score will always depend on the particular selection of items from the universe or domain of possible items that are used to assess a concept. For example, if an individual takes an easy “self-​esteem” test, their observed score will likely be higher than if they took a hard “self-​esteem” test. Regardless of whichever test they take, though, Lord argues that their underlying “self-​esteem” remains constant. That basic example helped him demonstrate that there is something more fundamental operating than the concept of observed score. This led him and others to develop psychometric theories and models that could be used to guide the measurement of individuals’ ability scores that are independent of the particular choice of items used on a test, and is the birthplace of IRT. IRT, more specifically, is a general statistical theory that uses mathematical models to describe the relationship between an individual’s ability and how he or she responds to an item. In IRT, this relationship can be described by the item characteristic curve (ICC). In situations where the response options are binary (i.e., correct/​incorrect), the 2 parameter logistic (2PL) model is often used. The 2PL model yields a smooth S-​shaped ICC for each item on a test, which can be interpreted by its respective location (b) and slope (a) parameters (Figure 5.1). Understanding the ICC provides the basic building block for IRT (Baker, 2001). The location (b) parameter, also referred to as the difficulty parameter in various texts, is used to describe the relative “difficulty” or location of an item along an ability scale, denoted by the Greek letter theta (θ). The b parameter is defined as the point along the ability scale,

Figure 5.1  Item Characteristic Curve (ICC).

This can be visualized by drawing a horizontal line from the position along the y-axis where the probability of endorsing an item is 0.50, then drawing a vertical line straight down to the ability scale (the x-axis) (Figure 5.1). The exact point along the ability scale is the numeric value of the b parameter. Item difficulty or location can be interpreted relative to the midpoint of the ability scale, which is centered at zero, has a unit of one, and ranges from negative infinity to positive infinity (although the θ scale is usually depicted between –3 and +3). If b is less than zero, the item is considered "easier," or functions at lower ability levels. If b is greater than zero, the item is considered "harder," or functions at higher ability levels. It is important to highlight here that modern measurement approaches have strong roots in the field of education, where the assessment of "ability" was central. In the field of social work and other allied health disciplines, many psychosocial scales measure latent concepts such as depression, physical functioning, and quality of life, where the notion of "ability" does not make sense (e.g., depression "ability"). Rather, the term "trait" is used. So, items with lower b values function at, or target, lower trait levels, and items with higher b values function at higher trait levels of a particular concept. Throughout this chapter, and in the existing literature, the terms ability and trait are used interchangeably; similarly, the terms location and difficulty are used interchangeably. The slope parameter (a), which is also referred to as the discrimination parameter, is defined as the slope of the ICC at b, or the steepest region of the curve (Figure 5.1).

Figure 5.2  Two ICCs with Different a Parameter Values.

The a parameter describes how well an item is able to differentiate, or discriminate, between examinees who have abilities below and above the item's location. The steepness of the slope at b is used to interpret the discrimination parameter: the steeper the slope, the better the item can discriminate between individuals of differing ability. Figure 5.2 depicts the ICCs of two items with the same difficulty parameter value but different discrimination parameter values. Of these two items, the item depicted with the solid line has a steeper slope and is better at discriminating between individuals of differing ability. This is because individuals with a trait score of –1 have about an 18% chance of endorsing that item while individuals with a trait score of 1 have about an 85% chance of endorsing it, whereas for the item depicted with the dashed line, individuals with a trait score of –1 have about a 30% chance of endorsing the item while individuals with a trait score of 1 have about a 70% chance of endorsing it.

In situations when the item's response format is polytomous (e.g., Likert type) rather than dichotomous (e.g., correct/incorrect), the a and b parameters are used to create ICCs and categorical response curves (CRCs). The ICCs generated from items with polytomous response formats are difficult to interpret; CRCs provide more meaningful interpretations. Specifically, CRCs plot the probability of endorsing a particular category across the latent trait (theta) scale, with one curve plotted for each response option. Figure 5.3 displays an example of the CRCs for an item with four response options. Ideally, CRCs should have peaked curves that are ordered hierarchically, as exemplified in Figure 5.3.

Figure 5.3  Ideal Categorical Response Curves for an Item with Four Response Options.

Based on Figure 5.3, individuals with a trait score of 1 have a near-zero probability of endorsing response options 1 and 2, about a 10% probability of endorsing response option 3, and about a 90% probability of endorsing response option 4. Figure 5.4 displays a less ideal set of categorical response curves for an item with four response options. Based on that figure, individuals with a trait score of 1 have about an 8%, 18%, 25%, and 50% probability of endorsing the 1st, 2nd, 3rd, and 4th response categories, respectively. While the probabilities still increase with the order of the response options, the CRCs associated with this item are less discriminating than those in Figure 5.3.

A variety of mathematical IRT models can be used to estimate the location (b) and slope (a) parameters (Bock, 1972; Masters, 1982; Rasch, 1960; Samejima, 1969). While they differ slightly in their parameterization, they all essentially specify location (b) and slope (a) parameters for each response option (Thissen & Steinberg, 1986). These quantitative parameter estimates are then used to generate ICCs and CRCs (in polytomous situations), which can serve as visual aids when evaluating and/or refining measures. In addition to being used to create ICCs or CRCs, parameter estimates, along with individuals' response vectors (i.e., how individuals respond to a set of items), are used to estimate ability scores (Hambleton et al., 1991; Lord & Novick, 1968). These ability scores (θ), as Lord conceptualized them, are not the same as the "number-of-items-correct" or summed scores used in CTT.

Figure 5.4  Less Ideal Categorical Response Curves for an Item with Four Response Options.

To distinguish them from summed scores, these θ estimates are often referred to as IRT scores, or trait scores.
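To make the ICC mechanics concrete, the relationship between θ, a, and b can be computed directly. The following is a minimal Python sketch (the a and b values are hypothetical, chosen only for illustration); it verifies that the probability of endorsement is exactly 0.50 at θ = b, and shows how a larger slope separates low- and high-trait respondents more sharply, much like the solid versus dashed curves in Figure 5.2:

```python
import numpy as np

def icc_2pl(theta, a, b):
    """Two-parameter logistic ICC: probability of endorsing an item."""
    return 1.0 / (1.0 + np.exp(-a * (np.asarray(theta, dtype=float) - b)))

# At theta = b, the endorsement probability is exactly 0.50
print(icc_2pl(0.0, a=1.5, b=0.0))       # 0.5

# A larger slope (a) separates low- and high-trait respondents more sharply,
# which is what "discrimination" means:
theta = np.array([-1.0, 1.0])
print(icc_2pl(theta, a=0.8, b=0.0))     # ~[0.31, 0.69] -- flatter, less discriminating
print(icc_2pl(theta, a=1.7, b=0.0))     # ~[0.15, 0.85] -- steeper, more discriminating
```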

PROPERTIES OF IRT MODELS

IRT trait scores confer several advantages over traditional summed scores owing to a number of unique IRT model properties. One of the most important properties of IRT models is that reliability, or measurement precision, is described as a continuous function conditional on values of θ. In other words, reliability is calculated and uniquely expressed for different ability values rather than assumed to be constant, as in CTT. To calculate measurement precision, each item's information function curve can be computed, which indicates the range over θ where the item is best at discriminating among individuals. Any number of item information curves can be summed together to create a test information function curve, which tells how well a set of items estimates ability over the whole range of ability scores (Figure 5.5). When the amount of information is large at a given ability level, an individual whose true ability is at that level can be estimated with high precision.

Figure 5.5  Example of a Test Information Function Curve.

Conversely, when the amount of information is small at a given ability level, an individual whose true ability is at that level cannot be estimated with much precision. Based on the test information function curve in Figure 5.5, information reaches its maximum at an ability level of about –0.25 and is fairly high over the ability range of –0.75 to 0.50. Within this range, ability is estimated with some precision; outside this range, precision decreases drastically. In a general-purpose test, the test information curve would ideally be a horizontal line at a relatively high information value, such that all ability levels would be estimated with the same precision (Baker, 2001). Taking the inverse of the square root of the test information curve yields the standard error function (Fisher, 1925), which describes the degree of measurement error associated with different IRT scores. In the most ideal setting, the standard error of measurement function would take low values and the shape of a horizontal line (constant precision across the range of abilities). This is rarely the case, however, as the items that make up a measure often provide varying amounts of information at varying locations along the ability continuum. Therefore, modeling this function and accounting for it in the ability score is a more accurate representation of reality (Embretson, 1996).
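For the 2PL model, the item information function has the simple closed form I(θ) = a²P(θ)[1 − P(θ)], which makes the information-to-standard-error relationship described above easy to demonstrate. A minimal sketch with hypothetical item parameters:

```python
import numpy as np

def icc_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information_2pl(theta, a, b):
    """2PL item information: I(theta) = a^2 * P(theta) * (1 - P(theta))."""
    p = icc_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

theta = np.linspace(-3, 3, 121)
items = [(1.5, -1.0), (1.2, 0.0), (1.8, 0.5)]   # hypothetical (a, b) pairs

# Test information is the sum of the item information functions ...
test_info = sum(item_information_2pl(theta, a, b) for a, b in items)
# ... and the standard error function is its inverse square root.
se = 1.0 / np.sqrt(test_info)

peak = np.argmax(test_info)
print(f"information peaks near theta = {theta[peak]:.2f}, where SE = {se[peak]:.3f}")
```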

Invariance of ability and item parameters is another property unique to IRT models. The invariance-of-ability property implies that ability scores remain constant despite the administration of different items. This is one of the main driving forces behind computer-adaptive testing (CAT), which permits different people to take different sets of items while still making it possible to compare different examinees' ability scores. IRT does this by aligning all items along a common scale, so that any subset of items fielded can be tailored to the respondent but map back to the same scale. In addition, the number of items examinees need to take in order to achieve a reliable estimate of their ability is often much lower than the number of items needed on a traditional test (Gibbons et al., 2011; Ware, Bjorner, & Kosinski, 2000). This is because items are specifically targeted to the level of the individual, so only the most informative items are fielded to that individual. Item invariance describes the phenomenon in which estimated item parameters are constant across a number of different samples. This allows unbiased estimates of item parameters to be obtained from unrepresentative samples, so long as the data fit the model (Embretson, 1996). Said another way, the values of the estimated item parameters should be the same when estimated using completely different samples, such as males and females (i.e., the ICC curves should superimpose on each other). This is in contrast to CTT, where scale properties are sample dependent. While strong evidence supports this property (Embretson, 1996; Hambleton & Jones, 1993), it does not hold in all cases. This may be due to poorly written items or items that are interpreted differently by different samples. When item parameters behave differently in subgroups after controlling for ability, an item is considered to have differential item functioning (DIF) (Hambleton et al., 1991). DIF can occur in item location, discrimination, or both. When DIF is present, the resulting item characteristic curves differ across the subgroups. Figure 5.6 demonstrates DIF in item location.

Figure 5.6  Example of DIF in Location.

Figure 5.7  Example of DIF in Discrimination.

In the Figure 5.6 example, the same item functions at higher ability levels in sample 2 than in sample 1. Figure 5.7 demonstrates DIF in discrimination. In that example, the same item is better at discriminating between individuals of differing ability levels in sample 1 than in sample 2. In ideal test settings, particularly for "high-stakes" tests, all items should be DIF-free. However, if significant DIF has been identified in certain items, and those items are important to retain in a test, a two-group model that incorporates the identified DIF may be specified and estimated.
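The two kinds of DIF in Figures 5.6 and 5.7 can be mimicked numerically by giving the same item different parameter values in each group. A small sketch with hypothetical group-specific parameters:

```python
import numpy as np

def icc_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 121)

# DIF in location (as in Figure 5.6): same slope, different b across groups
p_group1 = icc_2pl(theta, a=1.5, b=-0.5)
p_group2 = icc_2pl(theta, a=1.5, b=0.5)   # item functions at higher ability in group 2

# DIF in discrimination (as in Figure 5.7): same location, different a
q_group1 = icc_2pl(theta, a=2.0, b=0.0)
q_group2 = icc_2pl(theta, a=0.8, b=0.0)   # flatter, less discriminating in group 2

# For a DIF-free item the two curves would superimpose; here they visibly
# diverge, for example at theta = 0:
print(icc_2pl(0.0, 1.5, -0.5), icc_2pl(0.0, 1.5, 0.5))  # ~0.68 vs. ~0.32
```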

UNDERLYING ASSUMPTIONS

In addition to the model properties described earlier, there are several key assumptions upon which IRT models are contingent: (1) unidimensionality, (2) local independence, and (3) monotonicity. The first assumption is that the items within a measure demonstrate unidimensionality. This assumption can be evaluated by examining the output of an exploratory factor analysis (EFA). In particular, the relative size of the eigenvalues associated with a principal components analysis can be used (Hamer, David Batty, Seldenrijk, & Kivimaki, 2011; Hamer, Malan, et al., 2011; Pleban et al., 2005). Based on the Kaiser criterion, an eigenvalue greater than one is conventionally used to identify factors or dimensions (Kaiser, 1960).


While these are helpful rules of thumb, a measure can potentially have several eigenvalues greater than one and still be sufficiently unidimensional for IRT analysis (McLeod, Swygert, & Thissen, 2001). This applies when the dominant factor is so strong that the estimation of ability levels is not affected by the presence of minor factors (Embretson & Reise, 2000). For example, a principal components analysis of the 30-item Center for Epidemiological Studies Depression Scale (CES-D) resulted in five eigenvalues greater than one. The first eigenvalue (13.37) was considerably greater than the subsequent four (1.6, 1.5, 1.4, 1.1). Additionally, the majority of the loadings on the first factor were greater than 0.35, with an average of 0.65. Based on those findings, the CES-D was considered sufficiently unidimensional for IRT modeling (Orlando, Sherbourne, & Thissen, 2000).

The second assumption, local independence, is technically related to the assumption of unidimensionality. Local independence means that when the abilities influencing test performance are held constant, individuals' responses to any pair of items are statistically independent (Hambleton et al., 1991). This is conceptually equivalent to creating hypothetical intervals, or bins, along the trait (theta) scale that are so small that all individuals who fall into a given interval essentially have the same underlying trait, and then evaluating the items within each interval to make sure they are statistically independent. What the local independence assumption implies is that items are unrelated except for the fact that they measure the same underlying trait or ability. Violations of this assumption can occur when a set of items share a similar stem, rely on similar content, or are presented sequentially (Edelen & Reeve, 2007). This assumption can be evaluated by examining the output of an IRT calibration. In IRTPRO®, an "approximately" standardized LD χ2 statistic is computed to check for violations of local independence, as proposed by Chen & Thissen (1997). The LD χ2 statistic compares observed and expected frequencies in the two-way marginal tables for each item pair. To help interpret the standardized LD χ2 statistic values, IRTPRO® prints them in red if the observed covariation between responses to a pair of items exceeds what is predicted by the model, and in blue if the observed covariation is less than fitted. A cluster of red values in the table indicates that the corresponding items may measure an unmodeled dimension.


Also, since the LD χ2 is an "approximately" standardized statistic, values that exceed 10.0 are considered very large and unexpected (Chen & Thissen, 1997). Another way of assessing violations of the local independence assumption is by evaluating the values of the slope estimates. If a pair of items displays excess covariation, or dependence, they may have very high slope estimates (e.g., > 4) relative to other items in the measure. When this occurs, omitting one of the items and then recalibrating the measure is suggested. If excess dependencies among items pose threats to the local independence assumption, a sensitivity analysis excluding those items will yield meaningful differences in parameter estimates. For example, the range of parameter values may change, or the rank ordering of the items according to their slope or location parameters may change (Edelen & Reeve, 2007). It should be noted that the local independence assumption is never precisely met, especially since items often share similar stems or relate to a shared set of information (e.g., a graph or passage). Fortunately, simulation studies suggest that item parameters are robust to minor violations of this assumption, particularly when there is a clear dominant first factor (Holloway & Watson, 2002).

Monotonicity refers to the requirement that the probability of endorsing an item continuously increase as an individual's trait level increases. For instance, referring back to Figure 5.1, an individual with a trait (theta) score of 2 will have a higher probability of endorsing, or correctly answering, that item than an individual with a trait score of 1.5, and in turn an individual with a trait score of 1.5 will have a higher probability of endorsing the same item than an individual with a trait score of 1.0. For items with polytomous response formats, monotonicity can be evaluated by visually inspecting the order of the CRCs.
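The eigenvalue screen for unidimensionality described above is straightforward to script. The sketch below is a rough check only: it uses Pearson rather than polychoric correlations, a simplification for ordinal Likert data, and the input matrix `data` is a hypothetical placeholder:

```python
import numpy as np

def unidimensionality_screen(responses):
    """Eigenvalue screen for unidimensionality: responses is a
    respondents-by-items array of item scores."""
    corr = np.corrcoef(responses, rowvar=False)        # inter-item correlations
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]  # largest first
    n_kaiser = int(np.sum(eigvals > 1.0))              # factor count by Kaiser criterion
    dominance = eigvals[0] / eigvals[1]                # dominance of the first factor
    return eigvals, n_kaiser, dominance

# Hypothetical usage with an imported item-response matrix `data`:
# eigvals, n_kaiser, dominance = unidimensionality_screen(data)
# Several eigenvalues above 1 can still be acceptable if the first eigenvalue
# dwarfs the rest, as in the CES-D example (13.37 vs. 1.6).
```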

MODEL CHOICE AND FIT

As mentioned earlier, several mathematical IRT models exist. It is essential to choose the most appropriate model and then evaluate its fit. For illustrative purposes, the equation for the 2PL model is given below. Understanding the computational specifics is not necessary at this introductory level; however, for those interested, Baker (2001) provides a nice step-by-step overview of the mathematics associated with select IRT models.


The decision to use one model over another depends on several factors.

P(θ) = 1 / (1 + e^(−L)) = 1 / (1 + e^(−a(θ − b)))

where
e = the constant 2.718
b = the difficulty parameter
a = the discrimination parameter
L = the logistic deviate, a(θ − b)
θ = the ability level
P(θ) = the probability of getting an item correct at ability level θ

First, it is important to consider the format of the item response categories. For items with dichotomous response options (i.e., correct/incorrect), the one-, two-, and three-parameter logistic models (1PL, 2PL, 3PL) can be used (Birnbaum, 1968; Lord, 1953). For polytomous (i.e., Likert-type) items, the choice of model depends on whether the response categories are ordered or nominal. For ordered categorical responses, the Graded Response Model (Samejima, 1969), Partial Credit Model (Masters, 1982), Generalized Partial Credit Model (Muraki, 1992), or Rating Scale Model (Andrich, 1978) can be used. For nominal categorical responses, the Nominal Model (Bock, 1972) can be used. Second, consideration should be given to whether the discrimination parameter can realistically be assumed to be constant across a set of items. If this assumption can reasonably be made, using a model from the Rasch family is appropriate (Rasch, 1960). Within the Rasch family of models, the discrimination parameter is constrained to be equal across items, which requires one less parameter to be estimated. The resulting parsimony makes Rasch models appealing and less computationally demanding. While this is ideal, the data often do not allow the assumption to be reasonably made. Third, consideration should be given to whether guessing may occur when individuals respond to items on a test, as often happens with items that assess cognitive abilities. If guessing is a possibility, the 3PL model is most appropriate.


Implicit in all IRT models is the assumption that the model is correct. However, IRT models are simply mathematical tools used to approximate reality for the purpose of explaining data. In real life no model perfectly fits the data or, as the statistician George Box famously phrased it, "All models are wrong, some are useful" (Box & Draper, 1987). The utility of a model therefore depends on the extent to which the data actually fit it. The question becomes: how much model-data misfit is too much? Despite the voluminous literature on IRT modeling, there are no absolute criteria for assessing overall model fit or item-level fit (Hambleton et al., 1991). However, several approaches have frequently been used. To assess overall model fit, nested models (e.g., 2PL vs. 3PL) can be compared by examining the difference in -2*loglikelihood (-2LL), which is distributed as chi-square with degrees of freedom equal to the difference in the number of estimated parameters between the two models (Thissen, 1991). When interpreting nested model comparisons, the null hypothesis is that the more parsimonious model fits as well as the more general one. To assess fit at the item level, several goodness-of-fit statistics have been developed for the Rasch family of models (Anderson, 1973; Rost & von Davier, 1994; Wright & Mead, 1977). For non-Rasch models, alternative fit indices such as the S-X² statistic have also been developed (Orlando & Thissen, 2003). When interpreting an item-level fit statistic, the null hypothesis is that the item fits well; therefore the goal is not to have significant results, that is, not to reject the null. It should be mentioned, though, that these goodness-of-fit indices should be interpreted with some caution, as there are well-known and serious flaws associated with them (Hambleton et al., 1991), most notably their sensitivity to sample size: almost any empirical departure from the model under consideration will lead to a rejection of the null if the sample size is sufficiently large. In light of this, several ancillary techniques have been proposed to support model fit. Three of the most common include (1) providing evidence that the model assumptions are met to a reasonable degree by the test data; (2) determining whether model properties, such as ability and item invariance, hold in the test data; and (3) assessing the accuracy of model predictions using the test data (Hambleton et al., 1991). Techniques for checking model assumptions have already been described. To test for invariance in ability scores, one can compare individuals' estimated ability scores using different items within a calibrated item set, preferably items that vary widely in difficulty. If the model fits, a plot of the pairs of ability estimates should support a strong linear relationship.


To test for item invariance, the estimated item parameters can be compared across two different groups (e.g., high-ability vs. low-ability groups, ethnic groups, or sexes). If the model fits, item calibrations using different groups should yield similar parameter estimates. To test the accuracy of model predictions, the relationship between ability scores and known proximal variables can be examined.
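The nested-model comparison described in this section amounts to a likelihood-ratio test, which is easy to carry out once the -2LL values are in hand. A minimal sketch, with hypothetical -2LL values standing in for actual software output:

```python
from scipy.stats import chi2

def nested_model_lrt(neg2ll_constrained, neg2ll_general, df_diff):
    """Likelihood-ratio test for nested IRT models: the difference in
    -2*loglikelihood is chi-square distributed, with df equal to the
    difference in the number of estimated parameters."""
    g2 = neg2ll_constrained - neg2ll_general   # constrained model can fit no better
    p = chi2.sf(g2, df_diff)
    return g2, p

# Hypothetical -2LL values for a 1PL (slopes constrained equal) vs. a 2PL fit
# of a 10-item test; the 2PL estimates 9 additional slope parameters.
g2, p = nested_model_lrt(neg2ll_constrained=12480.6, neg2ll_general=12421.9, df_diff=9)
print(f"G2 = {g2:.1f}, df = 9, p = {p:.4g}")  # a small p favors the less constrained 2PL
```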

SAMPLE SIZE REQUIREMENTS

As with the procedures used for model fit testing, there are no set rules about sample size requirements for IRT parameter estimation. However, some general guidelines have been published (Edelen & Reeve, 2007). First, increasing sample sizes are needed for increasingly complex models. For instance, when using Rasch models, sample sizes as small as 100 have been shown to be adequate (Linacre, 1994). For more complex models, there is some debate about sample size requirements. By one account, a sample size of 500 is recommended to estimate stable parameters (Tsutakawa & Johnson, 1990), but others have suggested that sample sizes as small as 200 are adequate (Orlando & Marshall, 2002; Thissen, Steinberg, & Gerrard, 1986). Aside from the issue of model complexity, Edelen & Reeve (2007) suggest that smaller sample sizes can be used when the item response data meet the underlying assumptions of IRT. In other words, larger sample sizes are required to estimate stable parameters for poorly related items (Thissen, 2003); correspondingly, larger sample sizes are required when items are poorly written. Second, sample size requirements vary with the underlying purpose of the item calibration, since varying levels of precision may be acceptable for different purposes. For example, if items are being calibrated to generate parameter estimates for item banks used in high-stakes exams like the Graduate Record Examination (GRE), large sample sizes are needed to produce estimates with low standard errors. However, smaller sample sizes are adequate when the purpose is to evaluate the properties of a questionnaire (Edelen & Reeve, 2007). Third, characteristics of the sample distribution may also influence the optimal sample size, since parameter estimates are generated from the sample.


For instance, a large sample of homogeneous individuals who share similar ability or trait levels may not necessarily be better than a smaller sample that includes individuals of varying ability levels. Having a large homogeneous sample would lead to highly precise parameter estimates, but only for a limited range of ability along the construct. Therefore, the sample should be sufficiently well distributed across trait or ability levels that even items at the extreme ends of the ability continuum have reasonable standard errors.

EXAMPLE OF IRT APPLICATIONS

In this chapter we use the Children of Immigrants Longitudinal Study (Portes & Rumbaut, 2012) to demonstrate the application of IRT using IRTPRO® software. We extracted a sample of 922 participants (n = 294 Vietnamese Americans, n = 628 Cuban Americans) who took the Rosenberg Self-Esteem Scale (RSES) to exemplify several applications of IRT. The RSES consists of 10 items rated on a 4-point Likert scale with the options agrees a lot, agrees a little, disagrees a little, and disagrees a lot. Table 5.1 lists the items on the RSES.

Table 5.1  Description of Items on the Rosenberg Self-Esteem Scale

1. I feel I am a person of worth.
2. I feel like a person who has a number of good qualities.
3. I am inclined to feel like a failure.
4. I am able to do things as well as most other people.
5. I do not have much to be proud of.
6. I take a positive attitude toward myself.
7. I am satisfied with myself.
8. I wish I could have more respect for myself.
9. I certainly feel useless at times.
10. At times I think I am no good at all.

As with any other approach, the first step is to look at the data and prepare it for analysis. In particular, the data should be coded in such a way that higher values reflect higher levels of the underlying trait, in the present example, self-esteem. This is consistent with how the response options were coded for all positively worded items


(e.g., "I feel I am a person of worth"): 0 = disagrees a lot, 1 = disagrees a little, 2 = agrees a little, and 3 = agrees a lot. It is also important to recode all negatively worded items so that they reflect this ordering. For example, item #8 states, "I wish I could have more respect for myself"; items like this were recoded as 0 = agrees a lot, 1 = agrees a little, 2 = disagrees a little, and 3 = disagrees a lot. Once all the data are recoded, they are ready to be imported into a software program that can run IRT analyses. Several such programs are available; the most commonly used include IRTPRO®, Multilog®, BILOG®, PARSCALE®, M-PLUS®, flexMIRT®, RUMM2030®, Winsteps®, and STATA®. Two free R® packages can also run IRT analyses: "ltm" for IRT models and "eRm" for Rasch models (Rizopoulos, 2006; Mair, Hatzinger, & Maier, 2012). More information about these programs may be obtained from the respective software vendors, and a few publications describe them in more detail (Childs & Chen, 1999; Paek & Han, 2013). For the purposes of this demonstration, IRTPRO 2.1® will be used to run the analysis because of its easy-to-use Windows interface (i.e., no coding required). A step-by-step approach will be presented; for additional support, readers are encouraged to refer to the IRTPRO® user guide.

To import data into IRTPRO®, open the software and click on "Import Data" (Figure 5.8). IRTPRO® can accommodate several different data file formats (Excel, SPSS, STATA, etc.). For the purposes of this example, the following Excel CSV file was used: "RSES_Cuban & Vietnamese Americans.csv". This file includes the responses to all 10 items of the RSES, as well as individuals' sex (1 = Male, 2 = Female) and self-identified ethnicity (1 = Cuban American, 2 = Vietnamese American). After importing the data, click on the "Analysis" tab and select "Unidimensional IRT" (Figure 5.9). A new screen will then populate (Figure 5.10). In the "Description" tab of that screen, a title for the analysis can be given; in this example the following title was used: "Graded Response Model Calibration of RSES with Cubans & Vietnamese." Note here that the Graded Response Model will be used, since the items on the RSES have polytomous response options.
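The recoding step described above can also be scripted before import. A minimal pandas sketch, assuming the CSV file and column names used in this example (v101–v110) and that the raw file codes every item 0–3 in the agreement direction; the output file name is hypothetical:

```python
import pandas as pd

df = pd.read_csv("RSES_Cuban & Vietnamese Americans.csv")

# Items #3, #5, #8, #9, and #10 are negatively worded (see Table 5.1);
# reverse-score them so that higher values always reflect higher self-esteem.
negative_items = ["v103", "v105", "v108", "v109", "v110"]
df[negative_items] = 3 - df[negative_items]

df.to_csv("RSES_recoded.csv", index=False)  # hypothetical output file name
```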


Figure 5.8  Opening Display Page of IRTPRO®.

Figure 5.9  Starting the Analysis.

Additionally, nested model testing (comparing the Graded Response Model with the Partial Credit Model) in a previous publication confirmed that the Graded Response Model was a better fit (Gray-Little, Williams, & Hancock, 1997). After providing a title, click on the "Items" tab and add items v101–v110 (i.e., items #1–10 of the RSES) for analysis (Figure 5.11). Then, click on the "Models" tab to review each item's response categories and to select the type of IRT model to be used (Figure 5.12). As a default, IRTPRO® selects the Graded Response Model for all polytomous items and the 2PL for all dichotomous items. To specify a different IRT model, left-click the cell associated with the item's model and choose another type of model.


Figure 5.10  Analysis Screen: “Description” Tab.

After selecting an appropriate IRT model, in this case the Graded Response Model, click the "Run" button at the bottom right corner. An output file will then be displayed and automatically saved in the same folder where the original data were stored; a portion of it is shown in Figure 5.13. The following steps can then be taken to interpret the output: first, check model assumptions and fit; second, evaluate and interpret the item parameters; and third, visually inspect the categorical response curves, the information function curves, and the test information function curve.

The three main IRT assumptions are (1) unidimensionality, (2) local independence, and (3) monotonicity. Unidimensionality can be checked by conducting an exploratory or confirmatory factor analysis. This analysis is not provided by IRTPRO® 2.1 and must be done with a different software program. For the present analysis, a principal components exploratory factor analysis was conducted using SPSS®.


Figure 5.11  Analysis Screen: “Items” Tab.

indicate a two-​factor solution (Figure 5.14). However, the first factor (Eigen value  =  3.9) was substantially larger than the second (Eigen value = 1.3), and explained 38% of the variance. Based on these results, it can be concluded that the RSES is sufficiently unidimensional. The standardized LD χ2 statistic values and slope parameter, a, estimates provided from IRTPRO® can be evaluated to assess potential violations to the local independence assumption. Table 5.2 displays the standardized LD χ2 statistic values provided by IRTPRO®. Values that exceed 10.0 are considered very large and unexpected. Based on those values, several item pairs (example:  items #1 and #2) have LD χ2 statistic > 10. However, when examining the slope values across all items (Table 5.3), none exceed 4. These results suggest that the local independence assumption was not “completely” met. However, as described earlier, item parameters are robust to minor violations of this assumption, particularly when there is a clear dominant first factor.


Figure 5.12  Analysis Screen: “Models” Tab.

Figure 5.13  IRTPRO® Output File.

Figure 5.14  SPSS® Scree Plot of the 10-item RSES.

Visually inspecting the CRCs can be done to assess monotonicity; ideally, the CRCs should be hierarchically ordered. Figure 5.15 displays the CRCs generated by IRTPRO® for this analysis. On visual inspection, they are all hierarchically ordered (i.e., the curve for response option 0 precedes the curve for response option 1, the curve for response option 1 precedes the curve for response option 2, and so on). Additionally, monotonicity is further supported by the ranges of the estimated b1, b2, and b3 parameter values in Table 5.3: b1 = –3.57 to –1.78, b2 = –2.60 to –0.29, and b3 = –0.66 to 0.87. These values suggest that individuals with higher trait levels of self-esteem are more likely to endorse progressively higher response categories. Based on these results, there is sufficient evidence of monotonicity.

Before moving on to the interpretation of item parameters, item fit should be evaluated. The S-X² statistic calculated by IRTPRO® can be used to assess item fit (Table 5.4). Recall that the null hypothesis is that the item fits well; therefore the goal is not to have significant results.
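Returning briefly to the monotonicity check: the Graded Response Model's category probabilities can be computed directly from cumulative 2PL curves, here using item v101's estimates from Table 5.3. The following sketch, offered for illustration, confirms that the expected item score rises monotonically with theta:

```python
import numpy as np

def grm_category_probs(theta, a, bs):
    """Graded Response Model: category probabilities obtained by differencing
    cumulative 2PL curves; bs are the ordered thresholds (b1 < b2 < b3
    for a four-category item)."""
    cum = [np.ones_like(theta)]                                  # P(X >= 0) = 1
    cum += [1.0 / (1.0 + np.exp(-a * (theta - b))) for b in bs]  # P(X >= k)
    cum.append(np.zeros_like(theta))                             # P(X >= 4) = 0
    return [cum[k] - cum[k + 1] for k in range(len(bs) + 1)]

theta = np.linspace(-3, 3, 121)
# Item v101 estimates from Table 5.3: a = 1.15, b1 = -3.52, b2 = -2.60, b3 = -0.66
probs = grm_category_probs(theta, a=1.15, bs=[-3.52, -2.60, -0.66])

# Monotonicity check: the expected item score should increase with theta
expected = sum(k * p for k, p in enumerate(probs))
print(bool(np.all(np.diff(expected) > 0)))  # True for ordered thresholds
```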

Table 5.2  Standardized LD χ2 Statistic Output Table
Marginal Fit (χ2) and Standardized LD χ2 Statistics for Group 1

Item  Label  Marginal χ2     1     2     3     4     5     6     7     8     9
1     v101   0.3
2     v102   0.2          16.0
3     v103   1.1          10.2  11.1
4     v104   0.0           3.8   4.5   2.4
5     v105   1.4           7.1   9.5  12.0  12.1
6     v106   0.3           3.1   0.1   8.9   2.1  16.3
7     v107   0.3           2.8   0.1   7.2   3.0  20.2  12.7
8     v108   1.3          12.1  10.7   7.1   7.4  11.6  11.1   8.4
9     v109   0.4           9.9   3.0   5.1  11.7   5.6  12.9   9.3   6.6
10    v110   0.5           5.4   2.7   3.0   3.9   4.8   5.5   6.2  14.6  30.9

Figure 5.15  Categorical Response Curves for the 10-item RSES (one panel per item, v101–v110).

Table 5.3  IRT Parameter Estimates (a, b1, b2, b3) and Standard Errors (s.e.)
Graded Model Item Parameter Estimates for Group 1, logit: a(θ − b)

Item  Label  a     s.e.  b1     s.e.  b2     s.e.  b3     s.e.
1     v101   1.15  0.11  –3.52  0.31  –2.60  0.21  –0.66  0.08
2     v102   1.78  0.15  –3.36  0.27  –2.23  0.14  –0.50  0.06
3     v103   1.93  0.15  –2.55  0.16  –1.43  0.09  –0.58  0.06
4     v104   1.28  0.11  –3.57  0.30  –2.45  0.18  –0.46  0.07
5     v105   1.82  0.14  –2.29  0.14  –1.33  0.08  –0.55  0.06
6     v106   2.08  0.16  –2.40  0.14  –1.45  0.08  –0.09  0.05
7     v107   1.94  0.15  –2.50  0.15  –1.45  0.08  –0.10  0.05
8     v108   0.99  0.09  –1.78  0.15  –0.29  0.08   0.87  0.10
9     v109   1.35  0.10  –2.07  0.14  –0.49  0.07   0.67  0.08
10    v110   1.55  0.12  –1.87  0.12  –0.77  0.07   0.04  0.06

Table 5.4  Output for Item Fit Statistics
S-X² Item Level Diagnostic Statistics

Item  Label  X²     d.f.  Probability
1     v101   76.41  52    0.0153
2     v102   60.32  39    0.0158
3     v103   89.11  47    0.0002
4     v104   44.26  45    0.5043
5     v105   95.62  51    0.0002
6     v106   55.28  46    0.1637
7     v107   58.68  46    0.0992
8     v108   98.34  50    0.0001
9     v109   85.67  51    0.0017
10    v110   53.58  50    0.3379

It is important to note, however, that the p-values need to be adjusted for multiple comparisons to control for potential Type I error. When applying the Benjamini-Hochberg adjustment to this example, only p-values below 0.001 are considered significant (Benjamini & Hochberg, 1995). Based on that cutoff, items #3, #5, and #8 appear to be misfitting. Given the sensitivity of the S-X² statistic to large sample sizes, this finding is not surprising and should be interpreted with caution. Taken together, from the evaluation of model assumptions and item-level fit, one can conclude that the calibration of the 10-item RSES using the Graded Response Model has good fit, is sufficiently unidimensional, and is robust to possible violations of the local independence assumption. The parameter estimates from this calibration can now be meaningfully interpreted.
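For readers who wish to reproduce the multiple-comparison step, the following is a sketch of the Benjamini-Hochberg step-up procedure applied to the ten S-X² p-values in Table 5.4. Note that the set of flagged items depends on the false discovery rate q chosen; the cutoff of 0.001 reported above corresponds to flagging items #3, #5, and #8:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: returns one boolean flag per
    p-value while controlling the false discovery rate at q."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    m = len(p)
    passed = p[order] <= (np.arange(1, m + 1) / m) * q
    k = (np.max(np.where(passed)[0]) + 1) if passed.any() else 0  # largest passing rank
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# S-X2 p-values for items 1-10 from Table 5.4
pvals = [0.0153, 0.0158, 0.0002, 0.5043, 0.0002,
         0.1637, 0.0992, 0.0001, 0.0017, 0.3379]
flags = benjamini_hochberg(pvals, q=0.05)
print(np.where(flags)[0] + 1)  # item numbers flagged at this particular q
```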


A helpful first step when interpreting the parameter estimates is to reexamine the range in parameter values. Based on Table 5.3, the discrimination parameter a ranges from 0.99 to 2.08. These results suggest that all items are at least moderately discriminating, with item #8 ("I wish I could have more respect for myself") displaying the least and item #6 ("I take a positive attitude toward myself") displaying the most discrimination. If all other factors were equal and one had to decide between retaining item #8 or item #6, item #6 would be the stronger candidate to retain because it is more discriminating. Examining the range in item difficulty/location parameter values provides broad insight into where along the latent trait scale this set of 10 items targets (i.e., what range of self-esteem is being measured). Based on Table 5.3, the item difficulty/location parameter values range from b1 = –3.57 to –1.78, b2 = –2.60 to –0.29, and b3 = –0.66 to 0.87. These values suggest that the 10-item RSES targets low (b = –3.57) to moderate (b = 0.87) levels of self-esteem (recalling that items with lower b values are considered "easier" or function at lower trait levels). If the underlying purpose or goal of this tool were to measure higher levels of self-esteem, these results would provide quantitative evidence to support adding new items that assess higher trait levels.

Sorting the b1 and b3 parameter values from lowest to highest can also provide additional insight. In particular, it provides item-level details regarding where individual items function along the "self-esteem" scale (i.e., what regions of the latent self-esteem [theta] scale each item targets). Table 5.5 displays item parameter estimates sorted by b1 values. Based on that table, items #4, #1, and #2 function at relatively lower regions of the self-esteem scale, while items #9, #10, and #8 function at relatively higher regions (i.e., items #4, #1, and #2 are better at assessing those with low self-esteem, and items #9, #10, and #8 are better at assessing those with moderate to high self-esteem). Table 5.6 displays item parameter estimates sorted by b3 values. Based on that table, items #1, #3, and #5 function at relatively lower regions of the self-esteem scale, while items #10, #9, and #8 function at relatively higher regions. While the ordering of items differs slightly between Tables 5.5 and 5.6, a trend can be seen, with certain items clustering at the lower or higher ends of the difficulty/location continuum.

Potential item redundancy can also be easily detected when items are sorted by their b1 and b3 values. For example, in Table 5.5 the b1 values of items #3 ("I am inclined to feel like a failure") and #7 ("I am satisfied with myself") are relatively similar (b1 = –2.55 vs. –2.50).

Table 5.5  Item Difficulty/Location Parameter, b1, Sorted

Item  a     s.e.  b1     s.e.  b2     s.e.  b3     s.e.
4     1.28  0.11  –3.57  0.30  –2.45  0.18  –0.46  0.07
1     1.15  0.11  –3.52  0.31  –2.60  0.21  –0.66  0.08
2     1.78  0.15  –3.36  0.27  –2.23  0.14  –0.50  0.06
3     1.93  0.15  –2.55  0.16  –1.43  0.09  –0.58  0.06
7     1.94  0.15  –2.50  0.15  –1.45  0.08  –0.10  0.05
6     2.08  0.16  –2.40  0.14  –1.45  0.08  –0.09  0.05
5     1.82  0.14  –2.29  0.14  –1.33  0.08  –0.55  0.06
9     1.35  0.10  –2.07  0.14  –0.49  0.07   0.67  0.08
10    1.55  0.12  –1.87  0.12  –0.77  0.07   0.04  0.06
8     0.99  0.09  –1.78  0.15  –0.29  0.08   0.87  0.10

Table 5.6  Item Difficulty/Location Parameter, b3, Sorted

Item  a     s.e.  b1     s.e.  b2     s.e.  b3     s.e.
1     1.15  0.11  –3.52  0.31  –2.60  0.21  –0.66  0.08
3     1.93  0.15  –2.55  0.16  –1.43  0.09  –0.58  0.06
5     1.82  0.14  –2.29  0.14  –1.33  0.08  –0.55  0.06
2     1.78  0.15  –3.36  0.27  –2.23  0.14  –0.50  0.06
4     1.28  0.11  –3.57  0.30  –2.45  0.18  –0.46  0.07
7     1.94  0.15  –2.50  0.15  –1.45  0.08  –0.10  0.05
6     2.08  0.16  –2.40  0.14  –1.45  0.08  –0.09  0.05
10    1.55  0.12  –1.87  0.12  –0.77  0.07   0.04  0.06
9     1.35  0.10  –2.07  0.14  –0.49  0.07   0.67  0.08
8     0.99  0.09  –1.78  0.15  –0.29  0.08   0.87  0.10

If one wanted to shorten the RSES scale, retaining only one of those two items is reasonable. Factors that can help with that decision include: (1) comparing each item's discrimination parameter, a, and prioritizing the item with the higher value (in this case, items #3 and #7's discrimination parameters are quite similar); (2) examining the range in b values and selecting the item that functions across a wider range (in this case item #3 had a wider range, b = –2.55 to –0.58); (3) evaluating whether the items display differential item functioning (DIF) across different groups (this will be assessed in a forthcoming section); and (4) considering whether or not there is some underlying theoretical basis for retaining one item over another. Based on criteria (1) and (2), retaining item #3 over #7 can be empirically justified.

In addition to empirically evaluating the parameter estimates, visually inspecting the information function curves and the test information function curve derived from those estimates can be helpful. Figure 5.16 displays the information function curves provided by IRTPRO® for each of the 10 items on the RSES. Based on that figure, some items provide more information than others (i.e., have larger areas under the curve). For example, item #7 provides more information than item #8. However, item #8 provides more information than item #7 in the region of the trait scale from theta = 2 to 3. The test information curve is generated by summing the information function curves of all items on a scale. Figure 5.17 displays the test information curve provided by IRTPRO® for the 10-item RSES; the standard error curve (dotted line) is obtained by taking the inverse of the square root of the test information curve.

Figure 5.16  Information Function Curves for Each Item on the RSES (one panel per item, v101–v110).

Figure 5.17  Test Information Curve for the RSES (with the standard error function overlaid as a dotted line).

Based on the test information curve in Figure 5.17, information is fairly high across the ability range of –3.0 to 1.0. Within this range, ability is estimated with some precision; outside this range, precision decreases drastically. This highlights how measurement reliability can differ greatly across the latent trait scale, and why a single measure of reliability (e.g., Cronbach's alpha) often does not reflect reality.

Up until this point, the main focus has been on interpreting results from an IRT analysis. Sprinkled into the discussion are brief insights into decisions about item selection when one is interested in shortening a scale. It is worth noting that these decisions may vary depending on the underlying measurement goal. In particular, if the goal is to create a screening measure that can precisely identify individuals above or below a clinically important cutoff point, selecting strong items that assess the underlying construct at the ability or location value corresponding to the cutoff point would be appropriate. Conversely, if the goal is to create a shortened measure that can be used to make meaningful interpretations of change scores (i.e., to assess intervention effects or associations with other variables), items should be selected such that there is (1) adequate conceptual content coverage, (2) even item density across a range of ability or location, and (3) minimum loss of precision.


While the reasons why it is important to ensure adequate content coverage and minimum loss of precision are obvious, it is worth clarifying why even item density also matters. Even item density ensures that change scores can be interpreted fairly, particularly if summed scores are used instead of IRT scores. In the absence of even item density, numerical gains may inflate scores for some individuals when there are high concentrations of items in certain regions of the underlying trait (Stucki, Daltroy, Katz, Johannesson, & Liang, 1996).

TESTING FOR DIFFERENTIAL ITEM FUNCTIONING

Last but not least, IRT approaches can be used to examine the metric equivalence of the RSES between the samples of Cuban and Vietnamese Americans in this dataset by testing for differential item functioning (DIF). To do this, go back to the home page, click on the "Analysis" tab at the top of the menu bar, then click on "Unidimensional IRT." The analysis screen will repopulate (Figure 5.18). A two-step process is required to test for DIF.

Figure 5.18  Analysis Screen: Inserting Additional Tests.


First, we need to identify "anchor items" that show no DIF in either the a or b parameters. Anchor items serve to "bridge," or equate, the underlying scale across the two groups. After anchor items are identified, they are used to test for DIF in all other candidate items. To conduct these two steps, left-click on the "Test 1" tab and insert another test (Figure 5.18). Rename Test 1 "Sweep" and Test 2 "Anchored" (Figure 5.19). The "Sweep" test will be used to find anchor items. To keep track of the data file output, the following title can be given to this test: "Sweep: Test all items for DIF given group differences computed with all items equal" (Figure 5.20). Next, click on the "Group" tab and add the "Ethnicity" variable into the group box. This tells IRTPRO® to evaluate items across the two ethnic groups (1 = Cuban American, 2 = Vietnamese American; see Figure 5.21). Next, click on the "Items" tab, add items v101 to v110 into the Items box, and click on the "Apply to all groups" button (Figure 5.22).

Figure 5.19  Analysis Screen: “Sweep” and “Anchored”.


Figure 5.20  Description for “Sweep” Tab.

Figure 5.21  Specifying Groups to Be Compared.


Figure 5.22  Selecting Items.

Then, click on the "Models" tab, and click on the "DIF" button in the bottom left corner. A DIF analysis screen will appear with several options; select "Test all items, anchor all items" and click "OK" (Figure 5.23). Finally, click on the "Run" button to initiate the Sweep analysis. The analysis will produce DIF statistics for all items (Table 5.7). The null hypothesis for each DIF statistic is that the a and b parameters are the same across both ethnic groups; therefore, items with nonsignificant DIF statistics are ideal anchor items. Based on Table 5.7, items #1 and #10 are ideal anchor items. With these anchor items in mind, go back to the "Anchored" test tab (the second part of the process) and repeat all the steps up until you reach the DIF analysis screen. This time, select "Test candidate items, estimate group difference with anchor items" (Figure 5.24). Then, add item #1 (v101) and item #10 (v110) into the "Anchor items" box and all other items into the "Candidate items" box. Click "OK" and hit the "Run" button.
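As a small check on how the DIF chi-square statistics translate into the p-values reported in Tables 5.7 and 5.8, the chi-square survival function can be applied directly; the results match the tables up to rounding of the printed statistics:

```python
from scipy.stats import chi2

# Total DIF chi-squares for the two anchor candidates from Table 5.7
print(round(chi2.sf(6.6, df=4), 4))  # item #1:  ~0.159 (table: 0.1568), nonsignificant
print(round(chi2.sf(7.6, df=4), 4))  # item #10: ~0.107 (table: 0.1083), nonsignificant
```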


Figure 5.23  DIF Sweep Analysis.

Figure 5.24  DIF Anchored Analysis.


Table 5.7  DIF Statistics to Identify Anchor Items
DIF Statistics for Graded Items

Item numbers in:
Group 1  Group 2   Total X²  d.f.  p        X²a   d.f.  p        X²c|a  d.f.  p
1        1          6.6      4     0.1568   2.9   1     0.0887    3.7   3     0.2958
2        2          7.0      4     0.1366   4.4   1     0.0360    2.6   3     0.4608
3        3         16.3      4     0.0026   2.8   1     0.0979   13.5   3     0.0036
4        4         19.2      4     0.0007   2.3   1     0.1318   16.9   3     0.0007
5        5         18.4      4     0.0010   0.1   1     0.7516   18.3   3     0.0004
6        6         11.2      4     0.0244   0.0   1     0.8807   11.2   3     0.0108
7        7          9.7      4     0.0459   0.0   1     0.9256    9.7   3     0.0215
8        8         14.9      4     0.0048   0.0   1     0.8616   14.9   3     0.0019
9        9         14.0      4     0.0074   7.0   1     0.0081    7.0   3     0.0734
10       10         7.6      4     0.1083   1.0   1     0.3082    6.5   3     0.0883

Table 5.8  DIF Statistics Using Items #1 & #10 as Anchor Items
DIF Statistics for Graded Items

Item numbers in:
Group 1  Group 2   Total X²  d.f.  p        X²a   d.f.  p        X²c|a  d.f.  p
2        2          5.9      4     0.2082   2.0   1     0.1576    3.9   3     0.2742
3        3          8.6      4     0.0705   0.8   1     0.3623    7.8   3     0.0500
4        4         21.5      4     0.0003   1.1   1     0.2876   20.4   3     0.0001
5        5          3.1      4     0.5408   0.2   1     0.6762    2.9   3     0.4030
6        6         11.5      4     0.0215   0.0   1     0.8672   11.5   3     0.0094
7        7         12.6      4     0.0133   0.0   1     0.9911   12.6   3     0.0055
8        8         15.4      4     0.0039   0.0   1     0.9104   15.4   3     0.0015
9        9          7.0      4     0.1381   2.6   1     0.1063    4.3   3     0.2293


The analysis will produce DIF statistics using items #1 and #10 as anchor items (Table 5.8). Based on that table, items #2, #3, #5, and #9 do not demonstrate significant DIF between Cuban and Vietnamese Americans. In other words, of the 10 items, 4 items (#4, #6, #7, and #8) demonstrate significant DIF across these two groups. As discussed earlier, this may be due to poorly written items or items that are interpreted differently by different samples.

In summary, this chapter provided a broad overview of IRT and walked the reader through a basic analytic example using data from 922 participants who took the 10-item RSES. As demonstrated, IRT is a modern measurement theory that focuses mainly on the item level as opposed to the test level, and it offers item-level details not provided by classical approaches. Moreover, the analytic example demonstrated that this alternative model-based theory is not only more plausible, but also has the potential to solve practical measurement problems that are difficult to address using classical approaches. In the following chapter, Chapter 6, we discuss the process of developing new instruments for cross-cultural research and demonstrate this process through a semester-long classroom exercise with a group of doctoral students at the Boston College School of Social Work. The students were involved in developing a scale designed to measure empathy, from conceptualization to the assessment of its validity and reliability.


6

Developing New Instruments: Empathy Scale

Developing new cross-cultural research instruments is an enormous task, and it requires careful consideration from researchers to ensure that the instruments measure what they are designed to measure and that they can also capture cultural differences and similarities among the comparative groups. It is always challenging to develop an "etic" instrument that captures the shared meanings among the comparative cultural groups and an "emic" instrument that can measure the unique aspects of each cultural group (Smith, 2004). Developing and constructing cross-cultural research instruments must be a collaborative endeavor of the research team and the stakeholders. Input from cultural experts, prospective research respondents or clients, and service providers should be an integral part of every step or phase of cross-cultural measurement development and construction.

This chapter focuses on the foundation of measurement and the process of cross-cultural instrument development. The foundation of measurement involves fundamental elements such as concepts,


indicators, latent variables, reflective or effect indicators, causal or formative indicators, and theoretical framework. These terms will be illustrated by examples relevant to social work. Once the research concepts and indicators are identified and defined, researchers need to transform them into a research instrument such as a survey questionnaire or a standardized scale. In cross-​cultural research or evaluation, the selected concepts, indicators, and latent variables must be culturally relevant and appropriate among the target populations or communities.

FOUNDATION OF MEASUREMENT

The goal of a scientific social work theory is to explain the relationships among phenomena or constructs according to a set of facts, propositions, or principles. There are two aspects of a theory: one explains the relationships between theoretical constructs, such as self-esteem and depression; the other explains the relationship between a construct and its measures, such as depression and the items of a scale developed to measure depression. Researchers have no basis for explaining the relationship between self-esteem and depression if these variables are not defined and measured. However, self-esteem and depression are psychological states that cannot be observed and measured directly, as income, education, or age can. Researchers have to identify observable and measurable indicators to capture the dimensions and levels of self-esteem and depression before they can test or evaluate their relationships statistically. In measurement development, researchers focus on the relationships between the constructs and their respective observable indicators (Edwards & Bagozzi, 2000).

The goal of scientific social work inquiry is to investigate the relationships among variables, that is, whether they are correlational or causal. Many human behaviors, attitudes, social and psychological conditions, and phenomena cannot be observed directly. Therefore, researchers have to construct measures to quantify the abstract concepts of interest, such as client satisfaction or motivation. Bollen (1989) stated, "Measurement is the process by which a concept is linked to one or more latent variables, and these are linked to observed variables" (p. 180). For example, client satisfaction is an abstract concept, a psychological reaction to external observable conditions or situations (e.g., interventions, services).


In order to ascertain whether clients are satisfied, or how satisfied they are with the services or interventions they received from social workers or social service agencies, there must be a systematic way to determine such abstract psychological reactions. Researchers have to develop measurement instruments that carry numerical values, hierarchical order, or categories indicating whether clients are satisfied or not (a dichotomous measure); whether their satisfaction is at a low or high level on an ordinal scale from low to high; or where their satisfaction falls on a scale from 0 to 20 (an interval or ratio measure). Whatever level of measurement we use, we need a way to observe and record client satisfaction from the clients themselves. In the social sciences literature, such abstract variables or concepts are called latent variables. We can only measure latent variables through their observable indicators. In this case, researchers can measure the latent variable of client satisfaction with a set of questions (i.e., observable indicators) designed to capture all possible aspects and levels of client satisfaction. This is the process of developing observed indicators and linking them to their respective latent variables.

DEFINING AND MEASURING CONCEPTS

If social work researchers want to measure client satisfaction in a social service agency, the most feasible and practical way to assess this construct is to ask the clients directly. The researcher needs to define client satisfaction in clear and concise terms. For example, client satisfaction can be defined as the extent to which the services meet the clients' needs and expectations regarding the quality of the services. Defined this way, the construct has two aspects or dimensions: satisfaction with service needs and satisfaction with the quality of received services. In other words, the concept of client satisfaction has two latent variables. Latent variables are abstract, and to quantify them the researcher must develop indicators or observed variables that reflect or represent the meaning and levels of client satisfaction.

The diagram in Figure 6.1 illustrates the measurement process. The two circles represent the two abstract dimensions or components of client satisfaction, and each circle has four square boxes, named X1 to X8. The square boxes represent the observed indicators of the latent variables. The arrow connecting the two latent variables indicates the correlation or covariance between them. The arrows from the latent variables to the observed indicators indicate the causal relationships between the latent variables and their respective observed indicators. Each of these arrows can be understood as a regression coefficient of an observed indicator on its respective latent variable. The arrows under the observed indicator boxes indicate the unique variances or measurement errors.
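Written out, each arrow from a latent variable to an observed indicator corresponds to a simple measurement equation. A minimal sketch in conventional factor-analytic notation (the symbols λ and δ are standard usage, not taken from the figure itself):

x_j = λ_j ξ + δ_j

where x_j is an observed indicator (e.g., X1), ξ is its latent variable (e.g., need satisfaction), λ_j is the loading, that is, the regression coefficient of x_j on ξ, and δ_j is the unique variance or measurement error of x_j.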

[Figure 6.1 Reflective Measurement Model: the correlated latent variables Need Satisfaction and Service Satisfaction, each with four observed indicators.]

REFLECTIVE OR EFFECT INDICATORS

The measurement model proposed in Figure 6.1 illustrates the causal relationships between latent variables and observed indicators. In measurement theory, the squares are the reflective measures of the circles. Therefore, the abstract constructs are the causes of the observed indicators (Fornell & Bookstein, 1982; Edwards & Bagozzi, 2000; Markus & Borsboom, 2013). The observed indicators are called reflective measures or effect indicators. The causal relationship between a construct and its indicators or measures should meet all conditions of causality. First, the construct and its indicators are two distinct phenomena. Second, the construct and its indicators must covary. Third, the construct must exist before its indicators. Finally, there should be no rival explanation for the causal relationship between the construct and its indicators (Edwards & Bagozzi, 2000).

CAUSAL OR FORMATIVE INDICATORS

There are situations where the observed indicators or measures are causes of latent constructs. This type of indicator is called a formative measure or causal indicator (Blalock, 1971; Bollen & Lennox, 1991; MacCallum & Browne, 1993; Kyle & Jun, 2015). For example, socioeconomic status can be defined as a latent construct caused by observed indicators such as education, income, job, and neighborhood (Hauser, 1973). In social work we can define, among other things, quality of social work services as the effect of the availability of trained social workers, the accessibility of services, and caseload. The conventional rules of reliability and validity are not always applicable to the causal indicator measurement model (Bollen & Lennox, 1991). The measurement model in Figure 6.2 does not require that the causal indicators be correlated, as they are in a reflective indicator model (see Figure 6.1); accordingly, no curved lines specifying correlations are drawn among "Trained Social Worker," "Accessibility," and "Caseload." This type of causal indicator measurement model has not been used among social work researchers. Although causal indicators are theoretically meaningful, the methodology and technology to assess their measurement properties are not readily available to most researchers. Only effect indicator or reflective indicator measurement models are discussed in this book.

[Figure 6.2 Causal Indicator Measurement Model: the observed indicators Trained Social Worker, Accessibility, and Caseload as causes of the latent construct Quality of Services.]
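In equation form, a formative construct is a weighted function of its causal indicators plus a disturbance term, the reverse of the reflective equation given earlier. A minimal sketch in conventional notation (the weights γ and the disturbance ζ are standard symbols, not taken from the text):

η = γ1 x1 + γ2 x2 + γ3 x3 + ζ

where η is the latent construct (e.g., quality of services), x1, x2, and x3 are the causal indicators (trained social workers, accessibility, caseload), each γ is the weight of an indicator, and ζ is the disturbance. Note that the indicators now stand on the right-hand side of the equation: they determine the construct rather than reflect it.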

MEASUREMENT PROCESS

As illustrated in Figure 6.1, to measure client satisfaction the researcher needs first to define client satisfaction in terms of need satisfaction and service satisfaction. Then he/she must construct eight questions or items that measure the degree of client satisfaction in such a way that the levels of satisfaction can be quantified from low to high. Keep in mind that the definition of an effect or reflective indicator is different from that of a causal or formative indicator. Researchers should avoid including both effect indicators and causal indicators in one measurement model, because the mixture of the two can result in poor internal consistency or conceptual confusion. In theory, these eight observed variables (indicators or items) should be drawn from a finite pool of observed variables. In reality, however, it is difficult or impossible to come up with such a pool for every conceivable variable. The researcher needs to link the observed variables to their respective latent dimensions (latent variables) both conceptually and statistically. Conceptually, the observed items must capture the theoretical definition of the specific dimension and its overall latent variable. In this example, the four items that measure need satisfaction must reflect the meanings of need satisfaction, and the four items that measure service satisfaction must reflect service satisfaction. Statistically, the researcher must demonstrate that the four items of each dimension correlate among themselves.

Finally, a construct must be conceptualized within a theoretical framework that explains how it relates to other constructs under certain situations or circumstances. In the example of client satisfaction, the researcher might theorize that client satisfaction relates to the client's quality of life. In other words, clients seek social services to improve their quality of life; therefore, they would only say that they were satisfied with the services if these services really improved their life situations. Having a clear theoretical framework helps researchers conceptualize the measurement process in a more effective and meaningful way. It is always important to have a clear purpose when attempting to construct a scale or a measurement. A theoretical framework also helps the researcher to distinguish the constructs from each other, to avoid developing an indicator that shares similar meanings with more than one construct. For example, an indicator of client satisfaction must not also be an indicator of quality of life in the same situation.

Constructs can also be hierarchical. For example, need satisfaction and service satisfaction are two distinct constructs, but they represent a higher-order construct, client satisfaction. From the confirmatory factor analysis perspective, need satisfaction and service satisfaction are the first-order factors of client satisfaction (see Figures 6.3a and 6.3b). When a multidimensional scale is used as a composite scale, it is assumed that there is a second-order factor that captures all correlated first-order factors.

[Figure 6.3 (a) First Order Measurement Model: Need Satisfaction and Service Satisfaction as correlated first-order factors predicting Quality of Life. (b) Second Order Measurement Model: Client Satisfaction, a second-order factor measured by Need Satisfaction and Service Satisfaction, predicting Quality of Life.]


For example, the Center for Epidemiological Studies Depression Scale (Radloff, 1977) is conceptualized as multidimensional because it has been shown to have at least four factors. Researchers have used it as a composite scale by summing the scores of its 20 items to arrive at a composite score that ranges from 0 to 60. In theory, the overall score should represent a higher-order factor that captures the scores of the first-order factors of the scale. Finally, having a theoretical framework helps researchers choose appropriate statistical approaches for data analysis. The diagrams depicted in Figures 6.3a and 6.3b suggest that client satisfaction has a determining effect on quality of life. The path model in Figure 6.3a illustrates this relationship with client satisfaction measured by two correlated first-order factors, need satisfaction and service satisfaction, whereas in Figure 6.3b client satisfaction is measured by a second-order factor encompassing need satisfaction and service satisfaction. In both cases, improved quality of life is the expected outcome of client satisfaction. Clearly defined concepts will help researchers and evaluators develop meaningful and appropriate indicators for what they want to measure (Bollen, 1989). It is important for social work researchers to have a clear purpose in selecting or developing a research instrument. Social work researchers and evaluators can define a research concept from the existing literature or from their own clinical experiences. Clinical experience would probably provide more practical insights where previous research in cross-cultural settings is lacking.
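Returning to Figures 6.3a and 6.3b, the two models could be specified with STATA's sem command roughly as follows. This is a minimal sketch, assuming the eight satisfaction items are stored as variables x1 through x8 and quality of life is measured by three items qol1 through qol3; these variable names are illustrative, not taken from an actual dataset.

. * First-order model (Figure 6.3a): two correlated satisfaction factors predict quality of life
. sem (NeedSat -> x1 x2 x3 x4) (ServSat -> x5 x6 x7 x8) (QoL -> qol1 qol2 qol3) (QoL <- NeedSat ServSat)

. * Second-order model (Figure 6.3b): a higher-order client satisfaction factor predicts quality of life
. sem (NeedSat -> x1 x2 x3 x4) (ServSat -> x5 x6 x7 x8) (ClientSat -> NeedSat ServSat) (QoL -> qol1 qol2 qol3) (QoL <- ClientSat)

In the first-order model, sem correlates the exogenous latent variables NeedSat and ServSat by default, which corresponds to the curved line in Figure 6.3a.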

DEVELOPING MEASUREMENT INDICATORS

Once a research problem is identified, the researcher needs to define the problem in terms of variables and constructs. After the research variables or constructs are identified and agreed upon by the research team, measurable indicators for them must be developed. Ideally, the indicators would be randomly drawn from a pool of all possible items, but in reality there is no such exhaustive pool. Researchers and evaluators will have to work with what they have, provided a reasonable attempt has been made to identify the items.


In cross-cultural research and evaluation, researchers have to select indicators or observed variables that have similar meanings across cultures. A combination of cultural expert meetings, vignette probing, and focus groups can be used to identify and construct cross-cultural instrument indicators or questions.

CROSS-CULTURAL EXPERT GROUP MEETING

Cross-cultural expert meetings should be held to develop indicators or questions for a cross-cultural research instrument. The purposes of the expert meetings are to generate as many indicators as possible, to identify the emic and etic indicators, and to develop vignettes reflecting the research aims for in-depth interviews and key questions for focus groups. Each cultural group participating in the cross-cultural study should have at least two cultural experts, one female and one male, because research has suggested that men and women have different patterns of communication (Costa, 1994; Cameron, 1988). Expert members should have a good understanding of the meanings and purposes of the research aims. The experts can meet through a designated Internet "chat room," Internet video conferences, and a face-to-face meeting. Chat room discussion and video conferencing should be available before and after the face-to-face meeting: members can generate as many items for the instrument as possible before the face-to-face meeting, and the results of that meeting can be discussed again over the chat room or video conference. The results from the Internet chat room, video conferences, and face-to-face meetings are combined and evaluated by the research team. All information should be archived for future evaluation and reference.

VIGNETTE INTERVIEWS

The research team works with the cultural expert group to develop vignettes that stimulate participants to share experiences and insights, which in turn help the research team identify relevant measurement indicators of the key research constructs from the prospective respondents. Vignette interviews help the researchers to (1) understand how people from different cultural populations interpret the meanings of the research questions; (2) generate culturally relevant measurement indicators for the research concepts or variables; and (3) choose the right and meaningful research questionnaire (see Martin, 2004). For example, if the aim of the cross-cultural research project is to compare the role of social support as a moderating factor of the negative impacts of natural disasters on psychological well-being across cultures, then the vignettes should present a hypothetical or real disaster situation that can trigger respondents' thinking about how they would cope with such a situation. The interviewer can use the vignettes to probe for the ways that respondents would react to the disaster. Through vignette probing, researchers can identify indicators of social support and psychological well-being from respondents across cultural groups. Vignette probing should be audio-recorded, transcribed, and edited for comparison and triangulation with data collected from the focus and expert groups. Vignette interviews should follow these principles:

1. Vignettes should be related to the research aims and hypotheses.
2. The same vignette should be presented to members of different cultural groups.
3. At least 12 vignette interviews (6 with females and 6 with males) should be conducted for each cultural group. This sample size is recommended for qualitative research (Fayers & Machin, 2007).
4. Respondents should be encouraged through probing to give as many interpretations of the vignette as possible.
5. Respondents should be asked to offer their own examples in explaining their understanding of the vignettes. For example, when asking how they would react to a disaster, interviewers should also probe the respondents to provide concrete examples from their own experiences, imagination, or observations of people they know.


The success of vignette interviews, like that of other types of interviews, is determined by the quality of the interviewers and the participants or respondents. Information collected from vignette interviews is combined, transcribed, translated, and edited. The research team then compares the results from the expert group and the vignette interviews to develop a preliminary list of items, which will in turn be compared with the items generated from focus group meetings.

CROSS-CULTURAL FOCUS GROUP MEETING

Focus group methodology has been well developed and employed in different research settings (Cote-Arsenault & Morrison-Beedy, 1999; Stewart & Shamdasani, 1990; Krueger, 1994). Well-planned focus groups can help researchers gain insight into a cultural group; develop meaningful research agendas, questions, and hypotheses; generate measurable cross-cultural indicators for research constructs or variables; and evaluate the appropriateness of research instruments. Each cultural group participating in the cross-cultural study should have at least four to six members, with equal representation of both sexes, take part in a global (all-groups) cross-cultural focus meeting to identify indicators for the research instrument or questionnaire. If possible, focus groups within each cultural group can meet prior to the global cross-cultural focus group. Within each cultural, ethnic, linguistic, or racial group, the team should conduct three focus groups: two sex-specific groups, one of females (3–5 participants) and one of males (3–5 participants), and one combined-sex group (6–12 participants, with equal numbers of women and men). The author's experience with focus group meetings in the Vietnamese population suggests that sex-specific focus groups provide an easygoing atmosphere in which participants can reveal their thoughts and share their personal experiences. Each cultural focus group can work independently to develop as many indicators as possible. However, all groups should agree on the meanings of the research aims and the definitions of key research constructs before group members work on the construction of indicators. Transcripts from these meetings should be translated into a common language when the groups are from different linguistic populations.

DATA SYNTHESIS

Once the research team compiles the data from the expert meetings, vignette interviews, and focus groups, it needs to synthesize the data to produce a comprehensive list of meaningful items that are both relevant to the research aims and culturally appropriate for all participating cultural groups. The next step is to put together a preliminary research questionnaire or instrument for further evaluation.

INSTRUMENT FORM

Once there is a list of acceptable and meaningful indicators for the research constructs, the team has to decide on the appropriate structure and form for these indicators (Schaeffer & Presser, 2003). The most convenient and practical way of measuring human behaviors, attitudes, mental status, opinions, and other aspects of life is by asking questions. As researchers attempt to measure human behaviors for theory building, theory testing, policymaking, or the evaluation of treatment efficacy, they must ask the right questions to get the right data for their answers. Fortunately, questionnaire development has become more scientific, with proven, effective methods for asking questions. Schaeffer and Presser (2003) have noted that survey researchers often focus on two types of questions: "questions about events or behaviors and questions that ask for evaluations or attitudes" (p. 65).

The research indicators or variables can be framed as "open" or "closed" questions, and each type has advantages and disadvantages. Open questioning allows the researchers to collect more in-depth and current information from the respondents but is less efficient in terms of data management and analysis; open questions have the "ability to capture answers unanticipated by questionnaire designers" (Martin, 2006, p. 6). Closed questions are more efficient but can miss important information when the response categories are not well defined. Schwartz and Oyserman (2001) offered a good review of the types of questionnaire response formats and their advantages and disadvantages. There is no perfect response format for any research instrument; the researchers will have to choose those that are feasible and meaningful for different cross-cultural settings.

Survey methodologists have suggested several standardized techniques for questionnaire development (Converse & Presser, 1986; Schaeffer & Presser, 2003). An effective survey instrument must satisfy some basic conditions, as suggested by Fowler and Cannell (1996). The following are six criteria for good survey questionnaire design:

1. The questions must be culturally relevant to respondents of diverse cultural backgrounds.
2. Respondents must understand the intention of the questions.
3. Researchers must avoid asking questions for information that is not available to the respondents.
4. Respondents must be able to recall or retrieve information relevant to the questions.
5. Respondents must be able to translate the information they have into the standardized forms of the questions.
6. Respondents' answers to the questions must be truthful and accurate.

Although these six basic principles of instrument development appear to be simple common sense, researchers should remember that they do not offer a perfect solution for instrument development. They are useful criteria and should be adopted as a guide for research instrument construction. For respondents to understand the true meanings of questions, items, or statements, these must be written in clear and precise language. The language of a research instrument must be free from sex, age, educational, racial, and religious biases. Researchers should avoid questions that require respondents to search for information that is not readily available or that they are not accustomed to remembering. For example, in a society where people do not celebrate their birthdays (e.g., rural villages in Vietnam), it will be very difficult for researchers to ask the respondents to provide information about their parents' or siblings' birthdays. If this is an important question for the research project, then respondents should be informed in advance or allowed adequate time to retrieve the information.

Standardized formats for research instrument questions can be found in most basic textbooks on research methodology (DeVellis, 1991; Fayers & Machin, 2007). Members of the research team, experts, and prospective participants should agree on which standardized question or scale formats are suitable for the research project. Both the forms and the contents of the questions and answers must be easily understood by and familiar to the respondents for them to provide appropriate responses.

CROSS-CULTURAL EVALUATION

Once a pool of items is developed and compiled into a preliminary questionnaire or research instrument, it is subjected to a series of cross-cultural evaluations. Four approaches to cross-cultural evaluation are described here: cultural expert evaluation, cognitive interviews, focus groups, and pilot testing.

CULTURAL EXPERT EVALUATION

The purpose of cultural expert evaluation is to seek consensus among the cultural experts representing the selected cultural or racial/ethnic groups on the cultural validity and equivalence of the newly developed instrument. Expert appraisal has been found to be an effective means of survey questionnaire evaluation (Lessler & Forsyth, 1996; Forsyth, Levin, & Fisher, 1999). Researchers can use the same expert group that worked on the development of the instrument items or recruit a new group to evaluate the newly developed instrument. This evaluation can be performed through Internet chat rooms, video conferences, e-mail, and a face-to-face meeting. If possible, a face-to-face meeting should be arranged for all participating cultural experts to evaluate the contents and format of the newly developed instrument. Members of this cultural expert group should be professionals from the target populations and should have appropriate training and clinical experience concerning the problem under investigation. The experts are asked to evaluate at least four aspects of each research item in the instrument: concept, clarity, appropriateness, and difficulty. The evaluation matrix presented in Table 6.1 can be used as part of the expert evaluation. It can enhance the data collected from expert focus group meetings or discussions via other forms of electronic communication. The matrix can also be used with cognitive interviews and focus groups as a means of collecting additional standardized data for the overall evaluation of the instrument.

Concept. Committee members evaluate the conceptual linkage between the suggested items and their respective research ideas or variables. The key issue is whether the suggested items reflect the meanings of the concepts that they are supposed to represent.

Table 6.1 Cross-Cultural Evaluation Matrix of New Measurement Items

                       Member 1             Member 2             Member 3
                       Yes / No (Explain)   Yes / No (Explain)   Yes / No (Explain)
Item 1
  Meaningful Concept
  Clarity
  Appropriateness
  Difficulty
Item 2
  Meaningful Concept
  Clarity
  Appropriateness
  Difficulty
Item 3
  Meaningful Concept
  Clarity
  Appropriateness
  Difficulty
Item k
  Meaningful Concept
  Clarity
  Appropriateness
  Difficulty


Clarity. Committee members evaluate the use of words and the syntax of the item. Can respondents from different cultural backgrounds understand the item or question in a similar manner? Is the item written in a manner that is free from jargon and parochial idioms?

Appropriateness. Committee members determine whether the suggested items are culturally appropriate in each cultural group. Does the item or question require information that is taboo in some cultures, or can it induce embarrassment for the participants?

Difficulty. Committee members determine whether the suggested items are difficult for prospective respondents or participants to understand and respond to. Does the item require respondents to spend a lot of time thinking or gathering information?

COGNITIVE INTERVIEWS

Researchers have suggested that cognitive interviews be used to evaluate the research questionnaire (Rothgeb, Willis, & Forsyth, 2005). Cognitive interviewing is designed to identify potential sources of measurement error in research instruments or questionnaires (O'Brien, Fisher, Goldenberg, & Rosen, 2001). This method can be used to (1) appraise respondents' comprehension of each question or item in a research instrument; (2) judge the consistency of respondents' comprehension; and (3) evaluate the congruency between respondents' comprehension and the researchers' intention (Collins, 2003).

Self-reporting questionnaires and data collection instruments are "highly context dependent" (Schwartz & Oyserman, 2001). In cross-cultural research settings, self-reporting data collection instruments are also highly culture- and language-dependent. This means that the way researchers construct their data collection instruments will influence the data they obtain. The ultimate goal of a research instrument is to collect the correct data for the research purposes. Unfortunately, previous research has revealed that respondents or participants often do not provide the quality of data the researchers intended to gather through their instruments (Schwartz & Oyserman, 2001). It is not a simple task to construct a question or a measurement item that carries the same meaning for all people, and this problem becomes even greater in cross-cultural research. Therefore, the evaluation of the research instruments prior to data collection is crucially important for the success of the research project. One of the best ways to assess the quality of a research instrument is to ask the respondents directly how they understand the meanings of the instrument's items and whether their comprehension truly reflects the researchers' intentions.

Cognitive interviewing has been used as a "pretesting" approach to detect potential errors in cross-cultural questionnaire development (Agans, Deeb-Sossa, & Kalsbeek, 2006). The principles of a cognitive interview are based on the four-stage cognitive process of respondents or participants. This process encompasses the respondents' ability to (1) understand the information presented to them; (2) retrieve relevant information; (3) evaluate the information; and (4) communicate the information (O'Brien, Fisher, Goldenberg, & Rosen, 2001). When respondents are presented with the items of the questionnaire, they are expected to be able to understand the meanings of the items, to retrieve any relevant information or knowledge related to the items, to judge or evaluate the content of the items, and to communicate their attitudes, thoughts, or feelings concerning the meaning, relevancy, and appropriateness of the items by choosing among the answers presented to them in the closed questions (e.g., very happy, somewhat happy, or not happy).

There are two approaches to cognitive interviewing, the think-aloud approach and verbal probing, and each has advantages and disadvantages (Willis, 1999). Researchers can use both approaches simultaneously. Cognitive interviews can be conducted in a laboratory or in the field; for a cross-cultural research project, it is suggested that they be conducted in the field to reflect the "real-life" situations of the participants. The think-aloud approach is straightforward: the interviewer reads the item or question from the questionnaire and asks the participant to say what he/she is thinking about. The verbal probing approach requires the interviewer to ask additional, related questions to extract detailed information from the participant on a specific item or question. The preparation for a cognitive interview includes the following steps:

1. Recruiting and training interviewers. Although cognitive interviewing requires minimal interviewer training, the research team should recruit interviewers who are culturally


and linguistically competent to carry out successful interviews. Interviewers should practice reading the items or questions of the questionnaire several times before engaging in the interviews. Interviewers and participants should be of the same sex; this creates a more "comfortable" zone of interaction between interviewers and interviewees. Interviewers must be trained to have a good understanding of the purpose of the research and the meanings of the research questionnaire and its items, as well as the skills to probe for in-depth answers and to encourage the participants to reveal their thoughts and feelings about each item of the questionnaire. Interviewers should frame the interview questions around four aspects: concept, clarity, appropriateness, and difficulty.

2. Recruiting and training participants. Six to 12 cognitive interviews should be conducted for each cultural group. An equal number of female and male respondents must be the rule in all evaluation processes. Before an interview, participants should be trained to be familiar with the interviewing process; for example, they should listen to each question carefully and communicate their thoughts about the questions freely. Participants should also be prepared to respond willingly to the additional probing.

3. Having the right equipment for data collection. Digital recording technology is an ideal means of recording verbal data, as it allows the researchers to conduct longer interviews (several hours of interviewing without interruption), transfer interview data to a PC, and save interview data in different folders.

To improve the quality of the data, all participants should have the full instrument, in their own language, to review at least one week prior to the interview. The more familiar they are with the instrument, the more thoughtful an assessment of it they can provide. Traditional qualitative interview techniques can be used for cognitive interviews (Berg, 2004). Interviewers must be fluent in the language of the participants and should also have adequate knowledge of the participants' culture.


In addition to the in-depth information from cognitive interviewing, the evaluation matrix presented in Table 6.1 can be used to collect additional standardized, quantitative data that complement the qualitative data of the cognitive interviews.
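If the matrix responses are coded numerically (e.g., 1 = yes, 0 = no), simple tabulations can quantify the level of agreement among raters. A minimal sketch in STATA, assuming hypothetical variables clarity1 through clarity3 that hold three members' clarity ratings for a given item (these variable names are invented for illustration):

. * Count how many of the three raters judged the item to be clear
. egen clarity_agree = rowtotal(clarity1-clarity3)

. * Show the distribution of agreement (0 = none agree, 3 = all agree)
. tabulate clarity_agree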

FOCUS GROUP

Focus groups are used to assess the cultural validity of the pool of possible indicators generated by the experts and by participants in the vignette interviews. The procedures and techniques of focus group methods have been well discussed in the literature (Morgan, 1988). The research team should list all key research concepts and their respective indicators and make them available to all invited focus group participants at least one week prior to the meeting. There should be two types of focus group: within-group and between-group (global) focus groups. Suppose a social work researcher wishes to study the prevalence of depression in Latino and Asian-American communities. These two broad ethnic populations encompass several subgroups, each with its own culture, values, and language. It would be imprudent to assume that these subgroups have a similar cultural background and that depression bears the same meaning and manifestation among them. The researcher can approach this problem by creating a multicultural advisory group that serves as a means to screen and synthesize cultural similarities among the selected groups. The research team should conduct at least three focus groups for each ethnic group: two sex-specific groups and a third that combines both sexes. Considering sex differences in the process of measurement development is an effective way to avoid sex biases in both measurement and outcomes.

DATA SYNTHESIS

Evaluation data from the expert groups, cognitive interviews, and focus groups will be synthesized to provide a comprehensive and in-depth assessment of the preliminary research questionnaire or instrument.


This task can take several days because of the labor-intensive demands of interview transcription, translation, and data analysis. Once the research team agrees on the final draft of the questionnaire, it is ready to be tested in the field.

PILOT TESTING

A pilot structured interview, administered as a purposive survey, is conducted to test the cross-cultural research questionnaire developed from the expert meetings, cognitive interviews, and focus groups. If possible, telephone interviews, face-to-face interviews, and mail surveys can be used simultaneously to improve the validity of the questionnaire: a reliable and valid questionnaire should produce similar data regardless of the data collection method. The sample size of the pilot survey is determined by the number of items in the selected scale or instrument. From the factor analysis perspective, each item of a scale requires a minimum of five subjects. A 20-item scale would therefore require a sample of at least 100 subjects from each participating group for reliability and validity evaluation via Cronbach's α analysis and factor analysis. Random sampling is always a desirable method of drawing a sample, but it is often not feasible. Therefore, the research team should make every effort to recruit individuals from diverse backgrounds within the target population to participate in the pilot test. With the data from the pilot survey, researchers can use different statistical approaches to assess the cross-cultural validity and reliability of the research instruments, as discussed in Chapter 4 and Chapter 5. Due to the nature of the dataset used to illustrate the process of instrument development in the following section, we will only demonstrate the application of some basic statistical techniques to evaluate the validity and reliability of a newly developed empathy scale.
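For a pilot dataset of this kind, the basic reliability and factor checks can be run in STATA along the following lines. This is a minimal sketch, assuming the scale items are stored as variables q1 through q20; the variable names and dataset are illustrative rather than the book's.

. * Internal consistency (Cronbach's alpha), with statistics for each item
. alpha q1-q20, item

. * Exploratory factor analysis using iterated principal factors
. factor q1-q20, ipf

. * Oblique promax rotation, blanking loadings below .30 for readability
. rotate, promax blanks(.3)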

APPLICATION OF CROSS-CULTURAL INSTRUMENT DEVELOPMENT: MEASURING EMPATHY

In the last three years, the PhD students at Boston College School of Social Work who took a class on cross-cultural measurement with the first author of this book were required to participate in a semester-long exercise to conceptualize and develop a scale to measure empathy that can be used across cultural and linguistic populations. Although this example did not strictly follow the process presented in this chapter, we were able to implement all the important procedures of cross-cultural instrument development. Fortunately, the students came from different cultural and linguistic backgrounds, including Chinese, English, Korean, Iranian, Spanish, and Turkish. We chose empathy because it appears to be an important construct in the social work profession (King, 2011). It seems logical to think that individuals who choose a career or profession in social work have a unique worldview and perhaps a natural ability to be empathetic with others. The students were asked to review the literature to find a meaningful definition of the construct empathy. Due to the constraints of the classroom environment and time limitations, the students were instructed to follow six major phases:

1. Defining empathy
2. Evaluating the definition of empathy
3. Developing measurement indicators/items for empathy
4. Developing an online pilot survey
5. Collecting data
6. Evaluating the validity and reliability of the empathy scale

Phase 1: Defining Empathy

The initial step was to generate a definition of social work empathy. Students were asked to review the literature to become familiar with the concept of empathy (Gerdes & Segal, 2009). After a brief review of the literature, they were assigned to groups based on sex, race/ethnicity, and nationality. Each student was asked to define empathy independently and then present the definition to the whole group for discussion. We also posted the definitions of empathy on the Canvas platform, a digital learning management system through which faculty and students can communicate and share information with the whole class or person to person (https://www.canvaslms.com). The students discussed the definition of empathy online and in class. Following are two examples of the definition of empathy generated by the students.


1. "The ability to connect with and understand another's experience and emotions through self-reflective capacity and recognition of shared humanity."
2. "The ability to understand and relate to another's experience and feelings from their perspective."

Because this was a group exercise, we were able to use a focus group method to guide the students' discussion and cognitive interview techniques to stimulate spontaneous evaluation of a definition or idea presented to the group.

Phase 2: Evaluating the Definition of Empathy

The students were required to conduct in-depth interviews with three to five adults aged 18 and older, on or off campus, to further refine the definition of empathy. They were instructed to ask the participants to help them define empathy based on the definition generated from the class exercise. Both students and participants were aware of the protection of confidentiality, and during the class discussion no participant's name was mentioned. The in-depth interviews helped the students to develop measurement indicators or items of empathy. Students summarized their interviews and shared them with classmates to arrive at the following definition of empathy: a person's ability to perceive, connect with, and understand another's experience and emotions through conscious self-awareness and recognition of shared humanity, and to appropriately respond emotionally, cognitively, and/or behaviorally.

Phase 3: Developing Measurement Indicators/Items for Empathy

Based on the final definition, each student was asked to develop as many items or indicators as possible that would capture the meaning of empathy as defined by the class. They posted their items online to share with their classmates, and a combined list of all the items was presented to the whole class. With the items shown on the projector screen, students took turns reading the items aloud, and the whole class voted to select or omit each item from the list. The selected items were kept in their original format or reworded. During the selection process, the instructor asked the students to determine whether the items could be potentially biased against a sex, race, language, or culture. Students were asked to evaluate each item for language clarity, in both English and their original languages; the appropriateness of the item from their own cultural and linguistic background; the difficulty of the language used in the item; and the relevance of the item for measuring empathy across cultures and languages. Students who spoke other languages were asked to confirm that the meaning of the items was acceptable and understandable in their first language and culture. All items of the empathy scale were in Likert format, with higher scores referring to higher levels of empathy. We name this scale the Boston College Social Work Student Empathy Scale (BCSWSE) to acknowledge all the students who took SW953 at Boston College School of Social Work and participated in this scale development exercise.

BCSWSE SCALE: VERSION 1

Empathy refers to a person's ability to perceive, connect with, and understand another's experience and emotions through conscious self-awareness and recognition of shared humanity and to appropriately respond emotionally, cognitively, and/or behaviorally.

Instruction: Choose the option that best reflects your answer.

1 = Never, 2 = Sometimes, 3 = Often, 4 = Always

*We list the variable names and numbers as they appear in the dataset and the statistical results.

Affective Empathy (Factor 4)
q7. It upsets me to see someone being treated disrespectfully.
q8. I am affected when I see someone in pain or upset.
q9. When I see people succeed, I share their happiness.
q10. I am happy when I see people's lives improve.

Cognitive Empathy (Factor 1)
q11. I am aware of my emotions.
q12. I notice when my friends are scared.
q13. I know when someone is upset.
q14. I notice when someone is having a good day.
q15. I notice when someone is interested in what I am saying.


Sympathetic Empathy (Factor 2)
q16. I imagine things from others' perspective.
q17. I try to understand what others are feeling.
q18. When I see someone being treated unfairly, I think of ways I could help. **(Active Empathy)
q19. I am able to understand others' experiences.
q20. My experiences help me to understand others.
q21. People say I am a good listener. (Omitted)

Active Empathy (Factor 3)
q22. I take action to help those who are experiencing hardship (e.g., sign a petition, volunteer, or donate).
q23. When someone is lost and needs directions, I offer help.
q24. When I hear people make offensive jokes, I speak up even if they are not referring to me.
q25. When someone is in distress, I offer help.

Phase 4: Developing an Online Pilot Survey

With the final list of empathy items, the students designed an online survey to collect pilot data for evaluating the reliability and validity of the empathy scale. A few online survey software programs can be used for data collection; the class used Qualtrics because Boston College supports this software. Qualtrics is a Web-based survey tool that can be used to collect data worldwide (Qualtrics, 2014). The students used e-mail and social media connections to recruit participants for the class exercise. Recently, social scientists have been using Amazon's Mechanical Turk (mTurk) platform to collect data; researchers can design the survey questionnaire via Qualtrics and administer the survey using the mTurk platform (Francis-Tan & Mialon, 2015; Mason & Suri, 2012). Other online mechanisms, such as REDCap™ (Research Electronic Data Capture; Harris et al., 2009), can also be used to collect cross-cultural research data. REDCap™ is a secure Web-based application for building and managing online surveys and databases. It is an ideal tool for complex research studies that require data collection across multiple time points and/or multiple sites. The REDCap™ system was developed by a multi-institutional consortium initiated at Vanderbilt University. Currently, the consortium is composed of 1,792 institutional partners in 98 countries. There is no cost associated with participating in the consortium, and most institutions do not charge their end users to use the application; however, depending on its resources, an institution may charge a small fee to provide end users with on-site technical assistance. Several key features make REDCap™ an attractive tool:

1. Security. Data can be entered anywhere in the world with secure Web authentication, data logging, and SSL encryption. REDCap™ can be installed to meet compliance standards such as HIPAA, 21 CFR Part 11, and FISMA (low, moderate, and high).
2. Multisite access. Data can be entered by researchers from multiple sites and institutions.
3. Electronic data entry allows for easy audit trails.
4. Automatic follow-up data collection reminders can be activated to support project management.
5. Databases and surveys can be quickly developed and customized for a study's needs, with advanced question features such as autovalidation, branching logic, and stop actions.
6. Surveys can be saved as PDFs and printed to collect responses offline; Web or e-mail links to the surveys can also be generated.
7. Data comparison functions allow for double data entry and/or blinded data entry.
8. Data can be easily exported into a number of common data analysis packages, including Microsoft Excel, STATA, SAS, SPSS, and R, with automatic data dictionary generation.
9. Currently, the application is available in English, Chinese, French, German, and Portuguese.

Although REDCap™ is available without charge, its software is not open source. Before an institution can join the REDCap™ consortium, a license agreement between the institution and Vanderbilt University must be obtained. Readers whose institutions do not currently belong to the consortium should contact their institutional research office for additional guidance. To learn more, visit http://project-redcap.org/.


Our group did not use mTurk or REDCap™, but we would encourage future social work researchers to explore these online data collection platforms for cross-cultural research. The pilot survey included basic demographic variables such as age, sex, and education, along with seven items from a prosociality measure used to cross-validate the findings from the social work empathy scale (Wilson, O'Brien, & Sesma, 2009). The prosociality scale was designed to measure human altruistic behavior, and the content of the selected items suggests that it should correlate well with the BCSWSE scale. Wilson and colleagues define and explain prosociality as follows:

"We prefer the term 'prosocial' to 'altruistic' because it focuses on other- and society-oriented behaviors while remaining agnostic about the degree of individual self-sacrifice that might be involved. Thus, an individual who routinely does favors for others or who agrees with the survey item 'I am helping to make my community a better place' qualifies as prosocial, regardless of the degree of self-sacrifice involved." (Wilson, O'Brien, & Sesma, 2009, p. 192)

The selected items of the prosociality scale are (1) I am sensitive to the needs and feelings of others; (2) I think it is important to help other people; (3) I am trying to help solve social problems; (4) I am developing respect for other people; (5) I am encouraged to help others; (6) I tell the truth even when it is not easy; and (7) I am helping to make my community a better place. The survey was designed for the participants to complete within a short period of time (10 to 20 minutes).

Phase 5: Collecting Data

Because this was a class project, the data were collected via a combination of convenience sampling techniques. The students were asked to distribute the empathy questionnaire online using e-mail, snowballing the survey Web link to friends and acquaintances and requesting that they forward the link to their own networks. Other social media platforms (e.g., personal and group Facebook pages, Twitter) and professional social media (e.g., LinkedIn) were also used. The students were asked to distribute the survey to individuals in the United States and in their countries of origin, such as China, Turkey, and Mexico. The survey was available online for more than 10 weeks.
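One simple way to carry out the cross-validation described above is to build composite scores and correlate them. A minimal sketch in STATA, assuming the empathy items are stored as q7 through q25 and the seven prosociality items as p1 through p7 (the p* names are invented for illustration):

. * Composite scores as row means of the items
. egen empathy = rowmean(q7-q25)
. egen prosocial = rowmean(p1-p7)

. * Correlation between the two composites, with significance level and N
. pwcorr empathy prosocial, sig obs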


Phase 6: Evaluating Validity and Reliability of the Empathy Scale

After a series of descriptive analyses to screen the quality of the survey and the distributions of the empathy items, we performed exploratory factor analysis using STATA 14 to identify the dimensions of empathy and to establish the factorial structure, or factorial validity, of the newly developed measure.

EFA Analysis

We began our analysis with no specification of the number of factors underlying the measure of empathy. The initial results were not clear. We then performed a 2-factor solution analysis, a 3-factor solution analysis, and finally a 4-factor solution analysis. The results favored the 4-factor solution, as follows:

. factor q7-q25, ipf factor(4)
(obs=518)

Factor analysis/correlation                  Number of obs    =  518
Method: iterated principal factors           Retained factors =    4
Rotation: (unrotated)                        Number of params =   70

Factor      Eigenvalue   Difference   Proportion   Cumulative
Factor1     5.82638      4.83918      0.7098       0.7098
Factor2     0.98719      0.16999      0.1203       0.8301
Factor3     0.81721      0.23990      0.0996       0.9297

We report only a few lines from the results here to illustrate the STATA syntax and the initial results of our analysis. Next, we performed the promax rotation, an oblique rotation, with the Horst option to produce standardized estimates for our factor loadings. We also used a factor loading cutoff of .30 to produce clearer results: with the blanks(.3) option, STATA leaves blank all factor loadings less than .30, so an item that loads below .30 on every factor shows no loading at all. In this analysis, item q23, "When someone is lost and needs directions, I offer help," was omitted from the final results due to its poor loading on all four factors. We named the four factors Affective Empathy, Cognitive Empathy, Sympathetic Empathy, and Active Empathy.


. rotate, promax horst blanks(.3)

Factor analysis/correlation                  Number of obs    =  518
Method: iterated principal factors           Retained factors =    4
Rotation: oblique promax (Kaiser on)         Number of params =   70
                                             Rotated factors are correlated

Factor      Variance   Proportion
Factor1     4.20432    0.5122
Factor2     4.17812    0.5090
Factor3     4.10423    0.5000
Factor4     3.63298    0.4426

LR test: independent vs. saturated: chi2(171) = 3218.31 Prob>chi2 = 0.0000

Rotated factor loadings (pattern matrix) and unique variances

[The rotated loadings table for variables q7 through q25 is truncated here. The recovered Factor1 loadings are 0.4579, 0.6828, 0.7506, 0.6306, and 0.5393; the recovered Factor2 loadings are 0.6093, 0.6198, 0.6884, and 0.4901. Blanks represent abs(loading) < .3.]