MORE ABOUT MEASUREMENT

 

Subtopics

Introduction
Validity and Reliability
Exercises
For Further Study
Introduction

 

Data analysis is only as good as the data themselves.  Great care needs to be taken to use operational definitions that are valid and reliable measures of concepts.  In this topic, we will explore what is meant by validity and reliability, and then describe several techniques that can improve (or, if misused, weaken) our measurements.


Validity and Reliability

A measure is valid if it actually measures the concept we are attempting to measure.  It is reliable if it consistently produces the same result.  A measure can be reliable without being valid (if we are consistently getting the wrong result).   It cannot, however, be valid if it isn't reliable.  (If our measure is inconsistent, it won't produce a valid result, at least not on a regular basis.)

In a famous study published in 1955, Samuel Stouffer attempted to measure the degree of tolerance of his respondents by asking them a series of questions, such as whether they would be willing to let communists give public speeches in their city or teach at a college or university, or to have the local public library carry books communists had authored.  Similar questions were asked with atheists and socialists substituted for communists.[1]  Stouffer in effect assumed that respondents would oppose communism, atheism, and socialism, and that willingness to put up with people from these groups would be a valid and reliable way to measure tolerance.  Subsequent research showed growing levels of tolerance, measured in this way, between the 1950s and the 1970s.

A different approach was developed by John Sullivan et al.[2]   These researchers pointed out that, if tolerance means the ability to put up with someone you do not like or with whom you disagree, then these measures are valid only for someone unsympathetic toward communists, atheists, or socialists (all groups generally seen as on the left).  They would also be unreliable over time, as different groups fell into or out of favor with the public.   Perhaps Americans had not become more tolerant, but rather less opposed to the left.  To test this possibility, respondents were given a list of ten groups, including those on both the left and right, and asked to pick the two they most disliked.  They were also encouraged to name other groups not on the list.   They were then asked about their willingness to have members of these groups teach at the local college, etc.  The researchers reasoned that, asked in this way, the questions would be more valid and reliable because they would not rely on any, perhaps incorrect, assumptions about a respondent's attitude toward any particular group, but instead would measure a respondent's tolerance of whatever groups the respondent was unsympathetic toward.  Measured this way, little change was found in tolerance over time.

Questions similar to Stouffer’s are still used in, for example, the General Social Survey.  However, in an attempt to correct for the problem just described, the General Social Survey also asks questions about respondents' tolerance for groups on both the left and the right.

Sometimes apparently similar measures produce inconsistent, and hence unreliable, results.  In 2003, California held a special election to consider recalling its governor, Gray Davis.  While most polls taken in the weeks leading up to the election showed the recall effort ahead by a substantial margin, the Los Angeles Times poll indicated that the race was very tight.  The reasons why the Times poll obtained such different results were hotly debated at the time.[3]  In fact, there are any number of reasons why polls may be unreliable.  Among other things, results may be influenced by the precise wording and ordering of the questions asked, by corrections (called "weighting") for known over- or under-representation of some segments of the population, by estimates of who is likely to vote, and even by whether the poll is conducted during the week or over a weekend.  (In the end, Davis was recalled by a margin of 55 percent to 45 percent.)
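To make the idea of weighting concrete, here is a minimal sketch in Python (using the pandas library) of how a sample with a known gender imbalance might be re-weighted.  Every number in it, including the population shares and the responses, is invented purely for illustration.

    import pandas as pd

    # Invented poll of 1,000 respondents in which women (assumed to be
    # 52 percent of the population) make up only 40 percent of the sample.
    sample = pd.DataFrame({
        "gender": ["F"] * 400 + ["M"] * 600,
        # 1 = would vote to recall, 0 = would not (fabricated responses).
        "recall_yes": [1] * 220 + [0] * 180 + [1] * 360 + [0] * 240,
    })

    population_share = {"F": 0.52, "M": 0.48}  # assumed known from census data
    sample_share = sample["gender"].value_counts(normalize=True)

    # Each respondent's weight is the ratio of population share to sample
    # share, so members of underrepresented groups count for more.
    sample["weight"] = (sample["gender"].map(population_share)
                        / sample["gender"].map(sample_share))

    unweighted = sample["recall_yes"].mean()
    weighted = (sample["recall_yes"] * sample["weight"]).sum() / sample["weight"].sum()
    print(f"Unweighted: {unweighted:.1%}  Weighted: {weighted:.1%}")

Because equally defensible weighting schemes can produce noticeably different estimates, two polls of the same race can reach different conclusions from quite similar raw data.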

A study conducted in a number of countries sought to compare differences in attitudes toward the role of government.  Respondents were asked questions such as whether they agreed that it was “the responsibility of the state to take care of very poor people who can’t take care of themselves.”  Researchers found that, in the United States, they had to substitute the word “government” for “state,” since in the U.S., “state” applies specifically to subnational governments within the country’s federal system, whereas in parliamentary systems such as Great Britain, the “government” refers to the majority party in parliament (or, very roughly, what in the U.S. is called the “administration”).[4]

So far we have been discussing what is sometimes called “internal” validity.  “External” validity, on the other hand, tests the validity of a measure by comparing its results with those of some other measure thought to tap into the same concept.  For example, the following two questions (from the Student Survey) appear on their face to be measuring more or less the same thing:

Iraq: do you support or oppose the United States having gone to war with Iraq?

war: In today’s world, it is sometimes necessary to attack potentially hostile countries, rather than waiting until we are attacked to respond.

Examine the following crosstabulation (since we aren’t testing for any causal relationship between these two variables, it doesn’t matter which we treat as the independent variable, and no percentages have been calculated).  Note that most of those who supported preemptive war also supported the war in Iraq (and vice versa), while most who opposed the one opposed the other as well.  There were, however, a fair number of exceptions in both directions.  In short, the two variables are measuring similar, but not identical, concepts.

[Crosstabulation of “Iraq” by “war” from the Student Survey (table not reproduced here)]
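In SPSS, a table like this is produced by the Crosstabs procedure.  As a rough equivalent, here is a minimal sketch in Python with pandas; reading an SPSS .sav file this way requires the pyreadstat package, and the file name “survey.sav” and the lowercase variable names are assumptions for illustration, not necessarily the actual names in the Student Survey.

    import pandas as pd

    # "survey.sav" is a placeholder name for the Student Survey file.
    df = pd.read_spss("survey.sav")

    # Raw counts only: neither variable is treated as independent,
    # so no percentages are requested.
    print(pd.crosstab(df["iraq"], df["war"]))

If the two items really do tap the same concept, most cases should fall along the main diagonal of the resulting table.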

Exercises

 

1.         Start SPSS, and open “anes00s.sav.”  Open the American National Election Study 2000 Subset codebook.  Examine the codebook for two variables, “aidblack” and “racepref,” which can be seen as different ways of operationalizing the same basic concept.  Crosstabulate these variables.  Do respondents' answers to one question give you at least a fairly good indication of how they answered the other question?  Repeat this exercise using other variables in the file that seem to measure the same basic concept.  (A Python version of this crosstabulation is sketched in the first example following these exercises.)

2.         Still using the American National Election Study Subset, examine the codebook for measures of the gender of the respondent (“gender”), the gender of the interviewer (“intgen”), and attitude toward the role of women (“womrole”).  Obtain frequency distributions for these variables.  Crosstabulate “womrole” by “gender” and by “intgen.”  Do answers to the question on women’s roles depend more on the respondent’s gender or on that of the interviewer?  Did the fact that interviewers were predominantly female influence the frequency distribution for the question on women’s roles?  (See the second sketch following these exercises.)
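For those working outside the SPSS menus, here is a minimal sketch of exercise 1 in Python with pandas (again assuming the pyreadstat package is installed); the file and variable names are those given in the exercise.

    import pandas as pd

    # Load the American National Election Study 2000 subset.
    df = pd.read_spss("anes00s.sav")

    # "aidblack" and "racepref" operationalize the same basic concept;
    # if both are valid measures, answers to the two should largely agree.
    print(pd.crosstab(df["aidblack"], df["racepref"]))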
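And a comparable sketch for exercise 2, with column proportions added to make the two comparisons easier to read:

    import pandas as pd

    df = pd.read_spss("anes00s.sav")

    # Frequency distributions for the three variables.
    for var in ["gender", "intgen", "womrole"]:
        print(df[var].value_counts(dropna=False), "\n")

    # Attitudes toward women's roles by respondent gender and by
    # interviewer gender, as proportions within each column.
    print(pd.crosstab(df["womrole"], df["gender"], normalize="columns"))
    print(pd.crosstab(df["womrole"], df["intgen"], normalize="columns"))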



 

For Further Study

 

Arnold, William E., James C. McCroskey, and Samuel V. O. Prichard, “The Likert-Type Scale,” Today's Speech 15 (1967): 31-33.  Available online at http://www.jamescmccroskey.com/publications/25.htm.  Accessed October 24, 2003.

 

Fitzgerald, John, “Stems and Scales,” http://www.coolth.com/likert.htm.  Accessed October 24, 2003.

 


 

[1] Samuel A. Stouffer, Communism, Conformity, and Civil Liberties (Gloucester, MA: Peter Smith, 1955), Appendix B.

[2] John L. Sullivan, James Piereson, and George E. Marcus, “An Alternative Conceptualization of Political Tolerance: Illusory Increases, 1950s-1970s,” American Political Science Review (September 1979): 781-794.

[3] David Lauter, “Why Poll Results Differ,” Los Angeles Times, September 12, 2003; Mark DiCamillo and Mervin Field, “A Different Take on ‘Why Polls Differ’,” Field Poll, Special Report, September 16, 2003.  http://field.com/fieldpollonline/subscribers/.  Accessed December 23, 2003.

[4] Cited in Everett Carll Ladd, The American Ideology (Storrs, Conn.: Roper Center, 1994): 79-80.