POL242 LAB MANUAL:  EXERCISE 10

Logistic Regression

Task 1: Explain support for gay marriage
Task 2: Explain the Canadian Vote

PURPOSE

RATIONALE

Ordinary Least Squares regression is inappropriate for explaining dependent variables that are dichotomous. Look at the example below: all of the observations are at two places on the Y-axis, at one and at zero. These are real data. The X-axis is the median income of American states. The Y-axis is one if the state voted for Bush in 2000, zero if the state voted for Gore. To simplify, I include only the ten poorest states and the ten richest states. As you can see, almost all of the poorest states (including Mississippi, Arkansas and Gore's home state, Tennessee) voted for Bush, but most of the rich states voted for Gore.

When you try to fit any regular, linear regression line (see chart on right), you cannot do a very good job of minimizing the errors, so the fit is very poor.
Notice what the line predicts should be the value of Y for when X=50,000.
A regression line predicts that, for most levels of X (median income), Y should be between one and zero. It is also not uncommon for a regression line to predict values less than zero or higher than one.
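
The problem described above can be reproduced with a few lines of Python. This is only a sketch: the incomes and votes below are made-up numbers chosen for illustration, not the state data shown in the chart, and fit_line and predict are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data, NOT the state data in the chart: median income in
# thousands of dollars, and 1 = voted Bush, 0 = voted Gore.
incomes = [30, 32, 34, 36, 38, 50, 52, 54, 56, 58]
votes = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]

intercept, slope = fit_line(incomes, votes)


def predict(income):
    """Value of Y that the straight line predicts for a given income."""
    return intercept + slope * income


# Print the line's prediction at a low, a middle and a high income.
for income in (20, 44, 70):
    print(income, round(predict(income), 3))
```

With these made-up numbers the line predicts a value above one for the lowest income and below zero for the highest, which is impossible for a variable that can only be one or zero.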

Of course, this cannot happen: a state either votes for Bush (1) or votes for Gore (0).  As a result, the distance between almost all points and the line is quite large. The distance is even larger for those points that seem to be "wrong" - notice how far away from the regression line is the point for New Mexico, the poorest state to vote for Gore (in the bottom-left of the chart).
Instead, we fit an S-shaped, curved line to the data (see chart on left). (Because this line has a negative slope, it looks more like a Z; a curve with a positive slope looks more like the letter S.)

This line does a much better job of minimizing the errors. The band of values of X for which Y should be between one and zero is minimized. By default, WebStats rounds predicted values over 0.5 up to one and the rest down to zero, which lets it tell us how many observations we correctly predicted. So, we can interpret the data in terms of increasing the odds, chances or probability that the choice is one (a vote for Bush), rather than the predicted value for a given level of X. While this interpretation is less straightforward, it is more appropriate for modelling dichotomous choices.
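
As a rough sketch of the idea, the S-curve used in logit analysis is the logistic function, and the 0.5 cutoff turns its output into a predicted one or zero. The intercept and slope below are hypothetical values picked to give a Z-shaped curve over incomes measured in thousands of dollars; they are not estimates from the state data, and the function names are my own.

```python
import math


def logistic(x, intercept, slope):
    """Predicted probability that Y = 1 at a given value of x."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


def classify(probability, cutoff=0.5):
    """Round a predicted probability to 1 if it is over the cutoff, else 0."""
    return 1 if probability > cutoff else 0


# Hypothetical coefficients (assumed, not estimated from real data).
# The negative slope makes the curve fall as income rises.
INTERCEPT = 8.8
SLOPE = -0.2

for income in (30, 44, 58):
    p = logistic(income, INTERCEPT, SLOPE)
    print(income, round(p, 3), classify(p))
```

Unlike a straight line, the curve never leaves the range between zero and one, however large or small X becomes.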

There are two types of similar S-curves used to analyze these data, logit and probit. The two are very similar. Here, we will discuss logit analyses.

We will only discuss models of dichotomous dependent variables. There are also ways of analyzing more than two choices. We will not be able to discuss these methods in this class, but you should be aware of their existence. They bridge the gap between dichotomous choice models and the smallest number of intervals to which regression lines can be applied. They include ordered logit (or probit), which basically fits multiple S-curves, like steps, to ordinal dependent variables such as the four-point support-oppose scale used in the unrecoded gay marriage variable below, and multinomial logit (or probit), which enables scholars to use categorical dependent variables with multiple, unordered categories, like a question that asks whether the respondent intends to vote for the Liberals, the Conservatives, or the NDP in Ontario.


INSTRUCTIONS

The first thing we need to do is recode our dependent variable(s) to make dichotomous dummy variables. For this data lab, we will use two variables from the CRIC 2002 study, one that measures support for more immigration, and one that gauges support for gay marriage.

Immigration Policy
6. Do you think Canada should accept more, fewer or about the same number of immigrants as we accept now?
There are three possible answers (more, fewer, or about the same). We will combine more and about the same, to gauge support for continuing Canada's [permissive] immigration policy.

Gay Marriage
7. Do you strongly support, support, oppose or strongly oppose allowing gay and lesbian couples to marry?
There are four possible answers. To turn this into a dummy variable, we will combine support and strongly support, and oppose and strongly oppose. A value of one will indicate support for gay marriage, zero opposition.
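
The two recodes described above can be sketched in Python as follows. In WebStats or SPSS you would write recode syntax instead; this is only an illustration of the logic. The answer labels used as dictionary keys and the "don't know" response are assumptions made for the example, so check the CRIC 2002 codebook for the real codes.

```python
# Assumed answer labels; the real data set stores numeric codes that
# you must look up in the codebook.
IMMIGRATION_RECODE = {"more": 1, "about the same": 1, "fewer": 0}
GAY_MARRIAGE_RECODE = {
    "strongly support": 1,
    "support": 1,
    "oppose": 0,
    "strongly oppose": 0,
}


def recode(answer, mapping):
    """Return 1 or 0, or None for any answer not in the mapping (missing)."""
    return mapping.get(answer)


# Four made-up respondents, one of whom gave no usable answer.
answers = ["support", "strongly oppose", "don't know", "strongly support"]
dummy = [recode(a, GAY_MARRIAGE_RECODE) for a in answers]

# Drop the missing values before running the analysis.
valid = [d for d in dummy if d is not None]
print(dummy, valid)
```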

***Label! It is important to make new value labels for any dependent variable when you run a logistic regression on WebStats or SPSS, as you will see below.***

Example 1:  Immigration Policy
Right click here to open a new window to see the output for the immigration policy example.

At the top of the page, there is useful information. You see the number of cases used in the analysis, so you can make sure you remembered to get rid of the missing values. Then you can check that you coded the dependent variable properly, and it tells you what the dependent variable is (in this case, the question on support for immigration). All of this is useful, but little is important or newsworthy. Skip down a few lines and you will see a list of the independent variables.


For this example, I used:
Q3   Are you worried about you or a member of your family finding or keeping a stable, full-time job?
Q38 Over the next decade, do you think that it is likely that we [Canada] will reduce prejudice against ethnic and racial minorities?
Q42 What is the highest level of education you have reached?
Q45 Total annual income
I expect that those who are worried about losing their jobs will be less supportive of admitting immigrants. I also expect that those who think that Canada will not reduce prejudice against ethnic and racial minorities will not support admitting immigrants (most of whom are minority group members). However, I expect the educated and the wealthy to support opening the border.

Below the list, there are some numbers, labeled "goodness of fit," "model chi-square" and others, that for now, you can disregard. The important information is at the very bottom of the page, the Classification Table and the Variables in the Equation List.

The Variables in the Equation List should remind you of the multiple regression output. It is very similar. On the far left, you find a columnar list of variables used in the analysis. Along each row is information about the size and direction of the coefficient, as well as whether or not the independent variable is significant.

The third column from the right, highlighted in blue, is labeled "Sig" and indicates the significance of the variable coefficient. Just like with multiple regressions, this tells you whether the variable had an effect on the value of Y. If you observe a number close to 0.000, you can be confident that the independent variable has a significant effect on the dependent variable. Q38 (reduce prejudice) and Q42 (education) are clearly significant. Q3 (worried about losing job) is significant at the less strenuous standard of p < 0.05. With P = .6390, Q45 (income) is not significant.

In the second column, you can view a "B" coefficient. This is like the regression coefficient. Larger numbers reflect a bigger change per unit change in the independent variable than smaller numbers. Even though you cannot interpret the magnitude of the change the same way as you would a regression coefficient, you can tell whether that variable increases the chances of supporting a permissive immigration policy by looking at the sign of the coefficient. If the coefficient is positive, a high value for the variable increases the chances of supporting the policy. For example, look at the B for education, .4161. As we expected, those who are more educated support allowing more or the same number of immigrants. In contrast, the negative coefficient for Q3, -.1430, indicates that the more respondents worry about losing their jobs, the more likely they are to support decreasing the number of immigrants to Canada.
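
One common way to put a size on a B coefficient, not covered in this lab but worth knowing, is to exponentiate it. The result is the factor by which the odds of the outcome are multiplied for each one-unit increase in the variable; SPSS output usually reports it in a column labeled Exp(B). The sketch below applies this to the two coefficients quoted above; the function name is my own.

```python
import math


def odds_ratio(b):
    """Multiplicative change in the odds for a one-unit increase in X."""
    return math.exp(b)


b_education = 0.4161   # B for Q42, from the output discussed above
b_job_worry = -0.1430  # B for Q3, from the output discussed above

# A value above 1 raises the odds; a value below 1 lowers them.
print(round(odds_ratio(b_education), 3))
print(round(odds_ratio(b_job_worry), 3))
```

A positive B always gives an odds ratio above one and a negative B gives one below one, so the sign tells the same story as before. Note that this is a change in the odds, not in the probability itself.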

Above the Variables in the Equation List is the Classification Table. The columns show the predicted Ys, while the rows are the actual, observed Ys in the data. Each row and each column is labeled according to how the variable is labeled (which is why it is important to label the dependent variable). Look at the bottom-right corner: there are 549 observations correctly predicted to be in favor of a permissive immigration policy. A further 147 observations are in the blue cell in the top-left; these were correctly predicted to be people who think that fewer immigrants should be let into the country. The other two cells of this table are the observations that the model incorrectly predicted. The best models have many observations in the two blue cells, with few in the red cells. This model incorrectly classified too many respondents as being in favor of a permissive immigration policy when, in actuality, they oppose more immigrants.

In total, this model correctly classified 60.05% of the observations. Good models should do better than tossing a coin, that is, they should correctly classify more than 50% of the observations. Good models should also exceed the naive model. The naive model is the percent of observations you would classify correctly if you always guessed 'one' or 'zero' (whichever is the most common). Since about 58% of the actual observations are in favor of Canada's immigration policy or making it more permissive, this model only slightly improves on the naive model.
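
The arithmetic behind these percentages can be sketched as follows. Only the two correctly predicted cells (147 and 549) come from the output discussed above. The two off-diagonal cells (340 and 123) are assumed values, chosen solely so that the totals agree with the reported 60.05% and "about 58%"; read the real numbers off the Classification Table.

```python
def percent_correct(table):
    """Percent of observations on the diagonal of a 2x2 classification table."""
    correct = table[0][0] + table[1][1]
    total = sum(sum(row) for row in table)
    return 100.0 * correct / total


def naive_percent(table):
    """Percent correct if you always guess the most common observed value."""
    row_totals = [sum(row) for row in table]
    return 100.0 * max(row_totals) / sum(row_totals)


# Rows are observed values (0, then 1); columns are predicted values (0, then 1).
table = [
    [147, 340],  # observed 0: 147 correct (from output); 340 is ASSUMED
    [123, 549],  # observed 1: 123 is ASSUMED; 549 correct (from output)
]

print(round(percent_correct(table), 2))
print(round(naive_percent(table), 1))
```

Comparing the two numbers is the quickest check of whether a model adds anything beyond always guessing the most common answer.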

Above the classification table are some goodness of fit measures. The Model Chi-Square is useful when comparing a model with a set of variables to another model with the same set of variables, plus a few more. The difference in chi-squares indicates whether these additional variables improved the model.
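
A minimal sketch of that comparison: subtract the smaller model's chi-square from the larger model's, and compare the difference with the chi-square critical value whose degrees of freedom equal the number of variables added. The critical values below are the standard ones for p = 0.05. The two model chi-squares in the example are made-up numbers, and the function name is my own.

```python
# Standard chi-square critical values at p = 0.05, keyed by degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def improvement_is_significant(chi_small, chi_large, added_variables):
    """True if the added variables significantly improve the model at p = 0.05.

    chi_small is the Model Chi-Square of the model with fewer variables,
    chi_large that of the same model plus added_variables new variables.
    """
    difference = chi_large - chi_small
    return difference > CRITICAL_05[added_variables]


# Made-up chi-squares for illustration only.
print(improvement_is_significant(45.2, 53.9, added_variables=2))
print(improvement_is_significant(45.2, 47.0, added_variables=2))
```

The comparison is only valid when both models are run on the same cases, so check that adding variables did not drop observations with missing values.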

Example 2 and Task 1: Gay marriage
Now it is your turn. Right click here to open a new window to see a second example, using Q7, support for or opposition to gay marriage. I have done much of the work for you. At the top of the page is the syntax to run an analysis using three independent variables: level of education, age, and religiosity.

Look at the Variables in the Equation List.
1) Are all three variables significant at P < 0.1? P < 0.05?
2) What sign are the coefficients? Which variable(s) increase the odds of supporting gay marriage? Which variable(s) decrease the odds?

Look at the Classification Table.
3) What percentage of the observations did the model correctly predict? Does this percentage exceed the naive model?
4) Where is the model weak? In other words, where is the model over-predicting or under-predicting the most numerous observations?

Now, try to improve the model.
Cut and paste this syntax into WebStats so you can run the existing models. Go to the CRIC 2002 codebook, and add at least two independent variables to the model, or add one new independent variable and one interaction term. Try to focus on correcting the model's biggest weakness. Remember to recode the variables and get rid of missing values. For those of you who have not worked with this data set before, a student in last year's class who wrote her final paper on gay marriage, Anna Saini, suggested that you might have good results if you used two variables from the following list (but you can try any variables you'd like):


Language (French vs. English)
Gender
Q9 Importance of obedience of authority
Q13 Opinion of Charter
Q15A Opinion about equality
Q15C Entitlement to legal rights and protection
Q15F Respect for authority
Q16A Charter goes too far protecting the rights of minority groups
Q45 Income
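
If you choose to add an interaction term, it is simply the product of two independent variables, computed for each respondent after both have been recoded. The sketch below shows the logic in Python; in WebStats you would compute the new variable with syntax instead. The pairing of a French-language dummy with a religiosity score is a hypothetical example, as are the values.

```python
def interaction(a, b):
    """Product of two recoded variables; missing if either one is missing."""
    if a is None or b is None:
        return None
    return a * b


# Hypothetical respondents: (French dummy, religiosity score).
respondents = [(1, 3), (0, 3), (1, None), (1, 0)]

terms = [interaction(french, religiosity) for french, religiosity in respondents]
print(terms)
```

Keep both original variables in the model along with the product, so the interaction coefficient can be read as the extra effect of one variable when the other is present.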

After re-running the analysis with these new variables, answer the following questions:
1) Did each of the variables you added have a significant effect on support for [or opposition to] gay marriage?
2) Did high values of your variables increase the chance of supporting gay marriage, or decrease the chance of supporting gay marriage?
3) How well did your new model do at predicting support for, and opposition to gay marriage? Did you improve upon the original model?

To answer this last question, you should look at the Classification Table, but you can also compare the Model Chi-Square.

Task 2: Explain the Canadian Vote.
The problem with the CRIC data is that the number of possible independent variables is somewhat limited. So, for the second task, turn to another dataset, the CNES Canadian National Election Survey. As you can see by looking at the codebook found at http://www.fas.umontreal.ca/pol/ces-eec/ces.html, there is no shortage of variables in this dataset.

The dependent variable you will use is A4, the respondent's vote intention. To perform a logistic regression, you must recode this variable into a dichotomous dummy variable, with one taking the value of the party of your choice and zero being all other parties.
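
The recode of the vote intention variable can be sketched like this. The party codes and missing-value codes below are hypothetical placeholders, not the real A4 codes; look those up in the codebook before writing your own recode syntax.

```python
# Hypothetical codes for illustration only; check the codebook for A4.
MISSING_CODES = {98, 99}
CHOSEN_PARTY = 1


def vote_dummy(code, party_code=CHOSEN_PARTY):
    """1 for the chosen party, 0 for any other party, None if missing."""
    if code is None or code in MISSING_CODES:
        return None
    return 1 if code == party_code else 0


# Made-up responses: chosen party, two other parties, and a missing answer.
responses = [1, 2, 3, 98, 1]
recoded = [vote_dummy(r) for r in responses]
print(recoded)
```

Treating missing answers as missing, rather than as zero, keeps people who gave no vote intention from being counted as supporters of other parties.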

Choose at least three independent variables, or two independent variables and an interaction term. You may find Section M, which contains demographics, to be especially useful to find independent variables. Remember to turn all unordered categorical variables into dummy variables. Run a logistic regression. After running the regression, be ready to answer:

Look at the Variables in the Equation List.
1) Are all your variables significant at P < 0.1? P < 0.05?
2) What sign are the coefficients? Which variable(s) increase the odds of voting for the party of your choice? Which variable(s) decrease the odds?

Look at the Classification Table.
3) What percentage of the observations did the model correctly predict? Does this percentage exceed the naive model?
4) Where is the model weak? In other words, where is the model over-predicting or under-predicting the most numerous observations?


Be ready to share your findings with the rest of the class.