POL242 LAB MANUAL: EXERCISE 5B

ANOVA (One-Way)

PURPOSE

Understand the null hypothesis in terms of Type I and Type II errors
Learn how to perform and interpret an ANOVA test for significance

MAIN POINTS

Types of Error, the null hypothesis and statistical significance

Researchers often distinguish Type I and II errors.
Type I occurs when we conclude that there is a relationship between two variables when there is actually none (a false positive).
Type II occurs when we conclude that there is no relationship between two variables when one exists in reality (a false negative).

                                        REALITY
ANALYTICAL CONCLUSION    No Relationship      Relationship

No Relationship          Correct              Type II Error
Relationship             Type I Error         Correct

Researchers routinely take as the basis for their work the null hypothesis of no relationship between variables.
Measures of significance are used to rule out the null hypothesis and avoid making Type I errors.
A significance (probability) level indicates the chance of making a Type I error, that is, of finding a relationship this strong in the sample when none exists in the population.
Probabilities of .05 and less are conventionally taken as grounds for ruling out the null hypothesis and concluding a relationship does exist.
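The error table and the .05 convention above can be restated as a short sketch. It is written in Python rather than Webstats syntax, and the names `decide` and `classify` are illustrative assumptions, not features of the lab software.

```python
# Illustrative sketch only: the function names are assumptions, not Webstats features.

ALPHA = 0.05  # conventional cut-off described above


def decide(p_value, alpha=ALPHA):
    """Return True if we rule out the null hypothesis (conclude a relationship)."""
    return p_value <= alpha


def classify(conclude_relationship, relationship_exists):
    """Label a conclusion against reality (which is normally unknown)."""
    if conclude_relationship and not relationship_exists:
        return "Type I error"   # false positive
    if not conclude_relationship and relationship_exists:
        return "Type II error"  # false negative
    return "correct"


print(decide(0.03), classify(True, False))
```

In practice only `decide` can be applied to real output, since the true state of the population is not observed; `classify` simply spells out the four cells of the table.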

One-way ANOVA

ANOVA: ANalysis Of VAriance.
Like a T-test, ANOVA is used to test whether the mean of the dependent variable differs across categories of an independent (group) variable.
The independent variable used in an ANOVA test can have 3 or more categories and may be nominal or ordinal. The dependent variable is ideally an interval variable, but can also be ordinal, particularly with many values.
ANOVA produces an F statistic which, like T, measures the ratio of between-group variance to within-group variance.
The higher the value of F, the more likely it is that the difference between the means is significant, i.e., not due to chance.
Again like T scores, an F score is compared to a probability distribution to arrive at the probability (p) value.
Probability levels for F are interpreted in the same way as those for T.
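To make the between-group and within-group idea concrete, here is a minimal sketch in Python (not Webstats syntax). The three groups and their scores are invented for illustration, and `one_way_f` is a hypothetical helper name.

```python
# Hypothetical scores for three groups; the data are invented for illustration.
groups = {
    "A": [1, 2, 3],
    "B": [2, 3, 4],
    "C": [4, 5, 6],
}


def mean(values):
    return sum(values) / len(values)


def one_way_f(groups):
    """Return (F, df_between, df_within) for a dict of group -> list of scores."""
    all_scores = [x for scores in groups.values() for x in scores]
    grand_mean = mean(all_scores)

    # Between-groups sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in groups.values())
    # Within-groups sum of squares: distance of each score from its own group mean.
    ss_within = sum((x - mean(s)) ** 2 for s in groups.values() for x in s)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)

    # F is the ratio of the two mean squares (sum of squares / degrees of freedom).
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


f_ratio, df_b, df_w = one_way_f(groups)
print(round(f_ratio, 2), df_b, df_w)
```

For these invented scores the group means (2, 3 and 5) are far apart relative to the spread inside each group, so F works out to 7 on 2 and 6 degrees of freedom.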

INSTRUCTIONS

Select a dataset and hypothesize a relationship between two appropriate variables

e.g., Political ideology (dependent) varies according to the geographic region (group) in which an individual resides.
The independent (group) variable is preferably nominal, although it can be ordinal.
The dependent variable should ideally be interval, although you can use an ordinal variable with multiple categories.

Use Webstats to perform Frequency runs for each of the variables to identify missing values and recodes. For the group variable, make note of the lowest and the highest relevant value.
Set the Type of Analysis to "Oneway ANOVA" and hit Proceed
Select the group variable and enter the lowest value as the 'first' group value. Then enter the highest value as the 'second' group value.
Select the dependent variable and indicate any missing values.
Finally, enter any recodes (if necessary) and hit Run.
Based on the output, determine whether differences in the means on the dependent variable across the categories of the independent (group) variable are due to sampling error, or are actually representative of the population. Make this judgment based on whether the significance level for the ANOVA test is below .05.
Repeat the steps above until you find a pair of variables that yields a significant result for the ANOVA test.

EXAMPLE

Dataset:

CCFRpop

Group (independent) Variable:

[Q1090] Region

                                                       Valid     Cum
Value Label                 Value Frequency Percent Percent Percent

East                         1.00       732     22.4     22.4     22.4
South                        2.00       773     23.7     23.7     46.1
Midwest                      3.00      1055     32.3     32.3     78.5
West                         4.00       702     21.5     21.5    100.0
                                     ------- ------- -------
                            Total      3262    100.0    100.0

Valid cases    3262      Missing cases      0

Dependent Variable:

[Q1005] Political Ideology

                                                       Valid     Cum
Value Label                 Value Frequency Percent Percent Percent

Very conservative            1.00       329     10.1     10.3     10.3
Fairly conservative          2.00       906     27.8     28.5     38.8
Middle of the road           3.00      1195     36.6     37.6     76.4
Fairly liberal               4.00       572     17.5     18.0     94.4
Very liberal                 5.00       179      5.5      5.6    100.0
Decline to answer           -9.00        21       .6   Missing
Not sure                    -8.00        60      1.8   Missing
                                     ------- ------- -------
                            Total      3262    100.0    100.0

Valid cases    3181      Missing cases     81

Hypothesis Arrow Diagram:

Region → Political Ideology

Syntax

get file="/homes/josephf/webstats/CCFRpop.sav".
missing values Q1005 (-8,-9).
oneway Q1005 by Q1090 (1,4)
/ranges=scheffe
/statistics=all.

Syntax Legend

Missing values are specified as usual
The oneway (ANOVA) command lists the dependent variable first, followed by the independent (group) variable.

the lowest and highest values of the independent variable must be specified in parentheses, here (1,4)
the optional /ranges=scheffe produces a grid indicating which pairs of groups differ significantly
the optional /statistics=all produces a variety of useful statistics

Output

                       - - - - - O N E W A Y - - - - -

      Variable Q1005      Q1005 Political views
   By Variable Q1090      Q1090 Region

                                  Analysis of Variance

                                  Sum of         Mean             F      F
        Source           D.F.    Squares       Squares          Ratio Prob.

Between Groups             3       25.2964        8.4321       7.9768 .0000
Within Groups           3177     3358.3421        1.0571
Total                   3180     3383.6385

                                 Standard   Standard
Group       Count        Mean   Deviation      Error    95 Pct Conf Int for Mean

East          713      2.8962      1.0211      .0382      2.8211 TO      2.9713
South         753      2.8300       .9636      .0351      2.7611 TO      2.8989
Midwest      1027      2.6758      1.0774      .0336      2.6098 TO      2.7417
West          688      2.8561      1.0285      .0392      2.7791 TO      2.9331

Total        3181      2.8007      1.0315      .0183      2.7648 TO      2.8366

          Fixed Effects Model      1.0281      .0182      2.7649 to      2.8364

         Random Effects Model                  .0524      2.6341 to      2.9673

Random Effects Model - estimate of between component variance     9.365E-03

GROUP        MINIMUM     MAXIMUM

East          1.0000      5.0000
South         1.0000      5.0000
Midwest       1.0000      5.0000
West          1.0000      5.0000

TOTAL         1.0000      5.0000

Levene Test for Homogeneity of Variances

    Statistic    df1    df2       2-tail Sig.
      8.0590      3     3177        .000

                       - - - - - O N E W A Y - - - - -

      Variable Q1005      Q1005 Political views
   By Variable Q1090      Q1090 Region

Multiple Range Tests: Scheffe test with significance level .05

The difference between two means is significant if

  MEAN(J)-MEAN(I) >= .7270 * RANGE * SQRT(1/N(I) + 1/N(J))

  with the following value(s) for RANGE: 3.96

(*) Indicates significant differences which are shown in the lower triangle

                        M
                        i
                        d S
                        w o W E
                        e u e a
                        s t s s
                        t h t t

    Mean      Q1090

  2.6758      Midwest
  2.8300      South     *
  2.8561      West      *
  2.8962      East      *

Interpretation

The following figures are the most important ones for us.
F Probability: This is the most important figure of the ANOVA analysis. It is reported as .0000 for the present analysis, meaning that there is less than a 1 in 10,000 chance that the observed mean differences in ideology are due simply to sampling error. Thus, there is a significant difference in political ideology across the regions.
Means: These figures show the mean liberalism score for respondents in each region (higher scores are more liberal), with the East ranked highest and the Midwest lowest.
Confidence Intervals: These ranges show the 95% confidence interval around each mean score. Using these we can see where the significant differences are likely to lie. Note that the interval for the Midwest (2.6098 to 2.7417) does not overlap the interval of any other region, while the intervals for the East, South and West overlap one another.
The Confidence Intervals are used in plotting the results of the Scheffe test.
The Scheffe test uses asterisks on a grid to indicate which pairs of groups differ significantly from one another. This makes it easier to see that the Midwest mean differs significantly from those of the other three regions, while those three do not differ significantly among themselves.
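The rule printed in the Scheffe output can be checked by hand for every pair of regions. The following is a minimal Python sketch (not Webstats syntax) that uses only the group counts, the means, the .7270 multiplier and the RANGE value of 3.96 from the output; the function name `scheffe_significant` is an illustrative assumption.

```python
from itertools import combinations
from math import sqrt

# Group sizes and means copied from the output, ordered from lowest to highest mean.
groups = {
    "Midwest": (1027, 2.6758),
    "South": (753, 2.8300),
    "West": (688, 2.8561),
    "East": (713, 2.8962),
}

MULTIPLIER = 0.7270  # constant printed in the Scheffe formula
RANGE = 3.96         # RANGE value printed in the output


def scheffe_significant(n_i, mean_i, n_j, mean_j):
    """Apply the printed rule: difference >= .7270 * RANGE * sqrt(1/n_i + 1/n_j).

    abs() is used so the order of the two groups does not matter.
    """
    threshold = MULTIPLIER * RANGE * sqrt(1 / n_i + 1 / n_j)
    return abs(mean_j - mean_i) >= threshold


significant_pairs = [
    (a, b)
    for (a, (n_a, m_a)), (b, (n_b, m_b)) in combinations(groups.items(), 2)
    if scheffe_significant(n_a, m_a, n_b, m_b)
]
print(significant_pairs)
```

Only the three pairs that include the Midwest pass the threshold, which corresponds to the three asterisks in the grid.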

QUESTIONS FOR REFLECTION

Did you find a significant result? If so, what is the likelihood that you are making a Type I Error?
How does the Independent Sample t-test differ from the ANOVA test?

DISCUSSION

Recall that the measure of significance represents the likelihood of making a Type I error. So if sig.=.03, then the likelihood that you are making a Type I error by concluding that there is a relationship, when there is actually none, is 3 in 100.
The T-test compares the means of only two categories of the group variable, while ANOVA compares the means across all the categories of the group variable. Also, the group variable can be either nominal or ordinal for the ANOVA test.

FURTHER TECHNICAL DETAILS

The F-score is calculated as the ratio of between-group to within-group variance, using the figures in the Mean Squares column of the first table. Thus 8.4321 divided by 1.0571 = 7.9766 (the output shows 7.9768 because it works from unrounded mean squares). This figure is compared to the sampling distribution of F for the degrees of freedom in the table (3 and 3177) to determine significance. Unlike a T or Z score, F is not a number of standard deviations, so the familiar 1.96 cut-off does not apply; with these degrees of freedom the cut-off for the .05 significance level is an F of roughly 2.6. The F-score here approaches 8 and thus easily passes significance at the .05 level.
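The F ratio can be rebuilt from the Sum of Squares and D.F. columns of the Analysis of Variance table. This is a short Python sketch (not Webstats syntax) using only figures copied from the output; the variable names are illustrative.

```python
# Figures copied from the Analysis of Variance table in the output.
ss_between, df_between = 25.2964, 3
ss_within, df_within = 3358.3421, 3177

ms_between = ss_between / df_between   # mean square between groups
ms_within = ss_within / df_within      # mean square within groups
f_ratio = ms_between / ms_within       # F = between-group / within-group variance

print(round(ms_between, 4), round(ms_within, 4), round(f_ratio, 4))
```

Working from the sums of squares rather than the already rounded mean squares gives an F that agrees with the F Ratio printed in the table to four decimal places.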
Standard Errors and confidence intervals are calculated as they were in the previous lab.
Thus the Standard Error again = the square root of the variance (i.e., the standard deviation) divided by the square root of n (the number of cases in the group). Remember: variance = SD², so for East this yields 1.0211/26.70 = .0382.
The 95 Pct Conf Int for Mean column is calculated by subtracting 1.96 times the standard error from the mean score and adding it to the mean score. Since for the East .0382(1.96) = .075, then (2.896 - .075) = 2.821 while (2.896 + .075) = 2.971.
Derivation of the Sum of Squares and Mean Squares will be discussed in the second term in connection with regression and two-way ANOVA.
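The standard error and confidence interval arithmetic described in this section can be reproduced for all four regions. The sketch below is in Python (not Webstats syntax); the counts, means and standard deviations are copied from the output, the 1.96 multiplier is the one used in this lab, and the function names are illustrative assumptions.

```python
from math import sqrt

# (count, mean, standard deviation) for each region, copied from the output.
regions = {
    "East": (713, 2.8962, 1.0211),
    "South": (753, 2.8300, 0.9636),
    "Midwest": (1027, 2.6758, 1.0774),
    "West": (688, 2.8561, 1.0285),
}

Z_95 = 1.96  # multiplier used in this lab for a 95% interval


def standard_error(n, sd):
    """Standard error of a group mean: SD divided by the square root of n."""
    return sd / sqrt(n)


def confidence_interval(n, mean, sd, z=Z_95):
    """Return (low, high): the mean minus and plus z standard errors."""
    half_width = z * standard_error(n, sd)
    return mean - half_width, mean + half_width


for name, (n, mean, sd) in regions.items():
    low, high = confidence_interval(n, mean, sd)
    print(f"{name:8s} SE={standard_error(n, sd):.4f}  CI={low:.3f} to {high:.3f}")
```

The results may differ from the printed intervals in the fourth decimal place, since the program works from unrounded figures and may use a multiplier that differs slightly from 1.96.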