ANOVA (One-Way)
- Understand the null hypothesis in terms of Type I and Type II
errors
- Learn how to perform and interpret an ANOVA test for significance
Types of Error, the Null Hypothesis and Statistical Significance
- Researchers often distinguish between Type I and Type II
errors.
- Type I occurs when we conclude that there is a relationship
between two variables when there is actually none (a false positive).
- Type II occurs when we conclude that there is no relationship
between two variables when one exists in reality (a false negative).
                        | REALITY                            |
ANALYTICAL CONCLUSION   | No Relationship  | Relationship    |
No Relationship         | ✓ (correct)      | Type II Error   |
Relationship            | Type I Error     | ✓ (correct)     |
- Researchers routinely take as the basis for their work the
null hypothesis of no relationship between variables.
- Measures of significance are used to rule out the null
hypothesis and avoid making Type I errors.
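- A significance or probability level indicates the chance of making
a Type I Error, that is, of rejecting the null hypothesis when it is actually
true. A level of .05 corresponds to a 5 percent chance.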
- Probabilities of .05 and less are conventionally taken
as grounds for ruling out the null hypothesis and concluding a relationship
does exist.
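The four cells of the table above, together with the conventional .05 rule, can be written out as two small functions. This is only a minimal Python sketch: the names `classify_conclusion` and `reject_null` are illustrative assumptions and are not part of Webstats or SPSS.

```python
def classify_conclusion(relationship_in_reality: bool,
                        relationship_concluded: bool) -> str:
    """Label an analytical conclusion against what is true in reality."""
    if relationship_concluded and not relationship_in_reality:
        return "Type I error"   # false positive
    if not relationship_concluded and relationship_in_reality:
        return "Type II error"  # false negative
    return "correct"


def reject_null(p_value: float, alpha: float = 0.05) -> bool:
    """Conventional rule: rule out the null hypothesis when p is .05 or less."""
    return p_value <= alpha


print(classify_conclusion(relationship_in_reality=False,
                          relationship_concluded=True))
print(reject_null(0.03))
```

In practice we never observe `relationship_in_reality` directly, which is why the significance level is used to limit how often a Type I error would occur.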
One-way ANOVA
- ANOVA: ANalysis Of VAriance.
- Like a T-test, ANOVA is used to see whether the mean of the dependent
variable differs across categories of an independent (group)
variable.
- The independent variable used in an ANOVA test can have 3 or more
categories and may be nominal or ordinal. The dependent variable is
ideally an interval variable, but can also be ordinal, particularly with many
values.
- ANOVA produces an F statistic which, like T, is based on the ratio of
between-group variance to within-group variance.
- The higher the value of F, the more likely the difference between the
means is significant, i.e., not due to chance.
- Again like T scores, an F score is compared to a probability distribution
to arrive at the probability (p) value.
- Probability levels for F are interpreted in the same way as those for T.
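To make the between-group to within-group ratio concrete, here is a minimal Python sketch that computes F from raw scores. The function name `one_way_f` and the toy scores are made up for illustration; this is not how Webstats or SPSS is implemented internally, only the textbook arithmetic.

```python
from statistics import mean


def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Between-group sum of squares: group size times the squared distance
    # of each group mean from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: squared distance of each score
    # from the mean of its own group.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Made-up scores for three groups (illustration only).
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_ratio, df_b, df_w = one_way_f(toy_groups)
print(f"F({df_b}, {df_w}) = {f_ratio:.2f}")
```

The further apart the group means are relative to the spread inside each group, the larger F becomes.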
- Select a dataset and hypothesize a relationship between two appropriate
variables
- e.g., Political ideology (dependent) varies according to the geographic
region (group) in which an individual resides.
- The independent (group) variable is preferably nominal, although it can
be ordinal.
- The dependent variable should ideally be interval, although you can use
an ordinal variable with multiple categories.
- Use Webstats to perform Frequency runs for each of the variables to
identify missing values and recodes. For the group variable, make note
of the lowest and the highest relevant value.
- Set the Type of Analysis to "Oneway ANOVA" and hit Proceed
- Select the group variable and enter the lowest value as the 'first' group
value. Then enter the highest value as the 'second' group value.
- Select the dependent variable and indicate any missing values.
- Finally, enter any recodes (if necessary) and hit Run.
- Based on the output, determine whether differences in the means on the
dependent variable across the categories of the independent (group) variable
are due to sampling error, or are actually representative of the population.
Make this judgment based on whether the significance level for the ANOVA test
is below .05.
- Repeat the steps above until you find a pair of variables that yield
significant results for the ANOVA test.
- Dataset:
- Group (independent) Variable:
- [Q1090] Region
                                          Valid      Cum
Value Label     Value  Frequency  Percent  Percent  Percent
East             1.00        732     22.4     22.4     22.4
South            2.00        773     23.7     23.7     46.1
Midwest          3.00       1055     32.3     32.3     78.5
West             4.00        702     21.5     21.5    100.0
                         -------  -------  -------
Total                       3262    100.0    100.0

Valid cases 3262    Missing cases 0
- Dependent Variable:
- [Q1005] Political Ideology
                                                Valid      Cum
Value Label           Value  Frequency  Percent  Percent  Percent
Very conservative      1.00        329     10.1     10.3     10.3
Fairly conservative    2.00        906     27.8     28.5     38.8
Middle of the road     3.00       1195     36.6     37.6     76.4
Fairly liberal         4.00        572     17.5     18.0     94.4
Very liberal           5.00        179      5.5      5.6    100.0
Decline to answer     -9.00         21       .6  Missing
Not sure              -8.00         60      1.8  Missing
                               -------  -------  -------
Total                             3262    100.0    100.0

Valid cases 3181    Missing cases 81
- Hypothesis Arrow Diagram:
- Region → Political Ideology
- Syntax
- get file="/homes/josephf/webstats/CCFRpop.sav".
  missing values Q1005 (-8,-9).
  oneway Q1005 by Q1090 (1,4)
    /ranges=scheffe
    /statistics=all.
- Syntax Legend
- Missing values are specified as usual
- The oneway (ANOVA) command lists the dependent
variable first, followed by the independent variable.
- The low and high values of the independent
variable must be specified
- The optional /ranges=scheffe produces a
grid indicating where significant differences lie
- The optional /statistics=all
produces a variety of useful statistics
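For readers who want to see what the missing-value and range settings do to the data, the following Python sketch mimics them on a handful of invented records. It is an analogue written for illustration, not a translation of how SPSS processes the file; `group_valid_scores` and the sample records are hypothetical.

```python
MISSING_CODES = {-8, -9}      # mirrors: missing values Q1005 (-8,-9)
GROUP_LOW, GROUP_HIGH = 1, 4  # mirrors: Q1090 (1,4)


def group_valid_scores(records):
    """Collect dependent-variable scores by group, skipping unusable cases.

    `records` is an iterable of (group_value, score) pairs.
    """
    groups = {}
    for group_value, score in records:
        if score in MISSING_CODES:
            continue  # "Decline to answer" or "Not sure"
        if not GROUP_LOW <= group_value <= GROUP_HIGH:
            continue  # group value outside the requested low-high range
        groups.setdefault(group_value, []).append(score)
    return groups


# Hypothetical respondents: (region code, ideology score).
sample = [(1, 3), (1, -9), (2, 4), (3, 2), (3, -8), (4, 5), (7, 3)]
print(group_valid_scores(sample))
```

Only cases with a valid score and a group value inside the stated range reach the analysis, which is why the ANOVA output reports 3181 cases rather than 3262.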
- Output
- - - - - O N E W A Y - - - - -

Variable     Q1005  Q1005 Political views
By Variable  Q1090  Q1090 Region

Analysis of Variance

                          Sum of       Mean        F        F
Source           D.F.    Squares    Squares    Ratio    Prob.
Between Groups      3    25.2964     8.4321   7.9768    .0000
Within Groups    3177  3358.3421     1.0571
Total            3180  3383.6385
                              Standard  Standard
Group       Count     Mean   Deviation     Error   95 Pct Conf Int for Mean
East          713   2.8962      1.0211     .0382   2.8211 TO 2.9713
South         753   2.8300       .9636     .0351   2.7611 TO 2.8989
Midwest      1027   2.6758      1.0774     .0336   2.6098 TO 2.7417
West          688   2.8561      1.0285     .0392   2.7791 TO 2.9331

Total        3181   2.8007      1.0315     .0183   2.7648 TO 2.8366

Fixed Effects Model             1.0281     .0182   2.7649 TO 2.8364
Random Effects Model                       .0524   2.6341 TO 2.9673

Random Effects Model - estimate of between component variance 9.365E-03
GROUP      MINIMUM   MAXIMUM
East        1.0000    5.0000
South       1.0000    5.0000
Midwest     1.0000    5.0000
West        1.0000    5.0000

TOTAL       1.0000    5.0000
Levene Test for Homogeneity of Variances

Statistic    df1    df2    2-tail Sig.
   8.0590      3   3177           .000
- - - - - O N E W A Y - - - - -

Variable     Q1005  Q1005 Political views
By Variable  Q1090  Q1090 Region

Multiple Range Tests: Scheffe test with significance level .05

The difference between two means is significant if
  MEAN(J)-MEAN(I) >= .7270 * RANGE * SQRT(1/N(I) + 1/N(J))
  with the following value(s) for RANGE: 3.96

(*) Indicates significant differences which are shown in the lower triangle

                    Midwest  South  West  East
Mean     Q1090
2.6758   Midwest
2.8300   South         *
2.8561   West          *
2.8962   East          *
- Interpretation
- The following figures are the most important ones for us.
- F Probability: This
is the most important figure of the ANOVA analysis. It is .0000 for the
present analysis, meaning that there is less than a 1 in 1,000 chance that
mean differences in ideology this large would arise from sampling error alone.
Thus, there is a significant difference in political ideology across the
regions.
- Means: These
figures show the mean liberalism score for respondents in each region with
East ranked highest and Midwest lowest.
- Confidence Intervals:
These ranges show the confidence intervals around each mean score. Using
these we get a first indication of where the significant differences lie. Note
that the mean score for the Midwest lies outside the confidence interval for
each of the other regions.
- The Confidence Intervals
are used in plotting the results of the Scheffe test.
- The Scheffe test uses
asterisks on a grid to indicate which pairs of groups differ significantly
from one another, making it easier to see that the Midwest mean differs
significantly from those for the other three regions, but those three
do not differ significantly among themselves.
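The asterisks in the Scheffe grid can be checked by hand with the rule printed in the output, MEAN(J)-MEAN(I) >= .7270 * RANGE * SQRT(1/N(I) + 1/N(J)). The Python sketch below applies that rule to every pair of regions, using the means and counts from the group table; the function name `scheffe_significant` is an illustrative assumption.

```python
from itertools import combinations
from math import sqrt

# Group means and counts copied from the ONEWAY output.
groups = {
    "Midwest": (2.6758, 1027),
    "South": (2.8300, 753),
    "West": (2.8561, 688),
    "East": (2.8962, 713),
}
CONSTANT = 0.7270  # as printed in the output
RANGE = 3.96       # as printed in the output


def scheffe_significant(mean_i, n_i, mean_j, n_j):
    """Apply the significance rule printed above the Scheffe grid."""
    threshold = CONSTANT * RANGE * sqrt(1 / n_i + 1 / n_j)
    return abs(mean_j - mean_i) >= threshold


significant_pairs = [
    (a, b)
    for a, b in combinations(groups, 2)
    if scheffe_significant(*groups[a], *groups[b])
]
print(significant_pairs)
```

Only the three pairs involving the Midwest pass the threshold, which agrees with the asterisks in the Midwest column of the grid.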
- Did you find a significant result? If so, what is the likelihood
that you are making a Type I Error?
- How does the Independent Sample t-test differ from the ANOVA test?
- Recall that the measure of significance represents the likelihood of
making a Type I error. So if sig.=.03, then the likelihood that you are
making a Type I error by concluding that there is a relationship, when there
is actually none, is 3 in 100.
- The T-test compares only the mean for two categories of the group
variable, while ANOVA compares the mean across all the categories of the group
variable. Also, the group variable can be either nominal or ordinal for
the ANOVA test.
- The F-score is calculated as the ratio
of between-group to within-group variance using the figures in the Mean
Squares column of the first table. Thus 8.4321 divided by 1.0571 =
7.9766 (the output shows 7.9768 because it works from unrounded mean squares).
This figure is compared to the sampling distribution of F for 3 and 3177
degrees of freedom to determine significance. Unlike a z-score, F does not
count standard deviations from the mean of the sampling distribution, so the
1.96 cutoff does not apply here; the critical value at the .05 level for these
degrees of freedom is approximately 2.6. The F-score here approaches 8 and thus
easily passes significance at the .05 level.
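The F ratio can be reproduced from the Sum of Squares and D.F. columns of the Analysis of Variance table. The Python sketch below does that arithmetic; the critical value of 2.61 is an approximate figure taken from standard F tables for (3, 3177) degrees of freedom and is an assumption, since it does not appear in the output.

```python
# Figures copied from the Analysis of Variance table.
ss_between, df_between = 25.2964, 3
ss_within, df_within = 3358.3421, 3177

ms_between = ss_between / df_between  # mean square between groups
ms_within = ss_within / df_within     # mean square within groups
f_ratio = ms_between / ms_within

# Approximate .05 critical value of F with (3, 3177) degrees of freedom,
# taken from standard F tables (an assumption, not part of the output).
F_CRITICAL_05 = 2.61

print(f"MS between = {ms_between:.4f}")
print(f"MS within  = {ms_within:.4f}")
print(f"F = {f_ratio:.4f}, exceeds critical value: {f_ratio > F_CRITICAL_05}")
```

Working from the unrounded mean squares gives the F Ratio as printed in the output, rather than the slightly different figure obtained from the rounded values.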
- Standard Errors and confidence
intervals are calculated as they were in the previous lab.
- Thus the Standard Error again = square root of the variance divided by the
square root of n (the number of cases for the group). Remember: variance =
SD², so the square root of the variance is simply the SD, and for East this
yields 1.02/26.7 = .038.
- The 95% CI for Mean column is calculated by subtracting 1.96 times the
standard error from the mean score and adding it to the mean score. Since
for the East .038(1.96) = .074, then (2.896 - .074) = 2.82 while
(2.896 + .074) = 2.97.
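The standard error and confidence interval arithmetic for the East can be checked with a few lines of Python. This is a sketch using the 1.96 multiplier described above; the variable names are illustrative, and small differences from the output in the later decimals are to be expected from rounding and from the exact multiplier the program uses.

```python
from math import sqrt

# East row of the group statistics table.
mean_east, sd_east, n_east = 2.8962, 1.0211, 713

standard_error = sd_east / sqrt(n_east)  # SD divided by square root of n
margin = 1.96 * standard_error           # half-width of the 95% interval
ci_low, ci_high = mean_east - margin, mean_east + margin

print(f"SE = {standard_error:.4f}")
print(f"95% CI = {ci_low:.2f} to {ci_high:.2f}")
```

Substituting the mean, standard deviation and count for another region reproduces that region's row in the same way.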
- Derivation of the Sum of Squares and Mean Squares will be discussed in the
second term in connection with regression and two-way anova.