Ling 423/640G: Cognitive Linguistics           

Ben Bergen

 

Meeting 10: Statistics

September 25, 2008

 

WARNING: This outline is meant to be used only as a preliminary, orientational resource for students and other researchers working on questions in quantitative linguistics. Mastery of its contents alone does not necessarily suffice to perform professional-grade statistics, so please consult other resources [like the reading for today, or this: http://davidmlane.com/hyperstat/index.html] before proceeding with work to be presented publicly. For a more detailed version of this document, look under the links at: http://www2.hawaii.edu/~bergen/lcl/

 

Inferential statistics

 

Inferential statistics are tests you apply to quantitative data to determine how likely it is that the results you observe are due to chance, or whether they are instead statistically significant, in which case they are usually taken to generalize to a larger population. We will look at three classes of test: (1) chi-square and Fisher's exact tests, (2) regression, and (3) t-tests and ANOVA.

 

Throughout, you will mostly be trying to determine whether an effect you have measured is significant, by looking at the p value, which tells you the probability that chance alone would produce the distribution you observed [actually, that distribution or any less likely distribution]. If p is less than 0.05 (i.e. chance alone would produce a result like this less than 1 time in 20), the distribution is unlikely to have been produced by chance, and this is usually taken as a significant result.
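To see what counting the observed distribution together with any less likely distribution amounts to, here is a minimal sketch using a made-up case: a speaker chooses between two variants 20 times, and we ask how probable the observed split (or any less probable split) would be if each choice were a coin flip. The function name, the 0.5 chance level, and the counts are all assumptions for illustration, not part of the tests discussed in this outline.

```python
from math import comb

def binomial_two_tailed_p(successes, trials, chance=0.5):
    """Probability, under chance alone, of the observed count or any less likely count."""
    def prob(k):
        # Probability of exactly k successes in `trials` independent tries.
        return comb(trials, k) * chance ** k * (1 - chance) ** (trials - k)

    observed = prob(successes)
    # Add up every outcome that is no more probable than the observed one
    # (the small tolerance guards against floating-point ties).
    total = sum(prob(k) for k in range(trials + 1)
                if prob(k) <= observed * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical data: one variant chosen 16 times out of 20.
p = binomial_two_tailed_p(16, 20)
print("p =", round(p, 4), "significant" if p < 0.05 else "not significant")
```

The same logic (sum the probabilities of everything at least as unlikely as what you saw) underlies the two-tailed p values reported by the tests below.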

 

Preliminaries

 

To pick a statistical method, you need to minimally know the following: [1] your dependent variable[s], [2] your independent variable[s], [3] whether each of these is treated as continuous or categorical, [4] how many of each type of variable you have, and [5] how many categories [levels] there are in each categorical variable.

 

Chi-Square and Fisher's Exact

 

The simplest case is when you have a single categorical dependent variable and a single categorical independent variable. In this case, the question you're asking is whether the distribution of the dependent variable across its categories differs significantly between the levels of the independent variable. For example, when people are drunk or not [categorical independent variable], do they say "officer" or "occifer" [categorical dependent variable]? You can only use a chi-square test if you have at least five times as many observations as cells. So if you have an independent variable with two levels and a dependent variable also with two levels, you have 4 cells, which means you need at least 20 observations; otherwise the method won't work. This is a bare minimum - you'll usually need many more observations. Fisher's Exact can be robust with fewer observations, but it's not advisable to use it in these cases.

 

Use chi-square... if you have more than two levels [categories] in either variable. For example, if you have three possible conditions of the independent variable, e.g. drunk, on speed, neither, then use chi-square.  http://www.physics.csbsju.edu/stats/contingency_NROW_NCOLUMN_form.html

 


Use Fisher's exact... if you have exactly two levels (categories) in both variables. http://www.quantitativeskills.com/sisa/statistics/fisher.htm

 

You're interested in the two-tailed p value.
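For readers who want to see what a Fisher's exact calculator is doing with a 2x2 table, here is a minimal sketch in plain Python. It assumes the common convention of summing the probabilities of all tables (with the same row and column totals) that are no more probable than the observed one; the function name and the counts are hypothetical, and an online calculator may use a slightly different two-tailed convention.

```python
from math import comb

def fisher_exact_two_tailed(table):
    """Two-tailed Fisher's exact p value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability of k in the top-left cell,
        # holding all row and column totals fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    total = sum(prob(k) for k in range(lowest, highest + 1)
                if prob(k) <= observed * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical counts: rows = drunk / sober, columns = "occifer" / "officer".
counts = [[8, 2],
          [1, 5]]
print("two-tailed p =", round(fisher_exact_two_tailed(counts), 4))
```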

 

For both of these, it doesn't matter whether you put the independent and dependent variables in rows or columns; just be consistent.
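The chi-square calculation, the five-observations-per-cell minimum, and the point about rows and columns can be sketched as follows. This is an illustration under stated assumptions, not the method of any particular calculator: the function names and the counts are made up, the test is the ordinary Pearson chi-square without any continuity correction, and the tail probability uses a closed-form expression for the chi-square distribution so that only the standard library is needed.

```python
import math

def chi2_tail(x, df):
    """Probability that a chi-square variable with df degrees of freedom is >= x."""
    y = x / 2
    if df % 2 == 0:
        return math.exp(-y) * sum(y ** i / math.factorial(i) for i in range(df // 2))
    extra = sum(y ** (i - 0.5) / math.gamma(i + 0.5)
                for i in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(y)) + math.exp(-y) * extra

def chi_square_test(table):
    """Return (statistic, degrees of freedom, p) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    cells = len(row_totals) * len(col_totals)
    if n < 5 * cells:
        raise ValueError(f"need at least {5 * cells} observations, got {n}")
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, chi2_tail(stat, df)

# Hypothetical counts: rows = drunk / on speed / neither,
# columns = "occifer" / "officer".
counts = [[12, 18], [20, 10], [27, 3]]
transposed = [list(col) for col in zip(*counts)]
print(chi_square_test(counts))
print(chi_square_test(transposed))  # same result with rows and columns swapped
```

Swapping rows and columns leaves the expected counts, and therefore the statistic and p value, unchanged.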

[Figure: fisherexact.tiff]


 

Regression

 


When the independent and dependent variables are all continuous, use linear regression. Linear regression attempts to explain the relationship between these two variables with a straight line fit to the data. To get an intuitive idea of how regression works, go here: http://www.mste.uiuc.edu/activity/regression/
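As a rough illustration of what fitting a straight line to the data involves, here is a minimal least-squares sketch in plain Python. The function name and the data points are hypothetical. It returns the slope, the intercept, the proportion of variance explained (r squared), and the F statistic that a regression ANOVA table would report; turning that F statistic into a p value requires the F distribution with (1, n - 2) degrees of freedom, which this sketch leaves to a statistics program.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares line y = intercept + slope * x, plus r squared and F."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    # F for the regression, with (1, n - 2) degrees of freedom.
    # Undefined (division by zero) if the fit is perfect, i.e. r_squared == 1.
    f_stat = r_squared / (1 - r_squared) * (len(xs) - 2)
    return slope, intercept, r_squared, f_stat

# Hypothetical measurements of two continuous variables.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept, r_squared, f_stat = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} * x, r^2 = {r_squared:.2f}, F = {f_stat:.2f}")
```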

 

To perform regression, use any statistics program (in the LAE labs we have SPSS), or this: http://www.wessa.net/slr.wasp

 

[Figure: example regression result, p < 0.0001]


You're interested in the p value reported in the accompanying ANOVA table.

 

T-test and ANOVA

 

When you have a continuous dependent variable and categorical independent variable(s), use a T-test or ANOVA. Most of the studies we're looking at in this course have used one of these. These tests will tell you if there is a significant difference in means of a continuous dependent variable given the different levels of the categorical independent variable, or combinations of levels of multiple independent variables.

 

If you have more than one independent variable, or your independent variable has more than two levels, then perform ANOVA. Otherwise, you can use a T-Test. You also need to know whether your independent variables are between- or within-subjects [or items]. For any analysis in which at least one independent variable is within-subjects or within-items, you have to use a Paired T-Test or Repeated-Measures ANOVA. In other cases, use an unpaired T-Test or Univariate [Factorial] ANOVA.
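The decision rule in this paragraph can be written out as a small function. This is only a restatement of the rule for a continuous dependent variable with categorical independent variable(s); the function and parameter names are made up for illustration.

```python
def choose_means_test(n_independent_vars, max_levels, any_within):
    """Pick a test for a continuous dependent variable.

    n_independent_vars: number of categorical independent variables
    max_levels: largest number of levels in any independent variable
    any_within: True if at least one independent variable is
                within-subjects or within-items
    """
    needs_anova = n_independent_vars > 1 or max_levels > 2
    if needs_anova:
        return "Repeated-Measures ANOVA" if any_within else "Univariate (Factorial) ANOVA"
    return "Paired T-Test" if any_within else "Unpaired T-Test"

# One two-level variable, every subject tested in both conditions.
print(choose_means_test(1, 2, any_within=True))
# Two variables, each subject in only one combination of conditions.
print(choose_means_test(2, 2, any_within=False))
```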

 

To perform T-Tests, use this: http://www.physics.csbsju.edu/stats/t-test.html . To perform univariate ANOVA, use this: http://www.physics.csbsju.edu/stats/anova.html; to perform Repeated-Measures ANOVA, use SPSS in the labs. You're looking for the p value once again.
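For readers curious about what a T-Test calculator computes, here is a minimal sketch of the paired and unpaired cases in plain Python. Several things here are assumptions: the unpaired version is the classic pooled-variance (equal variances) test rather than Welch's variant, the p value is obtained by numerically integrating the t distribution with Simpson's rule instead of using a statistics library, and the function names and reaction times are invented.

```python
import math
from statistics import mean, variance

def t_two_tailed_p(t, df, steps=2000):
    """Two-tailed p for a t statistic, by Simpson's-rule integration (steps must be even)."""
    coef = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)

    def density(x):
        return coef * (1 + x * x / df) ** (-(df + 1) / 2)

    upper = abs(t)
    h = upper / steps
    total = density(0) + density(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    central_area = 2 * total * h / 3  # area between -|t| and +|t|
    return max(0.0, 1 - central_area)

def unpaired_t(a, b):
    """Pooled-variance t test for two independent groups: (t, df, p)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, df, t_two_tailed_p(t, df)

def paired_t(a, b):
    """Paired t test on the per-subject (or per-item) differences: (t, df, p)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / math.sqrt(variance(diffs) / n)
    return t, n - 1, t_two_tailed_p(t, n - 1)

# Hypothetical reaction times (ms) in two conditions.
condition_a = [512, 498, 530, 541, 507, 525]
condition_b = [548, 530, 551, 570, 529, 560]
print("unpaired:", unpaired_t(condition_a, condition_b))
print("paired:  ", paired_t(condition_a, condition_b))
```

The paired version only makes sense when the two lists line up subject by subject (or item by item), which is exactly the within- versus between- distinction above.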