Part 7 Describing the data

7.1 The summary function

The summary() function gives a quick numerical overview of an object. We can describe all of the variables in a data frame at once by calling the function on our data frame object, dat:

summary(dat)
##       PID             Lik1           Lik2           Lik3        Lik4     
##  Min.   : 1.00   Min.   :1.00   Min.   :1.00   Min.   :2   Min.   :1.00  
##  1st Qu.: 3.25   1st Qu.:2.00   1st Qu.:2.00   1st Qu.:3   1st Qu.:1.25  
##  Median : 5.50   Median :3.00   Median :2.00   Median :3   Median :3.00  
##  Mean   : 5.50   Mean   :3.10   Mean   :2.70   Mean   :3   Mean   :2.70  
##  3rd Qu.: 7.75   3rd Qu.:4.75   3rd Qu.:3.75   3rd Qu.:3   3rd Qu.:3.75  
##  Max.   :10.00   Max.   :5.00   Max.   :5.00   Max.   :4   Max.   :5.00  
##       Lik5        Teacher         
##  Min.   :1.00   Length:10         
##  1st Qu.:2.00   Class :character  
##  Median :2.50   Mode  :character  
##  Mean   :2.80                     
##  3rd Qu.:3.75                     
##  Max.   :5.00

Note Notice that summary() also summarized the ID variable, PID, which is not meaningful; we would not include it in a report. This is simply a quick way to look at the data. If any of the minimum or maximum values fell outside the legitimate range for a variable, we would need to do some data cleaning.
Also, we can see that Teacher is not a numeric variable. For this variable, summary() does not return the mean, median, and so forth, because those would not make sense with non-numeric data. Instead, because Teacher is stored as a character variable, only its length, class, and mode are reported.
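
What summary() reports for a non-numeric variable depends on how that variable is stored. The following is a minimal sketch using a made-up vector, teacher (the names and values are invented for illustration and are not taken from dat): a character vector yields only Length, Class, and Mode, whereas a factor, or a call to table(), yields a count per category.

```r
# Hypothetical teacher labels, invented for illustration only
teacher <- c("Smith", "Jones", "Smith", "Lee", "Jones", "Smith")

# Character vector: summary() reports Length, Class, and Mode
summary(teacher)

# Factor: summary() reports the number of observations per category
teacher_counts <- summary(factor(teacher))
teacher_counts

# table() gives the same counts without converting the variable
table(teacher)
```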

We can apply the summary() function to a single variable:

summary(dat$Lik1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    3.00    3.10    4.75    5.00
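
For a numeric vector, the minimum, quartiles, and maximum printed by summary() are the values that quantile() returns with its default settings. Here is a small sketch, assuming a toy vector x of ten responses on a 1 to 5 scale (invented for illustration, standing in for dat$Lik1):

```r
# Toy responses, invented for illustration only
x <- c(2, 4, 5, 1, 1, 2, 2, 4, 5, 5)

# Six-number summary: min, quartiles, mean, max
summary(x)

# The same min, quartiles, and max, without the mean
quantile(x)

# Specific percentiles can be requested with probs
quartiles <- quantile(x, probs = c(0.25, 0.75))
quartiles
```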

We can use many functions on subsets of data, using indexing. Which rows are included in the following summary function?

summary(dat[1:3,"Lik1"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   3.667   4.500   5.000
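
In dat[1:3, "Lik1"], the part before the comma selects rows and the part after it selects columns, so the summary covers only the first three rows of Lik1. The sketch below illustrates this with a small stand-in data frame, dat_toy, whose values are invented for illustration; it also shows a logical condition used as a row index.

```r
# Hypothetical stand-in for dat; values are invented for illustration
dat_toy <- data.frame(
  PID  = 1:6,
  Lik1 = c(2, 4, 5, 1, 3, 2)
)

# Rows 1 to 3 of the Lik1 column
first_three <- dat_toy[1:3, "Lik1"]
first_three
summary(first_three)

# Rows can also be chosen with a logical condition,
# here the participants whose Lik1 response is 3 or higher
high_lik1 <- dat_toy[dat_toy$Lik1 >= 3, "Lik1"]
summary(high_lik1)
```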

Let’s use summary() on the Likerts object we created above.[1]

summary(dat[,Likerts])
##       Lik1           Lik2           Lik3        Lik4           Lik5     
##  Min.   :1.00   Min.   :1.00   Min.   :2   Min.   :1.00   Min.   :1.00  
##  1st Qu.:2.00   1st Qu.:2.00   1st Qu.:3   1st Qu.:1.25   1st Qu.:2.00  
##  Median :3.00   Median :2.00   Median :3   Median :3.00   Median :2.50  
##  Mean   :3.10   Mean   :2.70   Mean   :3   Mean   :2.70   Mean   :2.80  
##  3rd Qu.:4.75   3rd Qu.:3.75   3rd Qu.:3   3rd Qu.:3.75   3rd Qu.:3.75  
##  Max.   :5.00   Max.   :5.00   Max.   :4   Max.   :5.00   Max.   :5.00
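
A summary of several columns at once is a convenient place to look for out-of-range values, as the earlier note suggested. The check can also be done programmatically. This is a hedged sketch, assuming the legitimate responses run from 1 to 5 and using a stand-in data frame, dat_toy, with one deliberately invalid value; likert_cols is a hypothetical name playing the role of the Likerts object.

```r
# Hypothetical stand-in for dat; values are invented for illustration
dat_toy <- data.frame(
  Lik1 = c(2, 4, 5, 1),
  Lik2 = c(1, 2, 7, 3)   # the 7 is a deliberate data-entry error
)
likert_cols <- c("Lik1", "Lik2")

# For each column, is any value outside the assumed 1 to 5 range?
out_of_range <- sapply(
  dat_toy[, likert_cols],
  function(x) any(x < 1 | x > 5, na.rm = TRUE)
)
out_of_range

# Row numbers of the offending values in Lik2
bad_rows <- which(dat_toy$Lik2 < 1 | dat_toy$Lik2 > 5)
bad_rows
```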

7.2 Specific functions for summarizing data

We can also call individual functions such as min(), max(), and median() and assign their results to objects that we can use later:

minX <- min(dat$Lik1)
maxX <- max(dat$Lik1)
medX <- median(dat$Lik1)

Note In each of the three lines, we applied a function to the variable and assigned the output to an object. The object names minX, maxX, and medX are arbitrary.

Let’s view the objects in the console by typing them and running them:

minX
medX
maxX

Here’s the output:

minX
## [1] 1
medX
## [1] 3
maxX
## [1] 5

Let’s subtract minX from maxX to have a look at the range:

maxX - minX
## [1] 4
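
Base R can also return both extremes in one call: range() gives the minimum and maximum together, and diff() applied to that result gives the distance between them. A small sketch with a toy vector x (invented for illustration, not the tutorial's data):

```r
# Toy responses, invented for illustration only
x <- c(2, 4, 5, 1, 1, 2, 2, 4, 5, 5)

# Minimum and maximum as a vector of length two
rng <- range(x)
rng

# Distance between the maximum and the minimum
width <- diff(rng)
width
```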

R provides many other descriptive functions. For example, mean(), var(), and sd() return the mean, variance, and standard deviation of a variable:

mean(dat$Lik1)
## [1] 3.1
var(dat$Lik1)
## [1] 2.766667
sd(dat$Lik1)
## [1] 1.66333
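
These statistics are closely related: var() is the sum of squared deviations from the mean divided by n - 1 (the sample variance), and sd() is the square root of var(). The sketch below checks both relationships on a toy vector x (invented for illustration):

```r
# Toy responses, invented for illustration only
x <- c(2, 4, 5, 1, 1, 2, 2, 4, 5, 5)
n <- length(x)

# Sample variance computed by hand, with n - 1 in the denominator
manual_var <- sum((x - mean(x))^2) / (n - 1)

# Both comparisons should return TRUE
all.equal(manual_var, var(x))
all.equal(sqrt(var(x)), sd(x))
```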

A word of caution is in order when working with Likert-type data such as these. We should probably not estimate the mean, variance, and standard deviation of individual Likert-type variables that have fewer than five categories, because these statistics treat the variable as continuous and interval-level; that is, they assume that the conceptual distance between any two neighboring points on the numbered scale is the same as the distance between any other two neighboring points.[2]
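
For a single Likert-type item, summaries that rely on the counts or the ordering of the categories lean less heavily on the equal-distance assumption. One possible sketch, using a toy vector x of responses on an assumed 1 to 5 scale (invented for illustration):

```r
# Toy responses, invented for illustration only
x <- c(2, 4, 5, 1, 1, 2, 2, 4, 5, 5)

# Number of responses in each category; setting the levels explicitly
# keeps categories that nobody chose in the table with a count of 0
counts <- table(factor(x, levels = 1:5))
counts

# Proportion of responses in each category
prop.table(counts)

# The median depends mainly on the ordering of the responses
# (with an even number of responses, R averages the two middle values)
median(x)
```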


  1. Remember that we do not use quotes with objects (but we did with the names of variables).

  2. The mean and variance are more appropriate for continuous, interval-level variables than for ordinal-level variables. Next in this tutorial, we create composite scores, which we can probably more comfortably treat as continuous.