Part 8 Creating new variables

8.1 Creating scores from items

If our five Likert-type questions all ask about the same concept, we can combine them into a single composite score and save it as a new variable in our data frame, which we’ll call Energy. Then we can apply functions such as summary() and var() to that variable. Once we have a composite score based on multiple Likert-type items, we have a stronger case for treating the variable as continuous enough for operations like mean() and var(). Some statisticians will still frown on this practice, because combining the items assumes equal intervals between the response options of each Likert-type item. In practice, however, with composite variables this usually does not result in meaningful misinterpretations (Carifio & Perla, 2008; Norman, 2010).

dat$Energy <- apply(dat[ , Likerts], MARGIN = 1, mean)
# Let's view the data frame:
dat

##    PID Lik1 Lik2 Lik3 Lik4 Lik5 Teacher Energy
## 1    1    5    5    4    3    2      No    3.8
## 2    2    2    2    3    1    2     Yes    2.0
## 3    3    4    4    3    3    2      No    3.2
## 4    4    2    2    2    1    2     Yes    1.8
## 5    5    5    5    3    5    4      No    4.4
## 6    6    1    1    2    2    3     Yes    1.8
## 7    7    1    2    3    1    1     Yes    1.6
## 8    8    4    1    3    4    5      No    3.4
## 9    9    5    3    4    4    3      No    3.8
## 10  10    2    2    3    3    4     Yes    2.8

# Let's perform the summary function on the new variable.
summary(dat$Energy)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.60    1.85    3.00    2.86    3.70    4.40

Each person now has a score on our new variable, called Energy, which is the mean across the five Likert-type items. It looks like the average energy level was 2.86 on the 1 to 5 scale.

How did we accomplish this? Let’s break the code down so it makes sense:

Earlier we created the object Likerts using the c() function and the names of our variables of interest—the five Likert-type questions. In other words, we did this:

Likerts <- c("Lik1", "Lik2", "Lik3", "Lik4", "Lik5")

Then, we subsetted our data, using indexing, with the column index consisting of our object, Likerts, which comprises the names of the five variables:

dat[ ,Likerts]

We could have achieved the same result by embedding the c() function directly in the index, as in dat[ , c("Lik1", "Lik2", "Lik3", "Lik4", "Lik5")], but that would have made our line of code longer and harder to read. Also, we can re-use the Likerts object when we call other functions (such as head(dat[ , Likerts])).
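For readers following along without the earlier chapters at hand, the sketch below rebuilds a data frame that matches the printed output above and compares the two ways of subsetting. How dat was originally created is not shown in this section, so the data.frame() call, and storing Teacher as plain text, are assumptions made for illustration.

```r
# Rebuild the example data from the printed output (the construction is assumed).
dat <- data.frame(
  PID  = 1:10,
  Lik1 = c(5, 2, 4, 2, 5, 1, 1, 4, 5, 2),
  Lik2 = c(5, 2, 4, 2, 5, 1, 2, 1, 3, 2),
  Lik3 = c(4, 3, 3, 2, 3, 2, 3, 3, 4, 3),
  Lik4 = c(3, 1, 3, 1, 5, 2, 1, 4, 4, 3),
  Lik5 = c(2, 2, 2, 2, 4, 3, 1, 5, 3, 4),
  Teacher = c("No", "Yes", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Yes")
)

# Store the column names once so they can be re-used.
Likerts <- c("Lik1", "Lik2", "Lik3", "Lik4", "Lik5")

# Subset the columns with the stored names and with the names written inline.
items_by_name   <- dat[ , Likerts]
items_by_inline <- dat[ , c("Lik1", "Lik2", "Lik3", "Lik4", "Lik5")]

# Both approaches select the same five columns.
same_subset <- identical(items_by_name, items_by_inline)
print(same_subset)
head(items_by_name)
```

Storing the names in Likerts also means that a change to the item list only has to be made in one place.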

What is new here is the apply() function. We give it three arguments. The first specifies the data (rows and columns) we want to apply a function to, the second specifies the margin we want to apply the function along, and the third specifies the function we want applied (mean in our case).

An argument is an input that we pass to a function. Many functions, such as summary() and median(), are typically called with a single argument. When arguments are not named, R matches them by position, so their order matters; to supply them in a different order, we need to name each argument explicitly. The function above, with its three arguments made explicit, is dat$Energy <- apply(X = dat[ , Likerts], MARGIN = 1, FUN = mean). The argument names are X, MARGIN, and FUN. In our example, I made “MARGIN =” explicit to bring our attention to it, but we can omit the words “MARGIN =” and simply use the value (1 in our case) because it is the second argument in the function.
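To see positional and named arguments side by side, here is a minimal sketch. It uses only the first three respondents from the printed data, rebuilt here so the block runs on its own; the object names positional, named, and reordered are made up for this illustration.

```r
# First three respondents from the printed data (rebuilt here for illustration).
dat <- data.frame(
  Lik1 = c(5, 2, 4), Lik2 = c(5, 2, 4), Lik3 = c(4, 3, 3),
  Lik4 = c(3, 1, 3), Lik5 = c(2, 2, 2)
)
Likerts <- c("Lik1", "Lik2", "Lik3", "Lik4", "Lik5")

# Unnamed arguments are matched by position: data, then margin, then function.
positional <- apply(dat[ , Likerts], 1, mean)

# Named arguments say explicitly which value goes where.
named <- apply(X = dat[ , Likerts], MARGIN = 1, FUN = mean)

# Because every argument is named, the order no longer matters.
reordered <- apply(FUN = mean, MARGIN = 1, X = dat[ , Likerts])

# All three calls return the same row means.
print(identical(positional, named))
print(identical(named, reordered))
```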

In our example, our data are dat[ , Likerts], as specified by our first argument. The second argument tells R to apply the function along the first margin (MARGIN = 1), which refers to the rows. (MARGIN = 2 would apply the mean to each column, which is not what we seek to do here.) Just as in indexing, the first margin refers to rows and the second margin to columns. In other words, we wish to apply the mean function within each row, across the five columns, to get one mean score per row.¹⁹ The result is a vector of values, one for each person.

Look again at our code and notice that the arguments are separated by commas:

dat$Energy <- apply(dat[ , Likerts], 1, mean)
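The difference between the two margins is easiest to see by running both. The sketch below rebuilds the five item columns from the printed data (an assumption about how they are stored) and compares MARGIN = 1 with MARGIN = 2. It also checks the result against rowMeans(), a base R shortcut that this chapter does not use but that gives the same row means.

```r
# The five item columns from the printed output (rebuilt here for illustration).
dat <- data.frame(
  Lik1 = c(5, 2, 4, 2, 5, 1, 1, 4, 5, 2),
  Lik2 = c(5, 2, 4, 2, 5, 1, 2, 1, 3, 2),
  Lik3 = c(4, 3, 3, 2, 3, 2, 3, 3, 4, 3),
  Lik4 = c(3, 1, 3, 1, 5, 2, 1, 4, 4, 3),
  Lik5 = c(2, 2, 2, 2, 4, 3, 1, 5, 3, 4)
)
Likerts <- c("Lik1", "Lik2", "Lik3", "Lik4", "Lik5")

# MARGIN = 1: one mean per row, that is, one score per person.
row_means <- apply(dat[ , Likerts], 1, mean)

# MARGIN = 2: one mean per column, that is, one average per item.
col_means <- apply(dat[ , Likerts], 2, mean)

print(length(row_means))  # as many values as there are people
print(length(col_means))  # as many values as there are items
print(col_means)

# rowMeans() is a base R shortcut for the MARGIN = 1 case.
shortcut_matches <- isTRUE(all.equal(as.numeric(row_means),
                                     as.numeric(rowMeans(dat[ , Likerts]))))
print(shortcut_matches)
```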

8.2 Appending variables to our data frame

Finally, because the result is the same length as the number of rows in our data frame, we can append it as a new column using the $ symbol and a new name, “Energy”, that we create on the fly: dat$Energy.²⁰

Let’s again examine our new data frame to see the new variable we created. Notice that it is the last column. Also, let’s make sure the calculation was correct: For Person 1, the mean across the responses was $\frac{5+5+4+3+2}{5}=3.8$, which is the same value reported under the Energy column.

head(dat)

##   PID Lik1 Lik2 Lik3 Lik4 Lik5 Teacher Energy
## 1   1    5    5    4    3    2      No    3.8
## 2   2    2    2    3    1    2     Yes    2.0
## 3   3    4    4    3    3    2      No    3.2
## 4   4    2    2    2    1    2     Yes    1.8
## 5   5    5    5    3    5    4      No    4.4
## 6   6    1    1    2    2    3     Yes    1.8
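The same checks can be done in code rather than by eye. The sketch below uses the first three respondents only, rebuilt so the block is self-contained; the helper names energy, same_length, last_column, and person1_ok are invented for this illustration.

```r
# First three respondents from the printed data (rebuilt here for illustration).
dat <- data.frame(
  PID  = 1:3,
  Lik1 = c(5, 2, 4), Lik2 = c(5, 2, 4), Lik3 = c(4, 3, 3),
  Lik4 = c(3, 1, 3), Lik5 = c(2, 2, 2),
  Teacher = c("No", "Yes", "No")
)
Likerts <- c("Lik1", "Lik2", "Lik3", "Lik4", "Lik5")

# Compute the row means first, without attaching them yet.
energy <- apply(dat[ , Likerts], 1, mean)

# There is one value per row, so the vector fits as a new column.
same_length <- length(energy) == nrow(dat)

# Append the vector with $ and a new name; new columns go at the end.
dat$Energy <- energy
last_column <- names(dat)[ncol(dat)]

# Check Person 1 against the hand calculation (5 + 5 + 4 + 3 + 2) / 5.
person1_ok <- isTRUE(all.equal(as.numeric(dat$Energy[1]), (5 + 5 + 4 + 3 + 2) / 5))

print(same_length)
print(last_column)
print(person1_ok)
```

Comparing with all.equal() rather than == is a cautious choice here, because decimal values such as 3.8 are stored only approximately.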

Try this apply() function out on the same data and create a new variable called EnergySum (or use whatever variable name you like, as long as it begins with a letter). In other words, instead of the mean, calculate a total score based on the sum of responses across the Likert-type questions.²¹ You should see this as the result:

##   PID Lik1 Lik2 Lik3 Lik4 Lik5 Teacher Energy EnergySum
## 1   1    5    5    4    3    2      No    3.8        19
## 2   2    2    2    3    1    2     Yes    2.0        10
## 3   3    4    4    3    3    2      No    3.2        16
## 4   4    2    2    2    1    2     Yes    1.8         9
## 5   5    5    5    3    5    4      No    4.4        22
## 6   6    1    1    2    2    3     Yes    1.8         9
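If you would like to check your work, here is one possible solution, not the only one. It rebuilds the six rows shown above so that it runs on its own, and it assumes that no responses are missing.

```r
# The six rows shown by head(dat), rebuilt here for illustration.
dat <- data.frame(
  PID  = 1:6,
  Lik1 = c(5, 2, 4, 2, 5, 1),
  Lik2 = c(5, 2, 4, 2, 5, 1),
  Lik3 = c(4, 3, 3, 2, 3, 2),
  Lik4 = c(3, 1, 3, 1, 5, 2),
  Lik5 = c(2, 2, 2, 2, 4, 3),
  Teacher = c("No", "Yes", "No", "Yes", "No", "Yes")
)
Likerts <- c("Lik1", "Lik2", "Lik3", "Lik4", "Lik5")

# Mean score per person, as before.
dat$Energy <- apply(dat[ , Likerts], 1, mean)

# Same call, but with sum in place of mean, gives a total score per person.
dat$EnergySum <- apply(dat[ , Likerts], 1, sum)

# With five complete items, dividing the total by 5 recovers the mean.
sum_matches_mean <- isTRUE(all.equal(dat$EnergySum / length(Likerts), dat$Energy))

print(dat)
print(sum_matches_mean)
```

Only the third argument changes between the two calls, which is the main appeal of apply(): the data and the margin stay the same while the function is swapped.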

References

Carifio, J., & Perla, R. (2008). Resolving the 50-year debate around using and misusing Likert scales. Medical Education, 42, 1150–1152. https://doi.org/10.1111/j.1365-2923.2008.03172.x

Norman, G. (2010). Likert scales, levels of measurement and the “laws” of statistics. Advances in Health Sciences Education, 15, 625–632. https://doi.org/10.1007/s10459-010-9222-y

¹⁹ We are applying the mean() function across Person 1’s responses to the Likert-type questions, then applying mean() across Person 2’s responses, and so forth until the last person. If we wanted to know the mean response for each column, that is, for each variable across all the people, we’d specify MARGIN = 2.

²⁰ We can name our new variable anything we like, as long as it begins with a letter. Numbers, periods, and underscores (_) are also acceptable elsewhere in variable names, but spaces and special characters should be avoided, as should names that are already used for functions, such as c or mean. Remember that R is case sensitive!

²¹ With composite scores calculated from multiple items that all use the same scale, you might find the mean to be more informative than the sum because the composite score can be interpreted on the same scale as the items. For instance, if each of our items’ response scales was 1 = Very low energy, 2 = Moderately low energy, 3 = Medium level of energy, 4 = Moderately high energy, and 5 = Very high energy, we can interpret Person 1’s score of 3.8 on Energy as being just below a moderately high energy level.