Transforming Data in R

In the activity Linear Regression in R, we showed how to calculate and plot the "line of best fit" for a set of data. As a quick reminder, consider the normal average January minimum temperatures in 56 American cities, presented at the following URL:

http://lib.stat.cmu.edu/DASL/Datafiles/USTemperatures.html

This file is one of many data sets stored at the Data and Story Library. We've adapted the file to make it easier to import into R. You can download the adapted file at http://msenux.redwoods.edu/mathdept/R/USTemperatures.txt. Take note of the folder name where you save this file as USTemperatures.txt.

In the activity Importing Data in R, we showed one of the simplest methods to import a datafile into a dataframe in R.

> USTemps=read.table(file=file.choose(),header=TRUE)

The option file.choose() will pop open a dialog that allows the user to browse through their directory structure in an accustomed manner. Locate the file USTemperatures.txt in your directory structure, then click the Open button.

> USTemps=read.table(file=file.choose(),header=TRUE)
> USTemps
   Temp  Lat  Long
1    44 31.2  88.5
2    38 32.9  86.8
3    35 33.6 112.5
4    31 35.4  92.8
5    47 34.3 118.7
...
...
...
56   14 41.2 104.9

The first variable, Temp, contains the average January low temperature for a particular US city whose latitude and longitude are stored in the second and third variables, Lat and Long, respectively.
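When a session has to run without a dialog box, the call to file.choose() can be replaced by a path string. The following is a minimal sketch of that idea. The temporary file is only a stand-in holding the five rows displayed above, and the names path and sample_temps are ours, chosen for illustration. On your own machine you would instead pass the path to the folder where you saved USTemperatures.txt.

```r
# Stand-in file (an assumption): holds only the five rows shown above,
# not the full 56-city data set.
path <- tempfile(fileext = ".txt")
writeLines(c("Temp Lat Long",
             "44 31.2 88.5",
             "38 32.9 86.8",
             "35 33.6 112.5",
             "31 35.4 92.8",
             "47 34.3 118.7"), path)

# Same read.table() call as before, with an explicit path
# in place of file.choose().
sample_temps <- read.table(file = path, header = TRUE)
str(sample_temps)   # columns Temp, Lat, Long
```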

We now calculate the equation of the line of best fit.

> res=lm(Temp~Lat,data=USTemps)
> res

Call:
lm(formula = Temp ~ Lat, data = USTemps)

Coefficients:
(Intercept)          Lat  
    108.728       -2.110 

Hence, the equation of the line of best fit is given by:

Temp = -2.110 Lat + 108.728.
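As a quick illustration of how this equation can be used, the sketch below evaluates it at two latitudes. The coefficients are copied by hand from the output above, and predict_temp is a hypothetical helper name, not a built-in R function.

```r
# Coefficients copied from the lm() output above.
intercept <- 108.728
slope <- -2.110

# Hypothetical helper: predicted January minimum temperature at a latitude.
predict_temp <- function(lat) intercept + slope * lat

predict_temp(c(30, 40))   # 45.428 and 24.328
```

If the object res is still in your workspace, coef(res) returns the unrounded coefficients, so the results may differ slightly in the last digits.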

It's a simple matter to produce a scatterplot and the line of best fit.

> plot(Temp~Lat,data=USTemps)
> abline(res)

The above commands produce the scatterplot and the line of best fit in Figure 1.

Figure 1. Plot of temperature versus latitude and the line of best fit.

In Figure 1, we see a general downward linear trend (which explains the negative slope). As the latitude increases, we move northward from the equator (which has latitude zero) and the temperature gets colder.

The Power Function

A power function has the following form, where a and b are constants.

y = a x^b

Figure 2. Note the location of the variable x in the power function.

It's important to note the location of the variable x in the power function in Figure 2. Later in this activity, we will contrast the power function with the exponential function, where the variable x is an exponent, rather than the base as in the equation in Figure 2.

Let's look at an example, choosing the power function y = 3x^2. First, let's produce a plot.

> x=seq(1,5,0.5)
> y=3*x^2
> plot(x,y)

The above sequence produces the plot shown in Figure 3.

Figure 3. An example of a power function.

It could be argued that the data are somewhat linear, showing a general upward trend. However, the bend in the curve in Figure 3 makes it more likely that the function is nonlinear.
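One informal way to back up what the eye sees is to look at how much y changes over equal steps in x. For a line, those changes would all be the same. The sketch below uses the same x and y as above; the name steps is ours.

```r
x <- seq(1, 5, 0.5)
y <- 3 * x^2

# Change in y over each equal step of 0.5 in x.
# For a line, every entry would be identical.
steps <- diff(y)
steps   # 3.75 5.25 6.75 8.25 9.75 11.25 12.75 14.25
```

The changes grow steadily, which is consistent with the bend in the plot.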

Transforming the Data

Let's take the logarithm of both sides of the power function y = 3x^2. The base of the logarithm is irrelevant; however, log x is understood to be the natural logarithm (base e) in R. That is, log x = log_e x in R.

log y = log(3x^2)

The log of a product is the sum of the logs, so we can write the following.

log y = log 3 + log x^2

Another property of logs allows us to move the exponent down.

log y = log 3 + 2 log x

This last form shows that if we plot the log of y versus the log of x, the graph will be linear with slope 2 and intercept log 3.

> plot(log(x),log(y))

The above command produces the plot shown in Figure 4.

Figure 4. Plotting the log of y versus the log of x produces a line.

Note that the plot of log y versus log x is linear!
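We can also confirm the linearity numerically rather than by eye. The sketch below computes the slope between each pair of neighboring points on the log-log scale, then recovers the intercept from each point. The names slopes and intercepts are ours.

```r
x <- seq(1, 5, 0.5)
y <- 3 * x^2

# Slope between consecutive points on the log-log scale.
slopes <- diff(log(y)) / diff(log(x))
slopes   # every entry is 2, up to rounding error

# Intercept recovered from each point: log y - 2 log x.
intercepts <- log(y) - 2 * log(x)
all.equal(intercepts, rep(log(3), length(x)))   # TRUE
```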

It's interesting to find the line of best fit for the transformed data.

> res=lm(log(y)~log(x))
> res

Call:
lm(formula = log(y) ~ log(x))

Coefficients:
(Intercept)       log(x)  
      1.099        2.000  

The slope is 2, which is the slope indicated in log y = log 3 + 2 log x. According to the same equation, the intercept should be log 3.

> log(3)
[1] 1.098612

This result agrees with the intercept found using R's lm command as reported in the variable res.

An important lesson has been learned.

Important Result: With any power function, the graph of the logarithm of the dependent variable versus the logarithm of the independent variable will be a line. Thus, if the graph of the logarithm of the response variable versus the logarithm of the independent variable is a line, then we should suspect that the relationship between the original variables is that of a power function.
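The result above can be packaged as a small function. This is only a sketch: fit_power is a hypothetical helper of our own, not part of R, and it assumes that all x and y values are positive so that the logarithms exist.

```r
# Hypothetical helper (not part of base R): fit y = a * x^b by regressing
# log(y) on log(x), then undoing the logarithm on the intercept.
fit_power <- function(x, y) {
  fit <- lm(log(y) ~ log(x))
  est <- unname(coef(fit))
  c(a = exp(est[1]), b = est[2])
}

x <- seq(1, 5, 0.5)
fit_power(x, 3 * x^2)   # a is 3 and b is 2, up to rounding error
```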

Resting Metabolic Rate

Let's apply what we've learned about power functions to a concrete example. In the table that follows, the resting metabolic rates of several classes of primates are presented along with the mass of each primate class. The resting metabolic rate (RMR) is defined to be the amount of energy required by the body when the body is doing nothing. These data are taken from Leonard and Robinson (1997, AJPA) and repeated in an article by Marcus Hamilton, Model-Fitting with Linear Regression: Power Functions.

Resting Metabolic Rate in Primates
Species         Weight   RMR
A. palliata        8.5   363
A. palliata        6.4   293
A. trivirgatus    0.85    46
A. geoffroyi      8.41   346
C. molloch         0.7    54
C. apella          2.6   143
C. albifrons       2.4   135
S. imperator       0.4    35
S. fusicollis      0.3    28
S. sciureus        0.8    66
C. albigena        7.9   327
C. guereza           7   265
M. fascicularis    5.5   331
P. anubis         29.3   956
P. anubis           13   520
H. lar               6   292
P. troglodytes    39.5  1036
P. troglodytes    29.8   839
P. pygmaeus       83.6  1948
P. pygmaeus       37.8  1074
S. syndactylus    10.5   408
!Kung               46  1383
!Kung               41  1099
Ache              59.6  1591
Ache              51.8  1394

We've massaged the data and stored it in the file RMR.txt. Download the file and save it as RMR.txt. Take note of the folder in which you save the file. Read the data into R as follows:

> primates=read.table(file=file.choose(),header=TRUE)
> primates
   Weight  RMR
1    8.50  363
2    6.40  293
3    0.85   46
4    8.41  346
5    0.70   54
6    2.60  143
7    2.40  135
8    0.40   35
9    0.30   28
10   0.80   66
11   7.90  327
12   7.00  265
13   5.50  331
14  29.30  956
15  13.00  520
16   6.00  292
17  39.50 1036
18  29.80  839
19  83.60 1948
20  37.80 1074
21  10.50  408
22  46.00 1383
23  41.00 1099
24  59.60 1591
25  51.80 1394

Plot the data.

> plot(RMR~Weight,data=primates)

The above command produces the plot shown in Figure 5.

Figure 5. Plotting RMR versus Weight.

One could argue that the data have a general linear appearance. However, there is evidence in Figure 5 of a slightly concave-down bend to the data, suggesting that we might use a power function (perhaps the square root function) to fit the data. This encourages us to try plotting the logarithm of the RMR versus the logarithm of the Weight as follows.

> plot(log(RMR)~log(Weight),data=primates)

The above command produces the plot shown in Figure 6.

Figure 6. Plotting the logarithm of the RMR versus the logarithm of the Weight.

Aha! The plot in Figure 6 shows a definite linear trend. Therefore, our suspicion that a power function might fit the original data set is probably valid (there are actual statistical tests for this assumption which we will explore in later activities). Let's find the line of best fit for the transformed data in Figure 6.

> res=lm(log(RMR)~log(Weight),data=primates)
> res

Call:
lm(formula = log(RMR) ~ log(Weight), data = primates)

Coefficients:
(Intercept)  log(Weight)  
     4.2409       0.7599  

This result tells us that we can fit a linear model to the transformed data as follows:

log RMR = 4.2409 + 0.7599 log Weight

Exponentiate both sides of the above equation.

e^{log RMR} = e^{4.2409 + 0.7599 log Weight}

On the left, the exponential and the logarithm are inverses. On the right, we use a property of exponents.

RMR = e^{4.2409} e^{0.7599 log Weight}

We use a property of logarithms to move the 0.7599 into the exponent.

RMR = e^{4.2409} e^{log Weight^{0.7599}}

We evaluate e^{4.2409}.

> exp(4.2409)
[1] 69.47035

This simplifies our result further.

RMR = 69.47035 e^{log Weight^{0.7599}}

Finally, we again use the fact that the logarithm and exponential are inverses.

RMR = 69.47035 Weight^{0.7599}

Note that this final function has the form y = a x^b, a power function that should "fit" the original data set.
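The algebra above can also be carried out directly in R. The sketch below re-enters the 25 rows of the table by hand, so that it runs without RMR.txt, then exponentiates the fitted intercept to obtain a and reads b off the slope. The names est, a, and b are ours.

```r
# The Weight and RMR columns from the table above, typed in by hand.
primates <- data.frame(
  Weight = c(8.5, 6.4, 0.85, 8.41, 0.7, 2.6, 2.4, 0.4, 0.3, 0.8, 7.9, 7,
             5.5, 29.3, 13, 6, 39.5, 29.8, 83.6, 37.8, 10.5, 46, 41, 59.6,
             51.8),
  RMR = c(363, 293, 46, 346, 54, 143, 135, 35, 28, 66, 327, 265, 331, 956,
          520, 292, 1036, 839, 1948, 1074, 408, 1383, 1099, 1591, 1394)
)

res <- lm(log(RMR) ~ log(Weight), data = primates)
est <- unname(coef(res))

a <- exp(est[1])   # multiplier: e raised to the intercept
b <- est[2]        # exponent: the slope of the log-log line
c(a = a, b = b)
```

Because this version uses the unrounded intercept, the value of a may differ from 69.47035 in the later decimal places.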

Let's provide some visual verification of our model. First, find the range (min and max) of the Weight data.

> range(primates$Weight)
[1]  0.3 83.6

Plot the original data, then superimpose a plot of the power function that we think will "fit" the data.

> plot(RMR~Weight,data=primates)
> x=seq(0.3,83.6,0.1)
> y=69.47035*x^0.7599
> lines(x,y,type="l",lwd=2,col="red")

These commands produce the scatterplot and the power function that "fits" the data shown in Figure 7.

Figure 7. The power function "fits" the original data.

The Exponential Function

An exponential function has the following form, where a and b are constants.

y = a b^x

Figure 8. In the exponential function, the independent variable is the exponent.

In the power function, the independent variable was the base, as in y = a x^b. In the exponential function, the independent variable is now an exponent, as in y = a b^x. This is a subtle but important difference.

Let's look at an example of an exponential function, choosing the example y = 3(2^x). First, let's produce a plot.

> x=seq(1,5,0.5)
> y=3*2^x
> plot(x,y)

The above sequence produces the plot shown in Figure 9.

Figure 9. An example of an exponential function.

The function in Figure 9 is clearly nonlinear.

Transforming the Data

Let's take the logarithm of both sides of the exponential function y = 3(2^x). The base of the logarithm is irrelevant; however, log x is understood to be the natural logarithm (base e) in R. That is, log x = log_e x in R.

log y = log(3(2^x))

The log of a product is the sum of the logs, so we can write the following.

log y = log 3 + log 2^x

Another property of logs allows us to move the exponent down.

log y = log 3 + x log 2

This last form shows that if we plot the log of y versus x, the graph will be linear with slope log 2 and intercept log 3.

> plot(x,log(y))

The above command produces the plot shown in Figure 10.

Figure 10. Plotting the log of y versus x produces a line.

Note that the plot of log y versus x is linear!
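As with the power function, we can check this numerically. The sketch below shows that the slope of log y against x is constant and equal to log 2, while the slope of log y against log x is not constant. This is why the horizontal axis is left untransformed for an exponential function. The names slopes and loglog_slopes are ours.

```r
x <- seq(1, 5, 0.5)
y <- 3 * 2^x

# Slope between consecutive points when only y is logged.
slopes <- diff(log(y)) / diff(x)
all.equal(slopes, rep(log(2), length(slopes)))   # TRUE

# For contrast: logging both axes does not give a constant slope here.
loglog_slopes <- diff(log(y)) / diff(log(x))
range(loglog_slopes)
```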

It's interesting to find the line of best fit for the transformed data.

> res=lm(log(y)~x)
> res

Call:
lm(formula = log(y) ~ x)

Coefficients:
(Intercept)            x  
     1.0986       0.6931 

The slope is 0.6931, which agrees with the slope indicated in log y = log 3 + x log 2; that is, the slope is log 2.

> log(2)
[1] 0.6931472

The intercept is 1.0986, which agrees with the intercept indicated in log y = log 3 + x log 2; that is, the intercept is log 3.

> log(3)
[1] 1.098612

This result agrees with the intercept found using R's lm command as reported in the variable res.

An important lesson has been learned.

Important Result: With any exponential function, the graph of the logarithm of the dependent variable versus the independent variable will be a line. Thus, if the graph of the logarithm of the response variable versus the independent variable is a line, then we should suspect that the relationship between the original variables is that of an exponential function.
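This result can also be packaged as a small function. Again, this is only a sketch: fit_exponential is a hypothetical helper of our own, not part of R, and it assumes that all y values are positive.

```r
# Hypothetical helper (not part of base R): fit y = a * b^x by regressing
# log(y) on x, then exponentiating both the intercept and the slope.
fit_exponential <- function(x, y) {
  fit <- lm(log(y) ~ x)
  est <- unname(coef(fit))
  c(a = exp(est[1]), b = exp(est[2]))
}

x <- seq(1, 5, 0.5)
fit_exponential(x, 3 * 2^x)   # a is 3 and b is 2, up to rounding error
```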

Cell Phone Subscribers

Let's apply what we've learned to a concrete example. In the table that follows, the number of cell phone subscriptions (in millions) is presented as a function of the number of years that have passed since 1987.

Cell Phone Subscribers
t (years since 1987)  subscribers (in millions)
                   1                        1.6
                   2                        2.7
                   3                        4.4
                   4                        6.4
                   5                        8.9
                   6                       13.1
                   7                       19.3
                   8                       28.2
                   9                       38.2
                  10                       48.7

Enter and plot the data.

> t=seq(1,10)
> subscribers=c(1.6,2.7,4.4,6.4,8.9,13.1,19.3,28.2,38.2,48.7)
> plot(t,subscribers)

These commands produce the scatterplot shown in Figure 11.

Figure 11. Cell phone subscribers versus years since 1987.

We suspect exponential growth. This leads us to plot the logarithm of the subscriber data versus the number of years since 1987.

> plot(t,log(subscribers))
> res=lm(log(subscribers)~t)
> abline(res)

These commands produce the scatterplot and the line of best fit shown in Figure 12.

Figure 12. Plotting log of subscribers versus the years since 1987.

The data of Figure 12 certainly appear to be linear. Let's examine the model of the line of best fit.

> res

Call:
lm(formula = log(subscribers) ~ t)

Coefficients:
(Intercept)            t  
     0.2630       0.3774  

This leads us to the following model.

log subscribers = 0.2630 + 0.3774t

Exponentiate both sides of the last equation.

e^{log subscribers} = e^{0.2630 + 0.3774t}

On the left, the exponential and logarithm are inverses. On the right, the exponential of a sum is the product of the exponentials.

subscribers = e^{0.2630} e^{0.3774t}

Finally, we can evaluate e^{0.2630}.

> exp(0.2630)
[1] 1.300827

This leads to the final form of the exponential model.

subscribers = 1.300827 e^{0.3774t}
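This model can be rewritten in the form y = a b^t by exponentiating the coefficient of t. The sketch below does that and also computes the doubling time implied by the model, which is log 2 divided by the coefficient of t. The names a, k, b, and doubling_time are ours, and the two numbers are copied by hand from the results above.

```r
a <- 1.300827   # from exp(0.2630) above
k <- 0.3774     # coefficient of t in the fitted line

b <- exp(k)                   # yearly growth factor in y = a * b^t
doubling_time <- log(2) / k   # years for the model to double

c(b = b, doubling_time = doubling_time)   # roughly 1.46 and 1.84
```

In other words, the model describes subscriptions growing by a factor of about 1.46 each year over this period.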

Let's provide some visual verification of our model.

> plot(t,subscribers)
> tt=seq(1,10,0.1)
> y=1.300827*exp(0.3774*tt)
> lines(tt,y,type="l",lwd=2,col="red")

The above commands will plot the original data, then superimpose a plot of the exponential function that we found to "fit" the data.

Figure 13. The exponential function 'fits' the original data.

Not bad!
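A table of observed and fitted values gives a second, numerical view of the fit. The sketch below refits the line, transforms the fitted values back to the original scale with exp(), and lists them beside the data. The names fitted_subscribers and comparison are ours.

```r
t <- 1:10
subscribers <- c(1.6, 2.7, 4.4, 6.4, 8.9, 13.1, 19.3, 28.2, 38.2, 48.7)

res <- lm(log(subscribers) ~ t)

# fitted() returns fitted values of log(subscribers); exp() undoes the log.
fitted_subscribers <- exp(fitted(res))

comparison <- data.frame(t, subscribers,
                         fitted = round(fitted_subscribers, 1),
                         residual = round(subscribers - fitted_subscribers, 1))
comparison
```

The residual column shows, in millions of subscribers, how far each observation lies from the curve.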

Enjoy!

We hope you enjoyed this introduction to the principles of transforming data in the R system. In upcoming activities, we will discuss further what is meant by a "good fit."