## Transforming Data in R

In the activity Linear Regression in R, we showed how to calculate and plot the "line of best fit" for a set of data. As a quick reminder, consider the normal average January minimum temperatures in 56 American cities, presented at the following URL:

http://lib.stat.cmu.edu/DASL/Datafiles/USTemperatures.html

This file is one of many data sets stored at the Data and Story Library. We've adapted the file to make it easier to import into R. You can download the adapted file at http://msenux.redwoods.edu/mathdept/R/USTemperatures.txt. Take note of the folder name where you save this file as **USTemperatures.txt**.

In the activity Importing Data in R, we showed one of the simplest methods to import a datafile into a dataframe in R.

> USTemps=read.table(file=file.choose(),header=TRUE)

The option **file.choose()** will pop open a dialog that allows the user to browse through their directory structure in an accustomed manner. Locate the file **USTemperatures.txt** in your directory structure, then click the **Open** button.

> USTemps=read.table(file=file.choose(),header=TRUE) > USTemps Temp Lat Long 1 44 31.2 88.5 2 38 32.9 86.8 3 35 33.6 112.5 4 31 35.4 92.8 5 47 34.3 118.7 ... ... ... 56 14 41.2 104.9

The first variable **Temp** contains the average January low temperature for a particular US city having latitude and longtitude stored in the second and third variables, **Lat** and **Long**, respectively.

We now calculate the equation of the line of best fit.

> res=lm(Temp~Lat,data=USTemps) > res Call: lm(formula = Temp ~ Lat, data = USTemps) Coefficients: (Intercept) Lat 108.728 -2.110

Hence, the equation of the line of best fit is given by:

Temp = -2.110 Lat + 108.728.

It's a simple matter to produce a scatterplot and the line of best fit.

> plot(Temp~Lat,data=USTemps) > abline(res)

The above commands produce the scatterplot and the line of best fit in Figure 1.

Figure 1. Plot of temperature versus latitude and the line of best fit.

In Figure 1, we see a general downward linear trend (which explains the negative slope). As the latitude increases, we move northward from the equator (which has latitude zero) and the temperature gets colder.

### The Power Function

A *power function* has the following form.

**The Power Function**

Figure 2. Note the location of the variable *x* in the power function.

It's important to note the location of the variable *x* in the power function in Figure 2. Later in this activity, we will contrast the power function with the exponential function, where the variable *x* is an exponent, rather than the base as in the equation in Figure 2.

Let's look at an example, choosing the power function *y = 3x ^{2}*. First, let's produce a plot.

> x=seq(1,5,0.5) > y=3*x^2 > plot(x,y)

The above sequence produces the plot shown in Figure 3.

Figure 3. An example of a power function.

It could be argued that the data is somewhat linear, showing a general upward trend. However, it is more likely (based on Figure 3) that the function is nonlinear, due to the general bend in curve shown in Figure 3.

#### Tranforming the Data

Let's take the logarithm of both sides of the power function *y = 3x ^{2}*. The base of the logarithm is irrelevant; however,

*log x*is understood to be the natural logarithm (base

*e*) in R. That is,

*log x = log*in R.

_{e}x*log y = log 3x ^{2}*

The log of a product is the sum of the logs, so we can write the following.

*log y = log 3 + log x ^{2}*

Another property of logs allows us to move the exponent down.

*log y = log 3 + 2 log x*

This last form shows that if we plot the log of *y* versus the log of *x*, the graph will be linear with slope 2 and intercept *log 3*.

> plot(log(x),log(y))

The above command produces the plot shown in Figure 4.

Figure 4. Plotting the log of *y* versus the log of *x* produces a line.

Note that the plot of *log y* versus *log x* is linear!

It's interesting to find the line of best fit for the transformed data.

> res=lm(log(y)~log(x)) > res Call: lm(formula = log(y) ~ log(x)) Coefficients: (Intercept) log(x) 1.099 2.000

The slope is 2, which is the slope indicated in *log y = log 3 + 2 log x*. Supposedly, the intercept should be *log 3*.

> log(3) [1] 1.098612

This result agrees with the intercept found using R's **lm** command as reported in the variable **res**.

An important lesson has been learned.

#### Resting Metabolic Rate

Let's apply what we've learned about power functions to a concrete example. In the table that follows, the *resting metabolic rates* of several classs of primates are presented along with the mass of the primate class. The resting metabolic rate (RMR) is defined to be the amount of energy required by the body when the body is doing nothing. This data is taken from *Leonard and Robinson (1997, AJPA)* and repeated in and article by Marcus Hamilton, Model-Fitting with Linear Regression: Power Functions.

Resting Metabolic Rate in Primates | ||

Species | Weight | RMR |
---|---|---|

A. palliata | 8.5 | 363 |

A. palliata | 6.4 | 293 |

A. trivirgatus | 0.85 | 46 |

A. geoffroyi | 8.41 | 346 |

C. molloch | 0.7 | 54 |

C. apella | 2.6 | 143 |

C. albifrons | 2.4 | 135 |

S. imperator | 0.4 | 35 |

S. fusicollis | 0.3 | 28 |

S. sciureus | 0.8 | 66 |

C. albigena | 7.9 | 327 |

C. guereza | 7 | 265 |

M. fascicularis | 5.5 | 331 |

P. anubis | 29.3 | 956 |

P. anubis | 13 | 520 |

H. lar | 6 | 292 |

P. troglodytes | 39.5 | 1036 |

P. troglodytes | 29.8 | 839 |

P. pygmaeus | 83.6 | 1948 |

P. pygmaeus | 37.8 | 1074 |

S. syndactylus | 10.5 | 408 |

!Kung | 46 | 1383 |

!Kung | 41 | 1099 |

Ache | 59.6 | 1591 |

Ache | 51.8 | 1394 |

We've massaged the data and stored it in the file RMR.txt. Download the file and save it as **RMR.txt**. Take note of the folder in which you save the file. Read the data into R as follows:

> primates=read.table(file=file.choose(),header=TRUE) > primates Weight RMR 1 8.50 363 2 6.40 293 3 0.85 46 4 8.41 346 5 0.70 54 6 2.60 143 7 2.40 135 8 0.40 35 9 0.30 28 10 0.80 66 11 7.90 327 12 7.00 265 13 5.50 331 14 29.30 956 15 13.00 520 16 6.00 292 17 39.50 1036 18 29.80 839 19 83.60 1948 20 37.80 1074 21 10.50 408 22 46.00 1383 23 41.00 1099 24 59.60 1591 25 51.80 1394

Plot the data.

> plot(RMR~Weight,data=primates)

The above command produces the plot shown in Figure 5.

Figure 5. Plotting RMR versus Weight.

One could argue that the data has a general linear appearance. However, there is evidence in Figure 5 of a slightly concave down bend to the data, suggesting that we might use a power function (perhaps the square root function) to fit the data. This encourages us to try plot the logarithm of the RMR versus the logarithm of the Weight as follows.

> plot(log(RMR)~log(Weight),data=primates)

The above command produces the plot shown in Figure 6.

Figure 6. Plotting the logarithm of the RMR versus the logarithm of the Weight.

Aha! The plot in Figure 6 shows a definite linear trend. Therefore, our suspicion that a power function might fit the original data set is probably valid (there are actual statistical tests for this assumption which we will explore in later activities). Let's find the line of best fit for the transformed data in Figure 6.

> res=lm(log(RMR)~log(Weight),data=primates) > res Call: lm(formula = log(RMR) ~ log(Weight), data = primates) Coefficients: (Intercept) log(Weight) 4.2409 0.7599

This result tells us that we can fit a linear model to the transformed data as follows:

*log RMR = 4.2409 + 0.7599 log Weight*

Exponentiate both sides of the above equation.

*e ^{log RMR} = e^{4.2409 + 0.7599 log Weight}*

On the left, the exponential and the logarithm are inverses. On the right, we use a property of exponents.

*RMR = e ^{4.2409} e^{0.7599 log Weight}*

We use a property of logarithms to move the 0.7599 into the exponent.

*RMR = e ^{4.2409} e^{log Weight0.7599}*

We evaluate *e ^{4.2409}*.

> exp(4.2409) [1] 69.47035

This simplifies further our result.

*RMR = 69.47035 e ^{log Weight0.7599}*

Finally, we again use the fact that the logarithm and exponential are inverses.

*RMR = 69.47035 Weight ^{0.7599}*

Note that this final function has the form *y = a x ^{b}*, a power function that should "fit" the original data set.

Let's provide some visual verification of our model. First, find the range (min and max) of the Weight data.

> range(primates$Weight) [1] 0.3 83.6

Plot the original data, then superimpose a plot of the power function that we think will "fit" the data.

> plot(RMR~Weight,data=primates) > x=seq(0.3,83.6,0.1) > y=69.47035*x^0.7599 > lines(x,y,type="l",lwd=2,col="red")

These commands produce the scatterplot and the power function that "fits" the data shown in Figure 7.

Figure 7. The power function "fits" the original data.

### The Exponential Function

An *exponential function* has the following form.

**The Exponential Function**

Figure 8. In the exponential function, the independent variable is the exponent.

In the power function, the independent variable was the base, as in *y = a x ^{b}*. In the exponential function, the independent variable is now an exponent, as in

*y = a b*. This is a subtle but important difference.

^{x}Let's look at an example of an exponential function, choosing the example *y = 3(2 ^{x}*). First, let's produce a plot.

> x=seq(1,5,0.5) > y=3*2^x > plot(x,y)

The above sequence produces the plot shown in Figure 9.

Figure 9. An example of a exponential function.

The function in Figure 9 is clearly nonlinear.

#### Tranforming the Data

Let's take the logarithm of both sides of the exponential function *y = 3(2 ^{x})*. The base of the logarithm is irrelevant; however,

*log x*is understood to be the natural logarithm (base

*e*) in R. That is,

*log x = log*in R.

_{e}x*log y = log 3(2 ^{x})*

The log of a product is the sum of the logs, so we can write the following.

*log y = log 3 + log 2 ^{x}*

Another property of logs allows us to move the exponent down.

*log y = log 3 + x log 2*

This last form shows that if we plot the log of *y* versus *x*, the graph will be linear with slope *log 2* and intercept *log 3*.

> plot(log(x),log(y))

The above command produces the plot shown in Figure 4.

Figure 4. Plotting the log of *y* versus *x* produces a line.

Note that the plot of *log y* versus *x* is linear!

It's interesting to find the line of best fit for the transformed data.

> res=lm(log(y)~x) > res Call: lm(formula = log(y) ~ x) Coefficients: (Intercept) x 1.0986 0.6931

The slope is 0.6931, which agrees with the slope indicated in *log y = log 3 + x log 2*; that is, the slope is *log 2*.

> log(2) [1] 0.6931472

The intercept is 1.098, which agrees with the intercept indicated in *log y = log 3 + x log 2*; that is, the intercept is *log 3*.

> log(3) [1] 1.098612

This result agrees with the intercept found using R's **lm** command as reported in the variable **res**.

An important lesson has been learned.

#### Cell Phone Subscribers

Let's apply what we've learned to a concrete example. In the table that follows, the number of cell phone subscriptions (in millions) is presented as a function of the number of years that have passed since 1987.

Cell Phone Subscribers | |

t (years since 1987) | subscribers (in millions) |
---|---|

1 | 1.6 |

2 | 2.7 |

3 | 4.4 |

4 | 6.4 |

5 | 8.9 |

6 | 13.1 |

7 | 19.3 |

8 | 28.2 |

9 | 38.2 |

10 | 48.7 |

Enter and plot the data.

> t=seq(1,10) > subscribers=c(1.6,2.7,4.4,6.4,8.9,13.1,19.3,28.2,38.2,48.7) > plot(t,subscribers)

These commands produce the scatterplot shown in Figure 11.

Figure 11. Cell phone subscribers versus years since 1987.

We suspect exponential growth. This leads us to plot the logarithm of the subscriber data versus the year since 1987.

> plot(t,log(subscribers)) > res=lm(log(subscribers)~t) > abline(res)

These commands produce the scatterplot and the line of best fit shown in Figure 12.

Figure 12. Plotting log of subscribers versus the years since 1987.

The data of Figure 12 certainly appear to be linear. Let's examine the model of the line of best fit.

> res Call: lm(formula = log(subscribers) ~ t) Coefficients: (Intercept) t 0.2630 0.3774

This leads us to the following model.

*log subscribers = 0.2630 + 0.3774t*

Exponentiate both sides of the last equation.

*e ^{log subscribers} = e^{0.2630 + 0.3774t}*

On the left, the exponential and logarithm are inverses. On the right, the exponential of a sum is the product of the exponentials.

*subscribers = e ^{0.2630} e^{0.3774t}*

Finally, we can evaluate *e ^{0.2630}*.

> exp(0.2630) [1] 1.300827

This leads to the final form of the exponential model.

*subscribers = 1.300827 e ^{0.3774t}*

Let's provide some visual verification of our model.

> plot(t,subscribers) > t=seq(1,10,0.1) > y=1.300827*exp(0.3774*t) > lines(t,y,type="l",lwd=2,col="red")

The above commands will plot the original data, then superimpose a plot of the exponential function that we found to "fit" the data.

Figure 13. The exponential function 'fits' the original data.

Not bad!

### Enjoy!

We hope you enjoyed this introduction to the principles of transforming data in the R system. In upcoming activities, we will discuss further what is meant by a "good fit."