# Linear Regression Using R: An Introduction to Data Modeling

David J. Lilja, University of Minnesota

Copyright Year: 2016

ISBN 13: 9781946135001

Publisher: University of Minnesota Libraries Publishing

Language: English

## Conditions of Use

Attribution-NonCommercial

CC BY-NC

## Reviews

I was thinking of using this for a course in linear regression but unfortunately this doesn't include any of the mathematics behind the analysis. This can be used simply to learn how to do regression in R in a few hours without any knowledge of the algorithm used or the mathematics that goes into it.

The material and the CSV files provided for practicing regression with data is accurate and useful. The pictures included in the body of the text are very helpful to new R learners.

It could be more comprehensive as a textbook in linear regression. But it simply doesn't work as a college level textbook. Perhaps just used for one week of the course.

The writing, the pictures, and the attached data are very accessible and clear. It's amazing that the content is hyperlinked, too.

The material is consistently taught at the same level of difficulty, which makes it accessible for high school or middle school students.

The way the content is organized makes it easy to read and follow the topics with a good flow.

The order of the material presented in the book is logical and accessible for an easy level of difficulty.

The images and hyperlinked content are strong areas of the book that attracted me to browse through.

I didn't notice any grammatical errors.

Nothing insensitive in the data examples or content.

More appropriate for an elementary audience of linear regression. Not appropriate for a data science program. I would recommend this for middle school or high school students.

It is a very focused and short treatment of simple and multiple linear regression. It does not cover categorical predictors or interactions of predictors, doesn't spend much time interpreting slope coefficients or discussing confidence intervals for them, and has some technical issues with the methods discussed. So it is not a very comprehensive treatment of the topic but does provide a short introduction to simple and multiple linear regression model building with quantitative predictors and how to use R to do that. I like Chapter 5 best and the diagram for training and validation and how this fits into an engineering perspective on statistical model building and evaluation.

There are issues of bias not discussed that could be discussed with the missing data discussion and the sloppiness of this discussion propagates into issues with models and comparing models for different sets of responses that are problematic. Some of the notation is a little bit loose for the statistical modeling and inferences discussed – for example there is no distinction between population and sample estimates in the models. The example used is interesting to the potential engineering audience but there is little discussion of sampling of subjects/measurements and implication of that – especially as applied to the data set. There are some incorrect interpretations of p-values and the overall F-test. And some modeling choices are difficult to understand (including both square-root and original versions of the same predictors – why do we need both? How would you interpret the model with both? Doesn't this create multicollinearity issues?).
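The multicollinearity question raised in this review can be checked numerically. The following is a minimal sketch in R using simulated values, not the book's data set; the variable name `clock` and its range are hypothetical and chosen only for illustration.

```r
# Simulated predictor; NOT the book's data. The name and range are hypothetical.
set.seed(42)
clock <- runif(200, min = 100, max = 4000)

# Correlation between a predictor and its own square root
r <- cor(clock, sqrt(clock))

# With only these two terms in a model, the variance inflation factor
# of either one is 1 / (1 - r^2)
vif <- 1 / (1 - r^2)

r
vif
```

A correlation close to 1 and a large variance inflation factor would support the reviewer's concern. Whether that matters in practice depends on whether the model is used only for prediction or also for interpreting individual coefficients.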

It presents a good introduction to the topics promised. I am not sure this would be enough for a full semester of a course; it is more of a unit for a course. And it is a bit sparse on tools to prepare students to try to use these methods on their own similar data sets. It also may get dated with changes to the statistical software used within it, although the code used should last because it just uses “base R” functions. This also means it doesn’t leverage any evolutions that are occurring in statistical computing.

It is well written. Some of the technical issues cause me concern in using it for a course, but it is a nice book to read and has some interesting points covered.

It is internally consistent with how it presents the material.

It is not very modular and I would use all of it together as it builds on earlier material.

The organization of the topics is reasonable.

The links appear to be dynamic within the document but do not seem to work on my system. It is otherwise a nice platform and formatting for the material.

I am a bit pedantic on this, but “data” in statistics is a plural word and “datum” is singular. Some other wording issues would be characterized/defined differently if written by a statistician. These issues would prevent my use of the book in my classes. The word “significant” is used in multiple ways without clarification that it can be used to mean very different things.

The one example data set relates to computers and there is nothing else that might either possibly allow incorporation of, or cause concerns about, different cultural perspectives.

There are some technical issues with the discussions and examples that a discussion with a statistician could aid in resolving. I think it could resonate with an engineering audience and with some modest changes be very successful at presenting the material to that audience. I really liked Chapter 5 and found that to be the most successful topic presented.

For an introduction/tutorial to linear regression with R, this book quickly guides a novice to building a linear model and testing it.

My only problem is, the author calls variables in data sets "parameters". Within the context of linear regressions, I believe the term "parameters" should be reserved for coefficients in the model that will be estimated.

By showing linear regressions with the statistical software R, the book gives a modern and hands-on approach to the material.

I think the best thing about this book is its clarity. The clear and concise language of this book makes it very friendly to readers.

Using one variable that is modeled throughout the entire book allows for a nice connectedness between chapters.

The book is nicely broken down into easily digestible parts.

Topics appropriately build on one another.

I experienced no interface issues.

I detected no grammar issues.

The book is free of any culturally sensitive topics.

For the potential reader with little R programming and data science background, this book quickly allows someone to build a linear model from a given data set. Also, the book has a nice introduction to training and testing a linear model. With the author's clear and easy-to-read explanations, this will be a text that I will refer people to for quickly running linear regressions in R.

There are basic functions such as class() or typeof() that should be introduced early on for any user of R. Also, a practical explanation of residual standard error, or of what a nonsensical model for the example used throughout the text would look like, would be helpful for a beginner.
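For readers unfamiliar with the two functions named in this review, here is a short sketch of what they report in base R. The vector and data frame below are hypothetical and are not taken from the book's example data.

```r
# Hypothetical values, chosen only for illustration
x <- c(1.5, 2.5, 4.0)
df <- data.frame(clock = c(2000, 2400), perf = c(10.2, 13.7))

class(x)    # "numeric"
typeof(x)   # "double"
class(df)   # "data.frame"
typeof(df)  # "list"
```

class() reports how R treats an object at the user level, while typeof() reports its internal storage type, which is why a data frame shows up as a list.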

Using vocabulary to help students differentiate between an assumed model and a prediction equation would be helpful if you are planning to use this as a classroom text. Depending on how you are used to teaching regression, you may find many problematic uses of vocabulary or you may find none.

This text could easily be updated by either replacing parts or by adding new material.

The main vocab is touched on and explained well, minus some possible misuse of terminology depending upon how one teaches regression. As for technical R vocab, the use of 'row' early on in the text to describe the header of a data frame could also be problematic since the first row of a data frame typically refers to the first row of data, not the names of the columns.

Delivery and organization is consistent throughout.

Though additional material is needed in between, what is presented is nicely laid out.

The text follows the typical presentation of a traditional look at regression, which makes for a text that is clear and well organized.

There are places where code chunks unnecessarily spill over to the next page and some figures/tables that need to be relocated so that the reader does not come to them before they have actually been referred to in the text.

No issues

I think all is fine. I wouldn't see any particular computer processor feeling like they have been misrepresented or purposely left out.

I hope this review actually goes through this time??? It is my third attempt at trying to complete this before Qualtrics times me out. Sorry that I am so slow....... Also, below is a paragraph style review. I wrote this before seeing the actual format was going to be a survey type of setup.

Though well written and organized, this book may not be your "go to" resource if you're looking for a textbook or supplementary material when teaching an intro R and/or regression course. The author does use an example throughout that many can understand at least to some degree (influencers of computer performance), which exposes the reader to the useful concept that knowledge about a data set can be extremely useful. Further, the introduction of functions like attach() and update() are examples of how the author has nicely woven into the content a practical approach to how coding is part of analysis. The exploratory use of plot() to visualize the data before introducing a one-factor regression is another positive example of this.

However, there are some places throughout the book that might make you seriously question whether you could teach a course using this book (either as a stand-alone resource or just a supplementary one). The wording in some places can be confusing or even contradictory depending upon how you present regression, especially as an introduction to the topic where consistent use of vocabulary can be crucial. For example, consider in Section 3.2 where the mathematical form in (3.1) is referred to as the 'simplest regression model' yet the regression equation in (3.2) is similarly referred to as the 'final regression model'. Personally, I try to differentiate these two things for students first learning these concepts by stressing 'assumed model' using y as response and 'prediction equation' using y_hat as the response. Maybe this example isn't problematic for you, nonetheless I still suggest you carefully look through the entire book before adopting it as a resource for your students.
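The distinction between an assumed model and a prediction equation can be written out for a one-factor regression. The notation below is generic textbook notation and is an assumption; it is not copied from the book's equations (3.1) and (3.2).

$$
y = \beta_0 + \beta_1 x + \varepsilon \qquad \text{(assumed model)}
$$

$$
\hat{y} = b_0 + b_1 x \qquad \text{(prediction equation)}
$$

Here $\beta_0$ and $\beta_1$ are unknown population parameters and $\varepsilon$ is a random error term, while $b_0$ and $b_1$ are the estimates computed from a sample and $\hat{y}$ is the predicted response.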

This is a tutorial that covers basic areas and ideas of linear regression. It covers this material through carefully selected examples. R, the software used to present examples in the text, is an open source software which is appropriate and convenient for an open textbook. The book provides an effective and complete index and table of content with page numbers as links to the text.

The open source software (R) used to present data is as accurate as any commercially available software. The rest of the content is accurate and error-free.

As an introductory text, the content is up-to-date. As a basic topic in regression theory, linear regression is here to stay. With the current growth of data mining, it is difficult to imagine the future of data analytics without linear regression. The text is written and arranged in such a way that important updates will be easy to implement.

The text is clear and accessible to readers with standard elementary statistical background. It provides explicit guidance for R and the context for statistical terms is clear. The concepts are well explained.

The exposition is consistently clear and well-motivated by examples. The level and presentation are consistent as well. The text uses consistent, standard, and elementary terminology appropriately introduced to deal with linear regression models.

The text, not overly self-referential, is presented in eight chapters, each with a hyperlink to the text. Each chapter has short sections. In addition, each page number in the Index is a hyperlink to the text.

The topics in the text are well motivated by examples that should make the subject more interesting to the reader. The organization is excellent, making each topic clear and easy to read.

It would have been nice to have color images in the Figures. Also, Figure 4.1 (CHAPTER 4. MULTI-FACTOR REGRESSION) would be clearer if it showed only a few of the pairwise comparisons for the Int2000 data frame. But these are just two minor issues of display.

I did not find grammatical errors.

The text is not culturally insensitive or offensive in any way. It uses examples that are culturally neutral.

I would use this tutorial in any undergraduate course dealing with linear regression.

## Table of Contents

1 Introduction

- 1.1 What is a Linear Regression Model?
- 1.2 What is R?
- 1.3 What's Next?

2 Understand Your Data

- 2.1 Missing Values
- 2.2 Sanity Checking and Data Cleaning
- 2.3 The Example Data
- 2.4 Data Frames
- 2.5 Accessing a Data Frame

3 One-Factor Regression

- 3.1 Visualize the Data
- 3.2 The Linear Model Function
- 3.3 Evaluating the Quality of the Model
- 3.4 Residual Analysis

4 Multi-factor Regression

- 4.1 Visualizing the Relationships in the Data
- 4.2 Identifying Potential Predictors
- 4.3 The Backward Elimination Process
- 4.4 An Example of the Backward Elimination Process
- 4.5 Residual Analysis
- 4.6 When Things Go Wrong

5 Predicting Responses

- 5.1 Data Splitting for Training and Testing
- 5.2 Training and Testing
- 5.3 Predicting Across Data Sets

6 Reading Data into the R Environment

- 6.1 Reading CSV files

7 Summary

8 A Few Things to Try Next

Bibliography

Index

## About the Book

*Linear Regression Using R: An Introduction to Data Modeling* presents one of the fundamental data modeling techniques in an informal tutorial style. Learn how to predict system outputs from measured data using a detailed step-by-step process to develop, train, and test reliable regression models. Key modeling and programming concepts are intuitively described using the R programming language. All of the necessary resources are freely available online.

## About the Contributors

### Author

**David J. Lilja** received a Ph.D. and an M.S., both in Electrical Engineering, from the University of Illinois at Urbana-Champaign, and a B.S. in Computer Engineering from Iowa State University in Ames. He is currently the Louis John Schnell Professor of Electrical and Computer Engineering at the University of Minnesota in Minneapolis, where he also serves as a member of the graduate faculties in Computer Science, Scientific Computation, and Data Science. Previously, he served ten years as the head of the ECE department at the University of Minnesota, worked as a research assistant at the Center for Supercomputing Research and Development at the University of Illinois, and as a development engineer at Tandem Computers Incorporated in Cupertino, California. He received a Fulbright Senior Scholar Award to visit the University of Western Australia, and was awarded a McKnight Land-Grant Professorship by the Board of Regents of the University of Minnesota. He has chaired and served on the program committees of numerous conferences, and was a distinguished visitor of the IEEE Computer Society. He was elected a Fellow of the Institute of Electrical and Electronics Engineers (IEEE) and a Fellow of the American Association for the Advancement of Science (AAAS) for contributions to the statistical analysis of computer performance. He also is a member of the ACM, and is a registered Professional Engineer. His main research interests include computer architecture, parallel processing, computer systems performance analysis, approximate computing, and storage systems.