R Programming Fundamentals: Deal with data using various modeling techniques 9781789612998, 1789612993

Study data analysis and visualization to successfully analyze data with R Key FeaturesGet to grips with data cleaning me

242 106 9MB

English Pages 206 [202]

Report DMCA / Copyright

DOWNLOAD FILE

Polecaj historie

R Programming Fundamentals: Deal with data using various modeling techniques
 9781789612998, 1789612993

Citation preview

R Programming Fundamentals

Deal with data using various modeling techniques

Kaelen Medeiros

BIRMINGHAM - MUMBAI

R Programming Fundamentals Copyright © 2019 Packt Publishing All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information. Acquisitions Editors: Aditya Date, Bridget Neale Content Development Editor: Madhura Bal Production Coordinator: Ratan Pote First published: September 2018 Production reference: 2290719 Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK. ISBN 978-1-78961-299-8 www.packtpub.com

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why Subscribe? Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals Improve your learning with Skill Plans built especially for you Get a free eBook or video every month Mapt is fully searchable Copy and paste, print, and bookmark content

Packt.com Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors About the Author Kaelen Medeiros is a content quality developer at DataCamp, where she works to improve course content and tracks quality metrics across the company. She also works as a data scientist/developer for HealthLabs, who develop automated methods for analyzing large amounts of medical data. She received her MS in biostatistics from Louisiana State University Health Sciences Center in 2016. Outside of work, she has one cat, listens to way too many podcasts, and enjoys running.

Packt is Searching for Authors Like You If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents Preface

1

Chapter 1: Introduction to R Using R and RStudio, and Installing Useful Packages Using R and RStudio

Executing Basic Functions in the R Console Setting up a New Project

Installing Packages Activity: Installing the Tidyverse Packages

Variable Types and Data Structures Variable Types

Numeric and Integers Character Dates

Activity: Identifying Variable Classes and Types Data Structures Vectors Lists Matrices Dataframes

Activity: Creating Vectors, Lists, Matrices, and Dataframes

Basic Flow Control

If/else For loop While loop Activity: Building Basic Loops

Data Import and Export

Excel Spreadsheets Activity: Exporting and Importing the mtcars Dataset

Getting Help with R

Package Documentation and Vignettes Activity: Exploring the Introduction to dplyr Vignette RStudio Community, Stack Overflow, and the Rest of the Web

Summary Chapter 2: Data Visualization and Graphics Creating Base Plots The plot() Function Factor Variables Model Objects

Plotting More Than One Plot at a Time Creating and Plotting a Linear Model Object

5 5 6 7 11 12 14 15 15 16 17 18 20 20 21 23 24 25 26 27 28 29 33 36 37 44 46 46 47 51 52 53 54 55 55 58 61 61 62

Table of Contents

Titles and Axis Labels

Changing the Color of Base Plots

Activity: Recreating Plots with Base Plot Methods

ggplot2

ggplot2 Basics Histogram

Creating Histograms using ggplot2

Bar Chart

Creating a Bar Chart with ggplot2 using Two Different Methods

Scatterplot

Creating a Scatterplot of Two Continuous Variables

Boxplot

Creating Boxplots using ggplot2

Activity: Recreating Plots Using ggplot2 Digging into aes() Bar Chart

Using Different Bar Chart Aesthetic Options

Facet Wrapping and Gridding

Utilizing Facet Wrapping and Gridding to Visualize Data Effectively

Boxplot + coord_flip() Scatterplot

Utilizing Different Aesthetics for Scatterplots, Including Shapes, Colors, and Transparencies

Activity: Utilizing ggplot2 Aesthetics Extending the Plots with Titles, Axis Labels, and Themes

Interactive Plots

Plotly Shiny Exploring Shiny and Plotly

Summary Chapter 3: Data Management Factor Variables

Creating Factor Variables in a Dataset Using ifelse() Statements Using the recode() Function

Examining and Changing the Levels of Pre-existing Factor Variables Creating an Ordered Factor Variable Activity: Creating and Manipulating Factor Variables

Summarizing Data

Data Summarization Tables Tables in R

Creating Different Tables Using the table() Function Using dplyr Methods to Create Data Summary Tables

Activity: Creating Data Summarization Tables Summarizing Data with the Apply Family

Using the apply() Function to Create Numeric Data Summaries

Activity: Implementing Data Summary

Splitting, Combining, Merging, and Joining Datasets [ ii ]

64 67 68 71 72 77 80 83 85 87 89 91 93 94 98 101 103 105 108 109 111

113 116 120 127 128 129 129 130 131 132 139 142 143 143 145 145 146 146 148 148 151 152 154 156 157 158

Table of Contents

Splitting and Combining Data and Datasets

Splitting and Unsplitting Data with Base R and the dplyr Methods Splitting Datasets into Lists and Then Back Again Combining Data Combining Data with rbind() Combining Matrices of Objects into Dataframes Splitting Strings Using stringr Package to Manipulate a Vector of Names Combining Strings Using Base R Methods

Activity: Demonstrating Splitting and Combining Data Merging and Joining Data Demonstrating Merges and Joins in R

Activity: Merging and Joining Data

Summary Solutions

158 158 158 161 161 163 164 164 166 167 169 170 173 174 175

Other Books You May Enjoy

189

Index

192

[ iii ]

Preface Demand for data scientists is growing exponentially and demand in the US is expected to increase by 28 percent by the year 2020, with this trend reflected across the world. R is a tool often used by data scientists to clean, examine, analyze, and report on data. It is a great starting point for those familiar with analysis in Excel or MS SQL and is an excellent place to begin to learn programming fundamentals. This book begins by addressing the setup of R and RStudio on the machine and progresses from there, demonstrating how to import datasets, clean them, and explore their contents. It balances theory and exercises, and contains multiple open-ended activities that use reallife business scenarios for you to practice and apply your newly acquired skills in a highly relevant context. We have included over 50 practical activities and exercises across 11 topics, along with a mini project that will allow you to begin your data science project portfolio. With this book, we have created a definitive guide to beginning data science in R.

Who This Book is for This book is for analysts who are looking to grow their data science skills beyond the tools they have used before, such as MS Excel and other statistical tools.

What This Book Covers Chapter 1, Introduction to R, deals with installation of R, RStudio, and other useful

packages, and talks about variable types and data structures. The chapter then introduces the different kinds of loops that can be used in R, explains how to import and export data, and also talks about getting help with R programming.

Chapter 2, Data Visualization and Graphics, covers the basic plots built into R and how to create them, and then introduces ggplot, a popular graphics package in R. Finally, the chapter briefly talks about two tools, Shiny and Plotly, that can be used to design interactive plots. Chapter 3, Data Management, discusses how to create and manipulate factor variables, examine data using tables, apply the family of functions to generate summaries, and split, combine, merge, or join datasets in R.

The Appendix contains the solutions to all the activities within the chapters.

Preface

To Get the Most Out of This Book You will require a computer system with at least an i3 processor, 2 GB RAM, 10 GB of storage space, and an internet connection. Along with this, you would require the following software: 1. Operating System: Windows 8 64-bit 2. R and Rstudio 3. Browsers (Google Chrome and Mozilla Firefox - latest versions)

Download the Example Code Files You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you. You can download the code files by following these steps: 1. 2. 3. 4.

Log in or register at www.packt.com. Select the SUPPORT tab. Click on Code Downloads & Errata. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of: WinRAR/7-Zip for Windows Zipeg/iZip/UnRarX for Mac 7-Zip/PeaZip for Linux The code bundle for the book is also hosted on GitHub at https:/​/​github.​com/ TrainingByPackt/​R-​Programming-​Fundamentals. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https:/​/​github.​com/​PacktPublishing/​. Check them out!

[2]

Preface

Conventions Used There are a number of text conventions used throughout this book. CodeInText: Indicates code words in text, database table names, folder names, filenames,

file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system." A block of code is set as follows: html, body, #map { height: 100%; margin: 0; padding: 0 }

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold: [default] exten => s,1,Dial(Zap/1|30) exten => s,2,Voicemail(u100) exten => s,102,Voicemail(b100) exten => i,1,Voicemail(s0)

Any command-line input or output is written as follows: $ mkdir css $ cd css

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel." Warnings or important notes appear like this.

Tips and tricks appear like this.

[3]

Preface

Get in Touch Feedback from our readers is always welcome. General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected]. Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details. Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material. If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you! For more information about Packt, please visit packt.com. All the solutions to the activities are present in the Appendix section.

[4]

1 Introduction to R One tool that statisticians—and now data scientists as well—often use for data cleaning, analysis, and reporting is the R programming language. In this chapter, we'll begin by looking at the basics of using R as a programming language and as a statistical analysis tool, and we'll also install a few useful R packages that we will continue to use throughout the book. By the end of this chapter, you will be able to: Install R packages for use throughout the book Use R as a calculator for basic arithmetic Utilize different data structures Control program flow by writing if-else, for, and while loops Import and export data to and from CSV, Excel, and SQL

Using R and RStudio, and Installing Useful Packages R is a programming language intended for use for statistical analysis. Additionally, it can be utilized in an object-oriented or functional way. Specifically, it is an implementation of S, an interactive statistical programming language. R was initially released in August 1993. It is maintained today by the R Development Core Team. RStudio is an incredibly useful Integrated Development Environment (IDE) for writing and using R. Many data scientists use RStudio for writing R, as it provides a console window, a code editor, tools to help create plots and graphics, and can even be integrated with GitHub to support version control.

Introduction to R

Chapter 1

While R does share some functionality with Microsoft Excel, it allows you to have more control over your data, and you can add on a variety of packages that allow statistical functionality out of the box—you won't have to build a formula to conduct a survival analysis; you can just install the survival package and use that! Ensure that you have R and RStudio installed on your system. RStudio will not work if you don't have R installed on your machine; they must be installed separately. Once they are both installed, you can open RStudio and use that without having an R window open.

Using R and RStudio Out of the box, R is completely usable. Open R on your machine. Let's use R for some basic arithmetic such as addition, multiplication, subtraction, and division. The following screenshot demonstrates this:

It also provides functions such as sum() and sqrt() for addition and calculation of the square root. The following screenshot shows this in action:

[6]

Introduction to R

Chapter 1

R can—and will—do basic arithmetic like a calculator, using symbols you're familiar with. One you may not have used before is exponentiate, where you use two asterisks, for example, 4 ** 2, which you can read as 4 to the power of 2. Once you want to start doing math beyond basic arithmetic, such as finding square roots or summing many numbers, you have to start using functions.

Executing Basic Functions in the R Console Let's now try and execute the sum() and sqrt() functions in R. Follow the steps given below: 1. Open the R console on your system. 2. Type the code as follows: sum(1, 2, 3, 4, 5) sqrt(144)

3. Execute the code. Output: The preceding code provides the following output:

Functions such as sum() and sqrt() are called base functions, which are built into R. This means that they come pre-installed when R is downloaded. We could build all of our code right in the R console, but eventually, we might want to see our files, a list of everything in our global environment, and more, so instead, we'll use RStudio. Close R, and when it asks you to save the workspace image, for now, click Don't Save. (More explanation on workspace images and saving will come later in this chapter.)

[7]

Introduction to R

Chapter 1

Open RStudio. We'll use RStudio for all of our code development from here on. One major benefit of RStudio is that we can use Projects, which will organize all of the files for analysis in one folder on our computer automatically. Projects will keep all of the parts of your analysis organized in one space in a chosen folder on your machine. Any time you start a new project, you should start a new project in RStudio by going to File | New Project, as shown in the below screenshot, or by clicking the new project button (blue, with a green plus sign

).

Creating a project from a New Directory allows us to create a folder on our drive (here

E:\) to store all code files, data, and anything else associated with the book. If there was an

existing folder on our drive that we'd like to make the directory for the project, we would choose the Existing Directory option. The Version Control option allows you to clone a repository from GitHub or another version control site. It makes a copy of the project stored on GitHub and saves it on your computer:

[8]

Introduction to R

Chapter 1

The working directory in R is the folder where all of the code files and output will be saved. It should be the same as the folder you choose when you create a project from a new or existing directory. To verify the working directory at any time, use the getwd() function. It will print the working directory as a character string (you can tell because it has quotation marks around it). The working directory can be changed at any time by using the following syntax: setwd("new location/on the/computer")

[9]

Introduction to R

Chapter 1

To create a new script in R, navigate to File | New File | R Script, as shown in the screenshot below, or click the button on the top left that looks like a piece of paper with a green arrow on it

.

Inside New File, there are options to create quite a few different things that you might use in R. We'll be creating R scripts throughout this book. Custom functions are fairly straightforward to create in R. Generally, they should be created using the following syntax: name_of_function