Hello and welcome to my notes for the second week of The Data Scientist’s Toolbox course from Coursera. This course is a part of the Johns Hopkins Data Science Specialization on Coursera, which I am taking.
If you would like to read the first week’s notes, here is the link to it.
Again, disclaimers must be made for these types of posts, so here they are:
Number One: I am not an expert on any subject I present in my blog. I am writing these posts as I do the course, so anything I list down here is notes that worked for me. They may not be the best way to do or understand ideas, nor will they always be right. I do not have any university or professional experience backing me; this is just what’s working for me.
Number Two: For these notes, I am doing one blog post for each week. I will basically summarize what the teachers and professors say in a different way that may or may not be easier to understand and depending on the week and things being done, show my solutions and explanations for what I did.
Number Three: These posts are relevant as I write and take the course and maybe for some time after as well. Online courses like these do change quite frequently, and it is quite difficult for me to update these posts (Spending the money on the course again, retaking the course, etc. are all problems). However, if enough people find these posts useful and want an update, I will try my best to update and help everyone out.
Number Four: Although I just put down three disclaimers looking like I’m trying to protect myself from criticism, I am open to comments and thoughts on these posts. Comments on how my code should have been done, how I should explain things, etc. will help me and everyone reading as well.
R is both a programming language and environment focused mainly on statistical analysis and graphics. It is downloaded from the Comprehensive R Archive Network, or CRAN, which is also the place where you will download many R packages.
Before we get into installing R, here are some reasons why you should use the R programming language.
- Popularity- R is one of the top five programming languages asked for in data scientist job postings, which makes it a skill highly valued by employers. Because it is also one of the most popular programming languages, new functionality gets developed quickly, it keeps getting more powerful, and support is easy to find.
- Cost- The R programming language is completely free! That means you don’t have any cost barrier to get started.
- Extensive functionality- Although R is mainly known for its statistical analysis and graphics, there are many other things you can do with R, from making websites and making maps using GIS data to analyzing language. There is often a package out there for whatever you want to do with R.
- Community- As mentioned in the popularity reason, R has a large community developed around it. It’s easy to get help, there are lots of packages continually being designed, and pages and pages dedicated to R on different forums.
Let’s get started with actually installing R.
For both Mac and Windows, the download links are at the CRAN homepage: https://cran.r-project.org/
Download Steps For Windows:
- Head over to this link- Download R for Windows
- Go to the base distribution
- Click on the link at the top of the page that should say something like “Download R [version number] for windows.”
- Open the executable file that should get downloaded after clicking the link and run it
- Select the language you prefer and agree to license information
- Next, it will prompt you for a destination location. It will likely default to Program Files, and it is usually fine unless you have any issues with it.
- Then it will prompt you to select which components should be installed. Installing all of the components is fine.
- Next, it will ask you about startup options, and again, default options should be fine.
- When it asks you if you want shortcuts placed, it is entirely up to you. However, do not change anything under registry entries.
- After that, the installation should start. Once it finishes, test it by opening R!
Download Steps For Mac:
- Head over to this link- Download R for (Mac) OS X
- Click the most recent version of R and a .pkg file should be downloaded.
- Open the .pkg file and follow the prompts provided by the installer.
- Click continue on the welcome page and again on the important information window page.
- Accept the software license agreement and then continue.
- Then it will ask you to select a destination for R in which the default is usually fine unless you would like to choose something else.
- Once the installation is completed, find R and open it up for the first time!
Installing RStudio
After installing R, we can technically start running everything in it. However, there are other ways to interface with R, which is why we are looking at RStudio. RStudio’s graphical interface makes it a lot easier for us to manage R as beginners. There are also many other functions of RStudio that will be explored later.
But let’s get into installing RStudio first.
Download Steps For Windows:
- Go to the R Studio download page- R Studio Download Page
- Click on the RStudio Desktop Open Source License download button.
- Select the RStudio installer depending on which Windows edition you are on (Vista, 7, 8, 10) and download
- Open the executable file when the download is complete and allow it to make changes to your computer if presented with a security warning.
- On the welcome screen, click next unless you want to change the installation path of RStudio.
- Click next on the page you are brought to.
- Choose whether you want to create shortcuts or not and click “Install”
- Once the installation is complete, click finish and try opening RStudio!
Download Steps For Mac:
- Start in the same way as the Windows steps by going to the R Studio download page- R Studio Download Page
- Click on the RStudio Desktop Open Source License download button.
- This time, select the Mac OS X RStudio installer (Mac OS X 10.6+ (64-bit)).
- Once the download is complete, click on the downloaded file and it will start to install.
- When the installation is finished, the applications window will open.
- Drag the RStudio icon into the Applications directory and test your installation by opening RStudio.
Explore the RStudio interface for a little bit!
RStudio Tour
Now that RStudio is installed, here’s a brief tour of the interface.
When you open RStudio, you should see a screen roughly divided into four quadrants, each with its own specific functions, plus the main menu bar across the top.
You may be missing the upper left quadrant and instead, have the left side of the screen with just the Console region. In this case, go to File, then click New File, then click R Script and it should look more like the image above. All of these quadrants’ sizes can be changed by dragging and dropping the divider between the quadrants.
Here’s a picture highlighting each significant section.
Let’s start with the menu bar first.
The menu bar runs across the top of the screen and should have two rows. The first row should be reasonably standard with things like “File” and “Edit.” Below that, there is a row of icons that are shortcuts for functions you will frequently use such as New File, New Project, Open File, Save, Save As, etc.
In the file option on the menu, things that we commonly use include New File, Open File, New Project, Save, and others. While hovering over the new file option, you will also see various file formats that can be opened, such as R notebooks, web apps, websites, and others. The most common file types used, however, are R Script and R Markdown.
Another useful option on the menu is Session. Session contains functions such as terminating R and restarting R, which are very useful if you run code that doesn’t stop or produces an unintended bug.
Let’s get into the console now.
The console should be familiar to you. When you opened R by itself earlier, what you saw was the console. Here you can type in commands and code that you can run to do data science things.
Any dataframe or matrix you create on the console will appear in the environment quadrant. RStudio will also tell you some information that can be important about the object in the environment, such as whether it is a list or dataframe you created, or what type of data it contains. This is useful for later when we use functions that only work with certain classes of data.
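As a small sketch of what ends up in the environment quadrant, the snippet below creates a data frame and a list and asks R what kind of object each one is. The object names and values (scores, settings) are made up purely for illustration.

```r
# Hypothetical example objects; the names and values are made up.
scores <- data.frame(name = c("Ann", "Bob"), score = c(90, 85))
settings <- list(verbose = TRUE, max_rows = 10)

# class() reports what kind of object each one is.
class(scores)    # "data.frame"
class(settings)  # "list"

# str() prints a compact summary of an object's structure,
# similar to what the environment quadrant displays.
str(scores)

# ls() lists the names of the objects in the current environment.
ls()
```

After running this in the console, both objects should show up in the environment quadrant.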
The environment quadrant also has two other tabs. One is the history tab, and the other is connections.
With the history tab, you will see commands that you have run in your session of R. If you select any command, you can click “To Console” or “To Source,” which will send the command to the console so you can rerun it or copy it into the source editor, respectively.
Let’s get into the source/script editor panel next.
The source/script editor panel is likely where you will spend most of your time in RStudio. This is where you can write and store R commands that you want to save for later for record-keeping or rerunning.
The final panel we will look at is the Files/help/plots/packages panel in the bottom right of the RStudio window. Five tabs run across the top: Files, Plots, Packages, Help, and Viewer.
In files, you can see all of the files in your current working directory and also change it if you should desire to.
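The same things can be done from the console. Here is a minimal sketch; tempdir() is only used as a stand-in folder so the example runs anywhere, and you would substitute a folder of your own.

```r
# Print the current working directory.
getwd()

# List the files in it (what the Files tab shows for that folder).
list.files()

# setwd() changes the working directory and returns the previous one.
# tempdir() is a stand-in here; use your own folder instead.
old_dir <- setwd(tempdir())
getwd()

# Switch back to where we started.
setwd(old_dir)
```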
The plots tab will show any plots that you generate with your code. The zoom function will open the plot in a new window, and the export function allows you to save the plot as an image or PDF.
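Plots can also be saved from code instead of the export button. This is just a sketch using the cars dataset that ships with R; the file name cars_plot.pdf and the temporary folder are my own choices for the example.

```r
# Hypothetical output path inside a temporary folder.
plot_file <- file.path(tempdir(), "cars_plot.pdf")

# Open a PDF device, draw the plot into it, then close the device.
# (Running plot() on its own in RStudio would show it in the Plots tab.)
pdf(plot_file)
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
dev.off()

# Check that the file was written.
file.exists(plot_file)
```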
The packages tab is where you can see the packages you have installed, load and unload these packages, and update them.
And finally, the help tab is where you can find documentation for your R packages and various functions and see more things that could help you with your projects. There is a handy search bar you can use to find answers to many of your questions.
One of the things that make R such a versatile programming language is its packages. Base R, or everything included in R when you download it, has rather basic functionality for statistics and plotting, which can be limiting at times. To expand R’s functionality, people have developed packages we can use and deposited them in repositories.
A package is a collection of functions, data, and code conveniently provided in a friendly, complete format for you. At the time the course was created/updated, there were just over 14,300 packages available to download.
To find and download these packages, we must look for and download them in repositories.
Here are the three significant repositories for R:
- Comprehensive R Archive Network (CRAN): R’s central repository with over 12,100 packages available
- BioConductor: A repository mainly focused on bioinformatic packages
- GitHub: A very popular host for open-source projects, although not R specific
Because there are so many packages out there, it can often be challenging to find what we are looking for. Other than just using what people have told you to use, there are a few methods to find the package that you want.
In CRAN, all the packages are grouped by their functionality/topic into 35 “themes.” It calls this its “Task view,” which allows you to narrow the packages you can look through to a topic relevant to your interests.
You can also look at a website called RDocumentation. This is a search engine for packages and functions from all three major repositories: CRAN, BioConductor, and GitHub. It also has a “task view” like CRAN.
One of the most common ways to find packages that you might want, however, is by searching Google. There are often forums, blogs, or other tutorials with people already doing what you are doing and thus, having the packages you need as well.
Once you have found your packages, installing them is your next step. Well actually, before you install, you should check your version of R and see if the package requires a particular version of R to run.
To check your version of R, you can see it when you first open R or RStudio in the console. You can also type in version into the console, and it will output the information on the R version you are running. Another helpful command is sessionInfo() which will tell you what version of R you are running along with a listing of all the packages you have loaded.
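Here is a quick sketch of those checks. The "3.5.0" in the last line is a made-up minimum version, standing in for whatever a package's documentation says it needs.

```r
# Full version details for the R you are running.
version

# Just the version as a single string.
R.version.string

# Version, platform, and the packages currently loaded.
sessionInfo()

# Compare your version against a package's minimum requirement.
# "3.5.0" is a hypothetical requirement used only for illustration.
getRversion() >= "3.5.0"
```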
The sessionInfo() output is a great detail to include when posting a question to forums. Let’s get back into installing packages though.
If you are installing from CRAN, you can use the install.packages() function, with the name of the package you want to install in quotes between the parentheses.
For multiple installations, pass in a character vector of package names like this: install.packages(c("package1", "package2")).
You can also use the graphical interface by clicking Tools, then Install Packages…, then choosing to install from CRAN, then typing in the packages you want, and clicking Install.
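As a rough sketch of the CRAN route, the snippet below only installs what is missing. The helper missing_packages() and the wish list are my own inventions for this example, not something from the course, and the install step is wrapped in interactive() so nothing gets downloaded if the script runs unattended.

```r
# Hypothetical helper: return the packages from `pkgs` that are
# not installed on this machine yet.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Example wish list; swap in the packages you actually need.
wanted <- c("ggplot2", "dplyr")
to_install <- missing_packages(wanted)

# install.packages() needs an internet connection, so it is only
# run here in an interactive session.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```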
With Bioconductor, you first need to run source("https://bioconductor.org/biocLite.R") to get the functions required to install Bioconductor packages. This lets you use the biocLite() command, with the name of the package you want to install in quotes between the parentheses. Note that this was the method at the time of the course; newer Bioconductor releases have replaced biocLite() with the BiocManager package.
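On a recent version of R, the Bioconductor install looks a bit different because it goes through the BiocManager package. Below is a minimal sketch under that assumption; the wrapper name install_bioc() is my own, and "GenomicRanges" in the comment is just an example of a Bioconductor package.

```r
# Assumes a recent Bioconductor release, where the BiocManager
# package is the supported way to install.
# install_bioc() is a hypothetical wrapper written for this post.
install_bioc <- function(pkg) {
  # Get BiocManager from CRAN if it is not installed yet.
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(pkg)
}

# Example call (needs an internet connection):
# install_bioc("GenomicRanges")
```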
With GitHub, you must find the package and take note of both the name and the author of the package. Then follow these steps to install packages:
- install.packages("devtools") – only run this if you don’t already have devtools installed.
- library(devtools) – load devtools so that its functions are available.
- install_github("author/package") – replace author and package with their GitHub username and the name of the package
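The GitHub steps can be wrapped up as in the sketch below. The function install_from_github() is a made-up helper for this post, and "author/package" is a placeholder you would replace with a real username and package name.

```r
# Hypothetical helper that bundles the steps for a GitHub install.
install_from_github <- function(repo) {
  # Expect the form "author/package".
  if (!grepl("^[^/]+/[^/]+$", repo)) {
    stop("repo should look like \"author/package\"")
  }
  # Install devtools only if it is not already there.
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  # The devtools:: prefix calls the function without library(devtools).
  devtools::install_github(repo)
}

# Example call (needs an internet connection):
# install_from_github("author/package")
```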
Once you have installed the packages you want, you must load them to be able to use them. To do this, use the library() function with the package name between the parentheses. Unlike with install.packages(), the package name does not need quotes here!
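Here is a small sketch of loading a package. I am using splines only because it ships with R, so nothing needs to be installed first.

```r
# Load a package; splines comes with R.
library(splines)

# search() lists what is currently attached, so we can confirm it loaded.
"package:splines" %in% search()   # TRUE

# A single function can also be called without loading its package
# by using the package::function form.
stats::median(c(1, 5, 9))   # 5
```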
Sometimes you may encounter something called a dependency. This happens when a package requires other packages to be installed or loaded first, and you should check out that package’s manual/help pages to find out what it needs.
With the graphical interface from RStudio, you can also make loading in a package effortless. Go to the package tab in the lower right quadrant and just check the checkbox beside the package name that you want.
Over time, you might need to update, remove, or unload packages.
To check what packages need an update, you can use the function old.packages(). This will bring up all packages that have been updated since you installed them/last updated them.
To update all packages, use the code update.packages(). To update a specific package, use install.packages("packagename"), replacing packagename with what your package’s name is.
In the RStudio interface, you can click update in the packages tab, which gives you the option to update all packages or select specific ones.
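A rough sketch of the update workflow from the console is below. The parts that talk to CRAN are wrapped in interactive() so they only run when you are at the console with an internet connection.

```r
# Versions of what is installed locally (no internet needed).
installed <- installed.packages()[, c("Package", "Version")]
head(installed)

# Version of one specific package.
packageVersion("stats")

# These two contact the repository, so they are only run interactively.
if (interactive()) {
  old.packages()                 # packages with a newer version available
  update.packages(ask = FALSE)   # update everything without prompting
}
```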
It is essential to check in on your packages regularly to see whether any have fallen out of date. Keep in mind, though, that an update can sometimes change how certain functions work, so if you rerun some old code, a command may behave differently or perhaps even be gone outright, in which case you will need to update your code too!
Sometimes you want to unload a package in the middle of a script. A reason for this is a package you loaded in may not play nicely with another package you want to use.
To unload a package, you can use the detach() function. In the RStudio interface in the packages tab, you can also unload a package by unchecking the box beside the package name.
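As a quick sketch, here is a package being loaded and then unloaded again. splines is used only because it comes with R.

```r
# Load a package that ships with R.
library(splines)
"package:splines" %in% search()   # TRUE

# detach() takes the name in the form "package:name".
detach("package:splines")
"package:splines" %in% search()   # FALSE
```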
At times, you may also want to remove packages. To do this, simply use the function remove.packages(). You can also click the “X” at the end of a package’s row in the packages tab to uninstall packages.
Of course, you will also need to know how to use a package’s functions!
To check what functions are included within a package, use the help() function, for example help(package = "packagename"). This will give access to a package’s help files that you can read through. Many packages also include “vignettes,” which are basically extended help files that include not only an overview of the package and its functions but also detailed examples of how to use the functions. You can look at the vignettes in a package with the browseVignettes() function.
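Here is a minimal sketch of exploring a package from the console. It uses stats and grid because both ship with R; the calls that open help pages or a browser are wrapped in interactive() since they are meant for use at the console.

```r
# Names of the functions and datasets a loaded package provides.
stats_functions <- ls("package:stats")
head(stats_functions)

# Titles of the vignettes that come with a package.
grid_vignettes <- vignette(package = "grid")$results[, "Title"]
head(grid_vignettes)

# These open help pages or a browser window.
if (interactive()) {
  help(package = "stats")   # index of the package's help files
  help("median")            # help page for one function
  browseVignettes("grid")   # vignettes in the browser
}
```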
Projects In R
If you become a data scientist or will frequently use R in the future, R Projects are great to keep all your related files together.
When you make a project in R, it will create a folder where all files will be kept, which helps you stay organized and keep multiple projects separate from each other. When you re-open a project, RStudio remembers what files were open and will restore the work environment, much like waking your computer from sleep mode. This is just another one of the benefits RStudio provides.
Projects also make sharing much easier. Since everything is in the same place, you can share access to the folders/files directly or associate the project with version control software.
There are three ways to create a project:
- From scratch which will create a new directory for all your files to go in
- From an existing folder which will link an existing directory with RStudio
- From version control which will “clone” an existing project onto your computer
Creating a project from scratch is what you will often be doing. Here are the steps to do so:
- Open RStudio
- Under File, select “New Project”
- Select “New Directory”
- Select “New Project”
- Pick a name for your project
- Select where you want the folder to be created
- Click “Create Project”
You should see a new blank RStudio interface when the new project is created and opened.
To open the project, you can simply double click the .Rproj file on your computer. You could also go to RStudio then to File then to Open Project and then select the one you want to open.
Closing the project is also quite simple. You could simply close your RStudio window or go to File and click Close Project.
Sometimes, we will also want to switch between Projects. In the Projects toolbar, click on the drop-down menu and choose “Open Project” and find the new Project you wish to open. This will save the current Project and close it, then it will open the new Project in the same window.
If you want multiple projects open at the same time, do the same but select “Open Project in New Session” instead.
There are also a few “best” practices you might want to follow. Of course, everyone has their own style of doing projects, but generally, file structures are set up around having a directory containing the raw data, a directory containing the scripts/R files, and a directory for the output of your code. It can be useful to create these folders at the start.
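Those folders can be created from R in one go. This is only a sketch: the project name my_project and the folder names data, scripts, and output are my own picks, and tempdir() is used so the example does not touch your real files.

```r
# Hypothetical project location; replace with your own project folder.
project_root <- file.path(tempdir(), "my_project")

# One folder each for raw data, scripts, and output.
folders <- c("data", "scripts", "output")

for (folder in folders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# Confirm the folders were created.
list.files(project_root)
```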
And that’s what I got for Week 2 of The Data Scientist’s Toolbox course from Coursera, which is a part of the Johns Hopkins University Data Science Specialization. Quite a long post, indeed. Hopefully, the next posts will be shorter… Although they’ll probably be longer haha…
Everything I learned and have written here comes from The Data Scientist’s Toolbox course from Coursera.
I’ll be taking the entire Johns Hopkins University Data Science Specialization from Coursera as well, so make sure you check that out too.
I hope you enjoyed today’s week of notes. If you have any questions or comments, I would love to hear them in the comment section down below. As always, have a great day!