Welcome to my notes for the third week of The Data Scientist’s Toolbox course from Coursera. This course is part of the Johns Hopkins University Data Science Specialization on Coursera, which I am taking.
If you would like to read my first or second week of notes, here they are:
Once again, disclaimers must be made for these types of posts, so here they are:
Number One: I am not an expert on any subject I present in my blog. I am writing these posts as I do the course, so anything I list here is simply the notes that worked for me. They may not be the best way to do or understand things, nor will they always be right. I do not have any university or professional experience backing me; this is just what’s working for me.
Number Two: For these notes, I am doing one blog post for each week. I will basically summarize what the teachers and professors say in a different way that may or may not be easier to understand and, depending on the week and what is being done, show my solutions and explanations for what I did.
Number Three: These posts are relevant as I write and take the course and maybe for some time after as well. Online courses like these do change quite frequently, and it is quite difficult for me to update these posts (spending the money on the course again, retaking the course, etc. are all problems). However, if enough people find these posts useful and want an update, I will try my best to update and help everyone out.
Number Four: Although I just put down three disclaimers looking like I’m trying to protect myself from criticism, I am open to comments and thoughts on these posts. Comments on how my code should have been done, how I should explain things, etc. will help me and everyone reading as well.
This week is focused on version control with GitHub and Git. We will learn a few things like why we should use version control and how we can actually implement version control into our projects.
Let’s get started with a basic understanding of version control first.
What is version control?
Version control is a system that records changes that are made to a file or a set of files over time. As you make edits, the version control system takes snapshots of your files and the changes and then saves those snapshots so you can refer or revert back to previous versions later if you need to.
It’s basically like saving a document in Word, making a few changes, and then saving the doc as a new file, except that Git makes this process much more efficient.
Apart from being able to record all previous versions and changes between each version efficiently, using version control makes it much easier to work with other people.
Version control software keeps track of who made each change, when, and why, which lets you see exactly what happened among everyone working on the project.
The version control software that will be used in this course is Git. It is a free and open-source version control system and has become the most commonly used version control system.
To use Git effectively, we also have an online interface for it called GitHub. Git is software used locally on your computer to record changes, while GitHub is a host for your files and the records of the changes made. It is a lot like Dropbox.
There is also some vocabulary we should know when we work with Git. Here are some words we should understand (definitions and explanations all from the course):
Repository: Equivalent to the project’s folder/directory – all of your version-controlled files (and the recorded changes) are located in a repository. This is often shortened to the repo. Repositories are what are hosted on GitHub and through this interface you can either keep your repositories private and share them with select collaborators, or you can make them public – anybody can see your files and their history.
Commit: To commit is to save your edits and the changes made. A commit is like a snapshot of your files: Git compares the previous version of all of your files in the repo to the current version and identifies those that have changed since then. For files that have not changed, it keeps the previously stored version untouched. For files that have changed, it compares the versions, logs the changes, and stores the new version of your file. We’ll touch on this in the next section, but when you commit a file, you typically accompany that change with a little note about what you changed and why.
When we talk about version control systems, commits are at the heart of them. If you find a mistake, you revert your files to a previous commit. If you want to see what has changed in a file over time, you compare the commits and look at the messages to see why and who.
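To make the idea of looking back through commits a bit more concrete, here is a small sketch of my own (not from the course). It builds a throwaway repository in a temporary folder, makes a good commit and a bad one, then inspects and undoes the bad one. The user name, email, file name, and commit messages are all placeholders I made up.

```shell
# Throwaway repository; the name, email, and file below are placeholders.
repo="$(mktemp -d)"
git -C "$repo" init -q
git -C "$repo" config user.name "Example User"
git -C "$repo" config user.email "user@example.com"

# First commit: a working version of the file.
echo "x <- 1" > "$repo/analysis.R"
git -C "$repo" add analysis.R
git -C "$repo" commit -q -m "Add analysis script"

# Second commit: introduces a mistake.
echo "x <- 'oops'" > "$repo/analysis.R"
git -C "$repo" add analysis.R
git -C "$repo" commit -q -m "Change x (this turns out to be a mistake)"

# Who changed what, when, and why.
git -C "$repo" log --format="%h %an %ad %s" --date=short

# What exactly changed between the last two commits.
git -C "$repo" diff HEAD~1 HEAD

# Undo the mistake with a new commit that reverses the last one.
git -C "$repo" revert --no-edit HEAD
cat "$repo/analysis.R"
```

Note that git revert does not erase history. It adds a third commit that cancels the second, so the mistake and its fix both stay on record.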
Push: Updating the repository with your edits. Since Git involves making changes locally, you need to be able to share your changes with the common, online repository. Pushing is sending those committed changes to that repository, so now everybody has access to your edits.
Pull: Updating your local version of the repository to the current version, since others may have edited in the meantime. Because the shared repository is hosted online and any of your collaborators (or even yourself on a different computer!) could have made changes to the files and then pushed them to the shared repository, you are behind the times! The files you have locally on your computer may be outdated, so you pull to bring your local copy up to date with the main repository.
Staging: The act of preparing a file for a commit. For example, if since your last commit you have edited three files for completely different reasons, you don’t want to commit all of the changes in one go; your message on why you are making the commit and what has changed will be complicated since three files have been changed for different reasons. So instead, you can stage just one of the files and prepare it for committing. Once you’ve committed that file, you can stage the second file and commit it. And so on. Staging allows you to separate out file changes into separate commits. Very helpful!
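Here is a rough sketch of what staging files one at a time can look like on the command line. It is my own example rather than the course's, and the file names, identity, and commit messages are made up for illustration.

```shell
# Throwaway repository; name, email, and files are placeholders.
repo="$(mktemp -d)"
git -C "$repo" init -q
git -C "$repo" config user.name "Example User"
git -C "$repo" config user.email "user@example.com"

# Two unrelated edits made since the last commit.
echo "Project notes" > "$repo/README.md"
echo "print('cleaning data')" > "$repo/clean.R"

# Stage and commit the first file on its own...
git -C "$repo" add README.md
git -C "$repo" commit -q -m "Add README describing the project"

# ...then stage and commit the second.
git -C "$repo" add clean.R
git -C "$repo" commit -q -m "Add data cleaning script"

# Two commits, each with its own focused message.
git -C "$repo" log --oneline
```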
To summarize these commonly used terms so far and to test whether you’ve got the hang of this, files are hosted in a repository that is shared online with collaborators. You pull the repository’s contents so that you have a local copy of the files that you can edit. Once you are happy with your changes to a file, you stage the file and then commit it. You push this commit to the shared repository. This uploads your new file and all of the changes and is accompanied by a message explaining what changed, why and by whom.
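The whole cycle can be tried out without GitHub at all. In the sketch below (my own, under a few assumptions), a bare repository in a temporary folder stands in for the shared online repository, and two folders named alice and bob stand in for two collaborators. All names, emails, and file contents are hypothetical.

```shell
work="$(mktemp -d)"

# A bare repository plays the role of the shared online repository.
git init -q --bare "$work/shared.git"

# The first collaborator clones it, commits, and pushes.
git clone -q "$work/shared.git" "$work/alice"
git -C "$work/alice" config user.name "Alice Example"
git -C "$work/alice" config user.email "alice@example.com"
echo "first draft" > "$work/alice/notes.txt"
git -C "$work/alice" add notes.txt
git -C "$work/alice" commit -q -m "Add first draft of notes"
git -C "$work/alice" push -q -u origin HEAD

# The second collaborator clones and receives that commit.
git clone -q "$work/shared.git" "$work/bob"

# Alice edits again and pushes, so Bob's copy is now behind.
echo "second draft" > "$work/alice/notes.txt"
git -C "$work/alice" commit -q -am "Revise notes"
git -C "$work/alice" push -q

# Pulling brings Bob's local copy up to date.
git -C "$work/bob" pull -q --ff-only
cat "$work/bob/notes.txt"
```

With a real GitHub repository the commands are the same; only the path to shared.git is replaced by the repository's URL.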
Branch: When the same file has two simultaneous copies. When you are working locally and editing a file, you have created a branch where your edits are not shared with the main repository (yet), so there are two versions of the file: the version that everybody has access to on the repository and your local edited version of the file. Until you push your changes and merge them back into the main repository, you are working on a branch. Following a branch point, the version history splits into two: one side tracks the changes made to the original file in the repository, which others may be editing, and the other tracks your changes on your branch. A merge later brings the two back together.
Merge: Independent edits of the same file are incorporated into a single, unified file. Independent edits are identified by Git and are brought together into a single file, with both sets of edits incorporated. But, you can see a potential problem here – if both people made an edit to the same sentence that precludes one of the edits from being possible, we have a problem! Git recognizes this disparity (conflict) and asks for user assistance in picking which edit to keep.
Conflict: When multiple people make changes to the same file and Git is unable to merge the edits. You are presented with the option to manually try and merge the edits or to keep one edit over the other.
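Branches, merges, and conflicts are easier to picture with a tiny experiment. This is a sketch I put together, not something from the course: it edits the same sentence on two branches so that Git is forced to report a conflict, then resolves it by keeping one side. The branch name rewrite, the file, and the identity are placeholders.

```shell
# Throwaway repository; name, email, and file are placeholders.
repo="$(mktemp -d)"
git -C "$repo" init -q
git -C "$repo" config user.name "Example User"
git -C "$repo" config user.email "user@example.com"

echo "The sky is blue." > "$repo/story.txt"
git -C "$repo" add story.txt
git -C "$repo" commit -q -m "Add opening sentence"

# Remember the starting branch (its name depends on your Git settings).
main_branch="$(git -C "$repo" symbolic-ref --short HEAD)"

# Edit the sentence on a new branch.
git -C "$repo" checkout -q -b rewrite
echo "The sky is grey." > "$repo/story.txt"
git -C "$repo" commit -q -am "Make the sky grey"

# Meanwhile, the same sentence changes differently on the original branch.
git -C "$repo" checkout -q "$main_branch"
echo "The sky is orange." > "$repo/story.txt"
git -C "$repo" commit -q -am "Make the sky orange"

# Git cannot combine the two edits, so the merge stops with a conflict.
if git -C "$repo" merge rewrite > /dev/null 2>&1; then
  merge_result="merged cleanly"
else
  merge_result="conflict"
  # Conflicted files are listed with the status code UU.
  git -C "$repo" status --short
  # Keep the version from the current branch, then finish the merge.
  git -C "$repo" checkout --ours story.txt
  git -C "$repo" add story.txt
  git -C "$repo" commit -q -m "Merge rewrite, keeping the orange sky"
fi
echo "$merge_result"
cat "$repo/story.txt"
```

Keeping one side wholesale is only one way out. You can also open the file, edit the marked-up section by hand, and then add and commit it.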
Clone: Making a copy of an existing Git repository. If you have just been brought on to a project that has been tracked with version control, you would clone the repository to get access to and create a local version of all of the repository’s files and all of the tracked changes.
Fork: A personal copy of a repository that you have taken from another person. If somebody is working on a cool project and you want to play around with it, you can fork their repository and then when you make changes, the edits are logged on your repository, not theirs.
Before we end this section off, there are some good habits you should establish when working with version control software like Git.
One of those habits is to make purposeful commits. Each commit should only address a single issue. This way if you need to identify when you changed a particular line of code, you only have to look at one place.
You should also write informative messages on why you made each commit. This will help anyone and yourself to identify the purpose of the change.
Finally, you should also frequently check that you are up to date with the current repo by pulling regularly. Additionally, you should push changes to the common repository once you have committed your files so you can share them with your collaborators.
GitHub and Git
As mentioned before, GitHub is a cloud-based management system for your version-controlled files. To get a GitHub account, you must sign up at their website: https://github.com/. You will be brought to their homepage where you should fill in your information, make a username, put in your email, choose a secure password, and click “Sign up for GitHub.”
Once you log into GitHub, there should be a different homepage awaiting you. Here are some handy sections of the interface you may use:
One of the things you will be using is your user settings. Click on the icon with an arrow beside it in the upper right corner and go to “Your profile.”
This is where you control your account and can view your contribution history and repositories. When you’re starting out, you shouldn’t have any repositories or contributions yet, so just edit your profile for now by clicking the “Edit profile” button on the left-hand side and then spend some time filling out your name, a little description in the “Bio” box, and maybe even a picture of yourself too. When you are done, click “Update profile.”
There are more things in the account settings that you can change or update to your liking. One thing to be careful about, though, is changing your username. At the start, it should be fine, but later on, when you have more repositories and contributions, changing your username might have unintended consequences.
Once you become more active on GitHub, you will start getting notifications of things like messages and conversations. To check those, click on the bell icon at the upper right-hand side.
Another great benefit of GitHub is its help system. Along the bottom of every page, there is a help button. If you ever need help, start by clicking that button and looking through the help files.
GitHub also has a mini tutorial to help you get started. Go through this guide now to create your first repository! When you’re done, it should look something like this:
Now take some time to explore around the repository. You can check out your commit history and see all of the changes that have been made to the repository, who made the change, when the change was made, etc.
After you have spent some time playing around with GitHub, it’s time to look into Git. To use Git with RStudio, we first need to download and install Git.
Installation Instructions For Windows:
- Go to https://git-scm.com/download
- Click the Windows download link.
- Open the .exe file and run the installation wizard.
- Follow the installation wizard by clicking next. Generally, the default option will be perfect.
- Then click “Install” and allow the wizard to finish.
- Then check the “Launch Git Bash” option, and unless you are curious, deselect the “View Release Notes” box, and click “Finish.”
- A command-line interface should open.
Installation Instructions For Mac:
- Go to https://git-scm.com/download
- Click the Mac download link.
- Open the DMG file.
- Double click on the .pkg file and an installation wizard will open.
- Click through the options, accepting the defaults.
- Click Install.
- Open up Git afterward, and a command-line interface should appear.
Now that Git is installed, we need to configure it for use with GitHub to prepare to link it with RStudio.
First, we need to tell Git what your username and email are. Go into the command prompt and type: git config --global user.name "username" with your desired username in place of username.
Following this, type: git config --global user.email email@example.com in the command prompt. Make sure you use the same email address you signed up for GitHub with!
After those two commands, you should be set for the next steps, but just to check, you can confirm your changes by typing: git config --list. In doing so, you should see the username and email you selected above. If you notice any problems or want to change these values, just retype the original config commands from earlier with the changes you’d like.
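If you would like to try the config commands without touching your real settings first, one option (my own suggestion, not part of the course) is to leave out the global flag inside a throwaway repository. The values are then stored for that one repository only. The name and email below are placeholders.

```shell
# Throwaway repository to experiment in.
repo="$(mktemp -d)"
git -C "$repo" init -q

# Without --global, the values go into this repository's .git/config only.
git -C "$repo" config user.name "Example User"
git -C "$repo" config user.email "user@example.com"

# Read back the individual values.
git -C "$repo" config --get user.name
git -C "$repo" config --get user.email

# List every setting together with the file it comes from.
git -C "$repo" config --list --show-origin
```

A repository-level value takes priority over the global one, which can be handy if you ever need a different email for one particular project.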
Linking GitHub and RStudio
To link GitHub and RStudio together, we need first to open up RStudio. Then go to Tools > Global Options > Git/SVN. Make sure that git.exe resides in the directory that RStudio has specified. Then click OK or Apply and RStudio is linked with Git.
In the same RStudio option window, click “Create RSA Key” and when this completes, click “Close.” After doing this, click “View public key” in the same window and copy the string of numbers and letters. Then close the window again.
You have now created a key that is specific to you, which will be provided to GitHub.
Next, you want to go to github.com/, log in if you have not already, and go to your account settings. Then head over to “SSH and GPG keys” and click “New SSH key.” Paste the public key you copied from RStudio into the Key box and give it a title related to RStudio. Confirm the addition of the key with your GitHub password.
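For the curious, the key that RStudio creates has a command-line counterpart. I am assuming here that the ssh-keygen tool is available (it usually ships with Git Bash on Windows and with macOS and Linux), and I am writing the key to a temporary folder with a made-up file name and label so nothing real gets overwritten. Treat it as a sketch of what happens behind the button, not as the exact command RStudio runs.

```shell
# Temporary folder so no existing key is touched.
keydir="$(mktemp -d)"

# -t rsa : key type
# -N ""  : empty passphrase (fine for a demo, not recommended for real keys)
# -C     : a label stored at the end of the public key
# -f     : where to write the key pair
ssh-keygen -q -t rsa -N "" -C "rstudio-demo" -f "$keydir/id_rsa_demo"

# The .pub file is the public half; its contents are what gets pasted into GitHub.
cat "$keydir/id_rsa_demo.pub"

# After adding a real key to GitHub, this checks the connection (needs network access):
# ssh -T git@github.com
```

The file without the .pub extension is the private key, and it should never be shared or pasted anywhere.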
GitHub and RStudio are now linked. Now let’s create a new repository and test the link.
On GitHub, create a new repository. Then copy the URL for the new repository.
Go to RStudio and create a New Project. Select Version Control and Git as your version control software. Paste in the repository URL from before and select the location where you would like the project stored. When done, click on “Create Project.” Doing so will initialize a new project, linked to the GitHub repository, and open a new session of RStudio.
The next step is to create a new R script and copy and paste the following code:
- print("This file was created within RStudio")
- print("And now it lives on GitHub")
Save the file. Note that when you do so, the default location for the file is within the new Project directory you created earlier.
After saving your script, go to the Git tab of the environment quadrant in RStudio, and you should see the file that you just created. Click the checkbox under “Staged” to stage your file.
Then click “Commit.” A new window that lists all of the changed files from earlier should open. In the upper quadrant, in the “Commit message” box, write yourself a commit message, and click “Commit.”
The final step is to push your commit to the GitHub repository. To do this, click on “Push” and enter your login details if you are asked for them. And just like that, you’re done!
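The Git tab in RStudio is essentially showing you the output of git status. Here is a small sketch of the same stage-and-commit steps on the command line, using a throwaway folder in place of the RStudio project. The file name hello.R and the identity are placeholders I picked.

```shell
# Throwaway folder standing in for the RStudio project.
project="$(mktemp -d)"
git -C "$project" init -q
git -C "$project" config user.name "Example User"
git -C "$project" config user.email "user@example.com"

# The R script from the steps above.
printf '%s\n' \
  'print("This file was created within RStudio")' \
  'print("And now it lives on GitHub")' > "$project/hello.R"

# "??" marks a file Git is not tracking yet.
before="$(git -C "$project" status --short)"

# Staging the file is what ticking the "Staged" checkbox does.
git -C "$project" add hello.R
staged="$(git -C "$project" status --short)"

# Committing with a message, like the "Commit" window.
git -C "$project" commit -q -m "Add first R script"
after="$(git -C "$project" status --short)"

printf 'before: %s\nstaged: %s\nafter:  %s\n' "$before" "$staged" "$after"
```

The push step is left out because this throwaway folder has no remote. In a project cloned from GitHub, it would simply be git push.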
Projects Under Version Control
In the last section, we linked RStudio with Git and GitHub by creating a repository on GitHub and linking it to RStudio. Sometimes, however, you may already have an R Project that isn’t yet under version control or connected with GitHub. Let’s fix that!
First, let’s set up a situation where RStudio is not linked with GitHub. Go to File > New Project > New Directory > New Project, name your project, and click create project. This creates a project not under version control.
To sync it up with version control, open Git Bash or Terminal and navigate to the directory containing your project files by typing cd ~/dir/name/of/path/to/file to move around.
When the text before the dollar sign in the command prompt shows your project’s directory, you are in the right place. Once here, type git init followed by git add . which initializes this directory as a Git repository and stages all of the files in the directory (.) for your first commit. Commit these changes to the Git repository using git commit -m "Initial commit".
Then head over to GitHub.com, and again, create a new repository. Here are two things you need to make sure of:
- Make sure the name is the exact same as your R project
- Do NOT initialize a README file, .gitignore, or license.
Once you create the repository, you should see a page like this:
Copy the lines of code from the “Push an existing repository from the command line” section and paste them into Git Bash or Terminal. Then refresh your GitHub page, and it should now show the files from your project.
Sometimes you may need to work on existing projects that others are working on. To do this, follow the same process as in the last section, where you created an RStudio project from a GitHub repository, except use the repository URL that the other people provide.
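The lines GitHub gives you boil down to two steps: tell your local repository where the remote lives, then push. The exact text on your page may differ, so the sketch below is only my approximation. It uses a bare repository in a temporary folder as a stand-in for the empty GitHub repository, and the project name, file, and identity are hypothetical.

```shell
work="$(mktemp -d)"

# Stand-in for the empty GitHub repository (on GitHub this would be a URL).
git init -q --bare "$work/remote.git"

# An existing project with one commit, set up as described above.
mkdir "$work/myproject"
echo "Version: 1.0" > "$work/myproject/myproject.Rproj"
git -C "$work/myproject" init -q
git -C "$work/myproject" config user.name "Example User"
git -C "$work/myproject" config user.email "user@example.com"
git -C "$work/myproject" add .
git -C "$work/myproject" commit -q -m "Initial commit"

# Name the remote "origin", then push the current branch and track it.
git -C "$work/myproject" remote add origin "$work/remote.git"
git -C "$work/myproject" push -q -u origin HEAD

# The remote now lists the same commit as the local project.
git -C "$work/myproject" ls-remote origin
```

The -u flag records the remote branch as the default for this branch, so later on a plain git push or git pull is enough.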
And that’s Week 3 of The Data Scientist’s Toolbox course from Coursera. Make sure you check the course out!
This course is also a part of the Johns Hopkins University Data Science Specialization from Coursera, which I am taking, so make sure you check that out too.
I hope you enjoyed today’s week of notes. If you have any questions or comments, I would love to hear them in the comment section down below. As always, have a great day!