Woo, welcome to my notes for week four, the final week, of The Data Scientist’s Toolbox course from Coursera. After this first course, I will be moving on to the next course in the Johns Hopkins Data Science Specialization, which will hopefully start to involve more opinionated content and more of my own work.
Let’s get through the fourth week of The Data Scientist’s Toolbox first.
If you would like to catch up and read my first, second, or third week of notes, here they are:
Like always, disclaimers must be made for these types of posts, so here they are:
Number One: I am not an expert on any subject I present in my blog. I am writing these posts as I do the course, so anything I write down here is the set of notes that worked for me. They may not be the best way to do or understand things, nor will they always be right. I do not have any university or professional experience backing me; this is just what’s working for me.
Number Two: For these notes, I am doing one blog post for each week. I will summarize what the instructors say in a way that may or may not be easier to understand and, depending on the week and what is being done, show my solutions and explanations for what I did.
Number Three: These posts are relevant as I write and take the course, and maybe for some time after as well. Online courses like these change quite frequently, and it is quite difficult for me to update these posts (retaking the course would mean paying for it again, and so on). However, if enough people find these posts useful and want an update, I will try my best to update them and help everyone out.
Number Four: Although I just put down three disclaimers looking like I’m trying to protect myself from criticism, I am open to comments and thoughts on these posts. Comments on how my code should have been done, how I should explain things, etc. will help me and everyone reading as well.
From looking at the overview of week 4, it seems this week I will learn to use tools like R Markdown and get an introduction to three incredibly essential concepts for data scientists: asking the right questions, experimental design, and big data.
Let’s start with R Markdown.
R Markdown is a way of creating fully reproducible documents where both text and code can be combined. Since you can easily combine text and code chunks in one document, R Markdown makes it very easy to integrate things like introductions, hypotheses, what code you’re running, the results of your code, and your conclusion all in one document.
This makes sharing the document very simple and allows people to reproduce your research exactly as it was done.
R Markdown also works very well with version control systems. It is easy to track what character changes occur between commits, and thus, it allows multiple collaborators or yourself to edit documents easily.
Another benefit of R Markdown is that it’s super easy to install! All you have to do is run install.packages("rmarkdown") in RStudio.
To create an R Markdown document, go to File > New File > R Markdown in RStudio. Fill in a title and an author and switch the output format to a PDF in the window that pops up. Then create the document.
The default template for R Markdown files has three main sections.
The first is the header at the top, bounded by the three dashes. This is where you can specify details like the title, your name, the date, and others.
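For example, the header of a freshly created document looks something like this (the title, author, and date here are just placeholders; yours will differ):

```yaml
---
title: "My First R Markdown Document"
author: "Your Name"
date: "2020-01-01"
output: pdf_document
---
```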
Then you will next see text sections. For example, one section starts with “## R Markdown.” These sections will render as text when you produce the PDF of this file.
And finally, you will see code chunks. These are bounded by triple backticks and are pieces of R code that you can run right from within your document. When you create the PDF of this document, the code’s output will also show.
So then how do we actually get the PDF?
In R Markdown, you “knit” your plain text and code into your final document. To do so, click the “Knit” button along the top of the source panel; when you do, it will prompt you to save this document as a .Rmd file.
You should then see something like this:
Now let’s get into some of the easy Markdown commands you can use.
To italicize text, surround it with one asterisk on each side; to bold it, use two. To make headers, put a series of hash marks in front of the words you want as a header. The number of hash marks determines the level of the heading, with one hash mark being header one and three hash marks being header three.
Another thing you can do with R Markdown is make a code chunk. To make an R code chunk, type three backticks followed by curly brackets surrounding a lowercase r, put your code on a new line, and end the chunk with three more backticks.
Thankfully, RStudio provides shortcuts, Ctrl+Alt+I (Windows) or Cmd+Option+I (Mac), to do this quickly.
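Putting that together, a minimal code chunk looks like this in the raw .Rmd source (the code inside is just a trivial example):

````markdown
```{r}
# a tiny example: the mean of three numbers
x <- c(1, 2, 3)
mean(x)
```
````

When you knit the document, both this code and its output will appear in the PDF.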
Another thing we can make with R Markdown is bullet lists. Lists are created by preceding each bullet point with a single dash followed by a space. Importantly, end each bullet’s line with two spaces to avoid spacing problems.
- **bold** (bold words)
- *italics* (italics)
- # Header One
- ## Header Two
- ### Header Three
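As a quick reference, here is how those pieces look together in raw .Rmd source (a made-up minimal example):

````markdown
# Header One

Some **bold** text and some *italics*.

- first bullet point  
- second bullet point  
````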
If you would like to see what else you can make, make sure to check out this R Markdown cheatsheet!
Types of Data Science Questions
One of the most important aspects of data science is the questions we ask. There are six primary divisions of data science questions:

- Descriptive
- Exploratory
- Inferential
- Predictive
- Causal
- Mechanistic
Let’s explore some of these categories.
Descriptive analysis: The goal of descriptive analysis is to describe or summarize a set of data. When you get a new dataset to examine, this is usually the first kind of analysis you will perform. The goal is to generate simple summaries about samples and their measurements, like measures of central tendency (e.g., mean, median, mode) or measures of variability (e.g., range, standard deviation, or variance).
This type of analysis will not try to make conclusions. It is purely meant to show a general summary of the data you are looking at. An example of descriptive analysis can be seen in censuses. Governments collect a series of measurements on the citizens and then summarize them.
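As a rough sketch, a descriptive analysis of a built-in R dataset might start with a chunk like this (mtcars ships with R, so this needs no extra data):

````markdown
```{r}
# central tendency: min, quartiles, median, and mean of fuel efficiency
summary(mtcars$mpg)

# variability: range and standard deviation
range(mtcars$mpg)
sd(mtcars$mpg)
```
````

Note that nothing here draws a conclusion; it only summarizes the data.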
Exploratory analysis: Exploratory analysis is meant to explore the data and find relationships that weren’t previously known. This analysis may show how specific measures might be related to each other, but it does not confirm that a relationship is causal. Thus, exploratory analysis is good at finding new connections that can drive future hypotheses and studies.
Inferential analysis: Inferential analysis aims to use a relatively small sample of data to say something about the population at large. For example, a study might use data collected from a sample of the US population to infer how air pollution impacts life expectancy in the entire US.
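As a minimal illustration in R, a one-sample t-test uses a sample to make an inference about a population mean (the variable and the hypothesized value of 20 here are arbitrary placeholders):

````markdown
```{r}
# does the sample support a population mean different from 20?
t.test(mtcars$mpg, mu = 20)
```
````

The output includes a p-value and a confidence interval, which are statements about the population, not just the sample.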
Predictive analysis: The goal of predictive analysis is to use current data to make predictions about future data. You’re mostly looking at patterns in history to try and predict what may happen in the future. An example of predictive analysis is the work of Nate Silver of FiveThirtyEight, who tries to predict the outcomes of U.S. elections and sports matches using historical polling data, trends, and current polling.
Causal analysis: Many of the analyses we’ve looked at only see correlations and can’t get at the cause of the relationships we observe. Causal analysis, like the name suggests, focuses on looking at the cause and effect of a relationship.
These types of analysis are often the most challenging because you can’t always figure out the cause just from data. Many situations in life have too many variables that affect each other, making it quite difficult to isolate the root cause.
Mechanistic analysis: Mechanistic analysis is not as common as the previous analyses because its goal is to understand the exact changes in variables that lead to exact changes in other variables. These analyses are complicated except in simple situations or in those nicely modeled by deterministic equations.
An example is a material science experiment that was examining how biocarbon particle size, functional polymer type, and concentration affected mechanical properties of the resulting “plastic.”
Now that you know and understand some of the analyses you may be doing in data science projects, you need the ability to design proper experiments to answer your questions.
Experimental design is organizing an experiment so that you have enough correct data to clearly and effectively answer your data science question. This process involves clearly formulating your question, designing the best set-up possible to gather the data to answer the question, identifying problems or sources of error in your design, and only then, collecting the appropriate data.
If you don’t plan out what you want to do and what you’re looking for, and then do the wrong analyses, you can come to the wrong conclusions! While sometimes these wrong conclusions don’t matter too much, in high-stakes environments such as medical research, a wrong conclusion could mean harming patients by giving them the wrong medicine.
To make sure that your experimental design is right, you need to understand some principles first.
Independent variable (factor): The variable that the experimenter manipulates. It does not depend on other variables being measured and is often displayed on the x-axis of a graph.
Dependent variable: The opposite of the independent variable. This variable is expected to change as a result of changes in the independent variable. It is often the y-value on a graph.
When you design an experiment, you must decide which variables to measure and formulate a hypothesis: an educated guess about the relationship between the two variables.
Once you have these factors set up, you must also consider sample size (sample size will be discussed in another course) and confounders which are extraneous variables that may affect the relationship between the dependent and independent variables.
For example, if we are looking at the relationship between shoe size and literacy, we may find that any apparent relationship is due to age, which affects both shoe size and literacy.
We need to be able to control confounders in our experiments. In the example above, we can control age in our experiment by measuring age as well or by measuring only people of the same age.
In some types of experiments, a control group may be appropriate. This is when you have a group of experimental subjects that are not manipulated so that you can see the potential differences with a group of subjects that are manipulated.
For example, if we are testing drugs, we can look at a sample that doesn’t take any of the drugs versus a similar sample that takes the drugs and see the differences.
In these studies, confounding effects will also appear. One effect that could be present is the placebo effect, where subjects can feel better simply because they believe they are receiving treatment.
To reduce systematic errors in these types of studies, randomization of the subjects is crucial as well. By randomizing your groups, you lessen the risk of accidentally enriching one group for a confounder.
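In R, random assignment can be as simple as shuffling group labels. A quick sketch (the subject names and group sizes here are made up):

````markdown
```{r}
set.seed(123)  # for reproducibility
subjects <- paste0("subject_", 1:10)
# shuffle five "treatment" and five "control" labels across the subjects
groups <- sample(rep(c("treatment", "control"), each = 5))
split(subjects, groups)
```
````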
Replication is another excellent strategy to reduce the effect of confounders. By repeating an experiment multiple times, you reduce the chance that your data will be skewed by an outlying result. You will get more data, allowing you to see a relationship more clearly and accurately.
Once you’ve collected and analyzed your data, you should also share your findings for other people to analyze and derive results. The Leek group has developed a guide that has great advice on sharing data on GitHub!
One more thing data scientists should be aware of is p-hacking. One value that is often reported in experiments is the p-value. This value tells you the probability of observing results at least as extreme as yours purely by chance, assuming there is no real effect.
Often, when a p-value is less than 0.05, a result is considered significant. But beware: some data scientists manipulate the p-value by running lots of tests until one of them eventually produces a “significant” result. Don’t be guilty of doing this!
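A quick simulation shows why running many tests is dangerous: even on pure noise, roughly 5% of tests will come out “significant” at the 0.05 level just by chance. A sketch in R:

````markdown
```{r}
set.seed(1)
# run 100 t-tests on pure random noise (there is no real effect anywhere)
p_values <- replicate(100, t.test(rnorm(30))$p.value)
sum(p_values < 0.05)  # expect around 5 false positives
```
````

If you run enough tests and report only the ones that cross 0.05, you are guaranteed to “find” something eventually.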
As mentioned in the first week of The Data Scientist’s Toolbox, big data is a field that is growing rapidly in popularity and size.
Here are some of the most essential qualities of big data once again:

- Volume: the sheer size of the data
- Velocity: the speed at which data is generated and collected
- Variety: the many different types and formats of data
The main advantages of big data nowadays are its ability to find more correlations due to the size of the data, limit error due to the amount of data there is, and analyze questions that we weren’t previously able to solve due to data variability.
However, like traditional structured data, we must still find datasets that are suited to our question. No matter how large a dataset may be, if it does not suit our problem, it cannot function for us as data scientists.
And that’s week 4 of The Data Scientist’s Toolbox course from Coursera. Before we conclude, I want to talk a little about the final assignment. Usually (or at least I anticipate) a final project will be much more complicated and have more diverse answers. However, the final project for this course is relatively simple.
I cannot post my answers here as that would violate the Coursera Honor Code, but there is a loophole in which I can post a link to my GitHub profile. However, I believe this learning should be taken seriously, and you should only reference it if you are stuck.
My GitHub profile link: https://github.com/kevinzshan.
And that’s it! Make sure you check out The Data Scientist’s Toolbox course from Coursera as everything in this post is referenced from it.
This course is also part of the Johns Hopkins University Data Science Specialization from Coursera, which I am taking, so make sure you check that out too.
If you have any questions or comments, I would love to hear them in the comment section down below. Stay tuned for the next course in the specialization!