Welcome to my notes for the first week of The Data Scientist’s Toolbox, which is part of the Johns Hopkins Data Science Specialization on Coursera. As this is the first time I’m taking an online course and writing about it on a blog, there are a few things I should say first.
Number One: I am not an expert on any subject I present in my blog. I am writing these posts as I do the course, so everything here is notes that worked for me. They may not be the best way to do or understand things, nor will they always be right. I do not have any university or professional experience backing me; this is just what’s working for me.
Number Two: For these notes, I am doing one blog post for each week. I will summarize what the instructors say in my own words, which may or may not be easier to understand, and, depending on the week’s material, show my solutions and explain what I did.
Number Three: These posts are relevant as I write and take the course, and maybe for some time after as well. Online courses like these change quite frequently, and it is difficult for me to update these posts (paying for the course again, retaking the course, etc. are all obstacles). However, if enough people find these posts useful and want an update, I will try my best to update them and help everyone out.
Number Four: Although I just put down three disclaimers looking like I’m trying to protect myself from criticism, I am open to comments and thoughts on these posts. Comments on how my code should have been done, how I should explain things, etc. will help me and everyone reading as well.
With the disclaimers finally out of the way, let’s get into the actual course notes.
I skipped the welcome section because there wasn’t anything worthwhile to note. It is just a welcome and an explanation of why the course uses automated videos with robotic narrators. Pretty interesting, but not much to take notes on.
Worthwhile things do actually start in the next section: What is Data Science?
What is Data Science?
Data science is a broad field because at its core, it’s using data to answer questions.
A data scientist’s daily work can involve statistics, computer science, math, data cleaning and formatting, and data visualization. There are even separate jobs that focus on individual aspects, such as data engineer and machine learning engineer.
An Economist Special Report sums up a broad data scientist’s skills pretty well:
“A data scientist is someone who combines the skills of a software programmer, statistician, and storyteller/artist to extract the nuggets of gold hidden under mountains of data.”
Drew Conway also has a Venn diagram that sums up the exact skills a data scientist should have too. Here’s the picture of the Venn diagram:
Basically, a data scientist’s skill set is a combination of computer science prowess (in other words, hacking skills), math and statistics knowledge, and knowledge of the field they are working in, for example business, sports, or finance.
Programming skills, or at least programming with R in this course, are needed to access data, manipulate it, and then present it. Math and statistics are required to analyze the data and derive solutions from it. Substantive expertise is needed to ask the right questions and find solutions that benefit the field a data scientist is working in.
There are many places where data scientists can work. One example is Daryl Morey, the general manager of the Houston Rockets basketball team in the United States. Although Daryl Morey did not have a strong background in basketball, he was chosen as general manager for his ability to collect and analyze data and use it to make informed hiring decisions, coupled with a bachelor’s degree in computer science and an MBA from M.I.T.
Another example is Hilary Mason. She is a co-founder of Fast Forward Labs (a machine learning company recently acquired by Cloudera, a data science company) and is the Data Scientist in Residence at Accel. Hilary Mason uses data to answer questions about mining the web and understanding the way humans interact with each other through social media.
There are many more examples of data science out there. It is a field that is still growing and has the capability to do amazing things. Some people use data science to predict presidential elections, others can find flu outbreaks based upon 46 words, and so on.
And the data science field is only growing! We need data science because of the vast amount of data that is currently available and being generated. Rising abilities of computers and software have also allowed data scientists to use and analyze data much more effectively.
Another sign of how much the data science field is growing is the area of big data.
A few qualities characterize big data. As the name suggests, big data involves large datasets. How large? As large as 300 hours of video being uploaded every minute.
The second quality of big data is the velocity of the information being generated. Data is being created and collected faster than ever before. Think of everything you can track in real time, like the location of a package you ordered. You could analyze that data if you had the tools and knowledge.
The third quality of big data is variety. There are lots of different types of data available to us. In the video example above, we can look at things like audio and video, which are unstructured, or at things like view count, the number of comments, or the number of likes, which are very structured data.
With so much data around the world, the demand for data scientists is enormous! Machine learning engineers, data scientists, and big data engineers were among the top emerging jobs in 2017, according to LinkedIn.
To quote the report: “Data scientist roles have grown over 650 percent since 2012, but currently 35,000 people in the US have data science skills, while hundreds of companies are hiring for those roles – even those you may not expect in sectors like retail and finance – supply of candidates for those roles cannot keep up with the demand.”
Additionally, Glassdoor ranked the Data Scientist job as the number one job in America in 2017 based on job satisfaction, salary, and demand.
This is why learning data science could be one of the best moves for your career.
What is Data?
Definitions of “data”:
Cambridge English Dictionary: Information, especially facts or numbers, collected to be examined and considered and used to help decision-making.
Wikipedia: A set of values of qualitative or quantitative variables.
Both sources agree that data is values, numbers, or facts. The Cambridge definition focuses on the actions that surround data, which are the most essential part of a data scientist’s work, while the Wikipedia definition focuses on what data actually includes.
If we look at the Wikipedia definition, three components can be split out and examined individually: “a set of values,” “variables,” and “qualitative or quantitative.”
A set of values is a collection of items, often referred to as the population in statistics. It is the set of things you are trying to discover something about or, more generally, the things you will make measurements on.
Variables are measurements or characteristics of an item. For example, you could be measuring the height of a person, the amount of time a person stays on a website, what a person clicks on a website, what gender they are, etc.
Lastly, we have to divide variables into qualitative and quantitative groups. Qualitative variables describe an item’s characteristics, usually through words; examples are country of origin, sex, and treatment group. Quantitative variables, on the other hand, describe quantities. They are usually expressed as numbers and include things like height, weight, and blood pressure.
Now that we have gone through a few definitions of what data is, it is time to look at a few examples of data sets.
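To make the split concrete, here is a minimal sketch of sorting variables into the two groups. It is written in Python rather than the course's R, and the records, field names, and the `classify_variables` helper are all made up for illustration. The rule it uses (numbers are quantitative, everything else is qualitative) is only a rough heuristic.

```python
# Hypothetical records: field names and values are invented for illustration.
people = [
    {"country": "Canada", "treatment_group": "control", "height_cm": 170.2, "weight_kg": 65.0},
    {"country": "Brazil", "treatment_group": "treated", "height_cm": 182.5, "weight_kg": 80.3},
    {"country": "Japan", "treatment_group": "treated", "height_cm": 158.0, "weight_kg": 52.1},
]


def classify_variables(records):
    """Label each variable as quantitative (all values numeric) or qualitative (otherwise)."""
    kinds = {}
    for name in records[0]:
        values = [record[name] for record in records]
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        kinds[name] = "quantitative" if is_numeric else "qualitative"
    return kinds


for variable, kind in classify_variables(people).items():
    print(f"{variable}: {kind}")
```

The heuristic can be fooled: a postal code or an ID number is stored as a number but really behaves like a qualitative label, so the final call still needs a human who knows the data.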
When we envision data, we often think of beautiful tables with headings describing what everything is below and ordered in helpful grids such as the picture shown below.
But data will often come to a data scientist in messier forms, and it is the data scientist’s job to clean and organize it into something more readable.
Examples of messy data include:
- Sequencing data
- Population census data
- Electronic medical records (EMR), other large databases
- Geographic information system (GIS) data (mapping)
- Image analysis and image extrapolation
- Language and translations
- Website traffic
- Personal/Ad data (e.g., Facebook, Netflix predictions, etc.)
Before this section ends, it must be emphasized that data is of secondary importance. A good data scientist must first ask a question before looking for data, because a data scientist with data but no question cannot do anything.
Getting Help
Knowing how to solve problems is a crucial skill for a data scientist, and for many other jobs too. It is also a skill that will come in handy in this course.
Before you ask for help, you should try to solve the problem on your own first. There are a few steps you can take.
- Read the manuals or help files
- Search Google and relevant forums (Stack Overflow and Cross Validated are common forums for data science)
Coding problems commonly fall into two categories: either your command produces no output and spits out an error message, or your command produces an output, but it is not at all what you wanted.
If your problem produces an error message, here are some strategies you can use:
- Check your code for typos
- Read the error message and make sure you understand it
- Google the error message, exactly
If your problem gets an output that you didn’t expect:
- Consider how the output was different from what you expected
- Think about what it looks like the command actually did and why it would do that instead of what you wanted
These steps will solve most of your problems relatively quickly.
Another way to solve problems is a method called “rubber duck debugging.” In this method, you explain your coding problem to someone else or to an inanimate object. By describing what the code is supposed to do and observing what it actually does, you may well stumble upon the solution.
But let’s say you tried all of the steps above and you are still stuck. Then it is time to ask other people.
Besides asking people you know who are knowledgeable in data science, you can ask on forums. There are some rules you should follow when posting to forums, starting with how you ask your question. Make your question as specific as possible. Here are some details to include:
- The question you are trying to answer
- How you approached the problem, what steps you have already taken to answer the question
- What steps will reproduce the problem (including sample data for troubleshooters to work from!)
- What was the expected output
- What you saw instead (including any error messages you received!)
- What troubleshooting steps you have already tried
- Details about your set-up, e.g., what operating system you are using, what version of the product you have installed (e.g., R, R packages)
Be specific in the title of your questions as well!
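As a small illustration of the "details about your set-up" item, here is a sketch that gathers the operating system, language version, and package versions in one place so they can be pasted into a question. It is in Python rather than the course's R (in R, `sessionInfo()` reports similar details), and the `describe_setup` helper is a hypothetical name of my own, not something from the course.

```python
import platform
from importlib import metadata


def describe_setup(packages=()):
    """Collect basic environment details worth pasting into a forum question."""
    details = {
        "operating_system": platform.platform(),
        "python_version": platform.python_version(),
    }
    for name in packages:
        try:
            details[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Saying a package is missing is useful information too.
            details[name] = "not installed"
    return details


# Example: report the version of pip, if it is installed.
for key, value in describe_setup(["pip"]).items():
    print(f"{key}: {value}")
```

Pasting this kind of output under your question saves helpers from having to ask for it.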
On forums, you should also follow forum etiquette. People are taking time out of their day to help you, so you should be as helpful and courteous as you can in return. Here are some things you should do:
- Read the forum posting guidelines
- Make sure you are asking your question on an appropriate forum!
- Describe the goal
- Be explicit and detailed in your explanation
- Provide the minimum information required to describe (and replicate) the problem
- Be courteous! (Please and thank you!)
- Follow up on the post OR post the solution
Here are some things you shouldn’t do:
- Immediately assume you have found a bug
- Post homework questions
- Cross post on multiple forums
- Repost if you don’t immediately get a response
And that’s it! Remember always to try and work out a solution to a problem the best you can before asking others.
The Data Science Process
After covering what data science is and what data is, we should also cover how to go about a data science project. Here are the steps of a data science project:
- Form the question
- Find or generate the data to be used
- Analyze the data first by exploring and then often modeling the data (using statistical analysis and perhaps machine learning)
- Communicate your findings to others (e.g., presentation, blog post, report to a boss, etc.)
An example of a data science project is from a data scientist named Hilary Parker and her project titled, “Hilary: the most poisoned baby name in US history.”
First, Hilary defines her question. Since this is the most critical part of a data science project, it should be well-defined. Hilary does this by bolding the question in her blog post. The question she was trying to answer was: “Is Hilary/Hillary really the most rapidly poisoned name in recorded American history?”
The next step is to find and gather data. Hilary uses the Social Security website to collect data and uses a dataset that included the 1,000 most popular baby names from 1880 until 2011.
With the data in hand, Hilary then proceeds to explore and model it. One of the first things she does is look at the names with the largest percentage drop from one year to the next. Then, based on what she found in that table, she plots the data to analyze the names further.
Finally, she reaches her conclusion with a plot showing that Hilary had the fastest fall of any girl’s name between 1880 and 2011.
Then, most importantly, she communicates her findings to the world through her blog post.
There are a few other things that go into data science projects that should also be noted. One is that you should give credit where credit is due. For example, in her project Hilary linked to the blog post that her question was based on, the Social Security website where she got the data, and the place where she learned about web scraping.
Another thing that may not appear in the final results, but is important to the project nonetheless, is finding the method for analyzing the data. With so much data out there, it can be hard to decide how you will solve your problem. Working out which data is essential and which isn’t is all a part of the process, even though it might not be something you show in your final project.
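The exploration step described above, finding the names with the biggest percentage drop from one year to the next, can be sketched roughly as follows. This is my own illustration in Python, not Hilary Parker's code (the course itself uses R). The names, the numbers, and the `largest_relative_drops` helper are all invented, and the values are NOT the Social Security figures.

```python
# Hypothetical data: percentage of babies given each name, per year.
# Names and numbers are made up for illustration only.
name_share = {
    "Alpha": {1990: 0.50, 1991: 0.45, 1992: 0.44},
    "Beta": {1990: 0.40, 1991: 0.10, 1992: 0.09},
    "Gamma": {1990: 0.20, 1991: 0.22, 1992: 0.11},
}


def largest_relative_drops(shares):
    """Return (name, year, relative_change) for each name's steepest
    one-year change, sorted with the steepest fall first."""
    results = []
    for name, by_year in shares.items():
        years = sorted(by_year)
        worst = None
        # Compare each recorded year with the one before it.
        for prev, curr in zip(years, years[1:]):
            if by_year[prev] == 0:
                continue  # relative change is undefined from a zero base
            change = (by_year[curr] - by_year[prev]) / by_year[prev]
            if worst is None or change < worst[1]:
                worst = (curr, change)
        if worst is not None:
            results.append((name, worst[0], worst[1]))
    return sorted(results, key=lambda row: row[2])


for name, year, change in largest_relative_drops(name_share):
    print(f"{name}: {change:.0%} change in {year}")
```

Using the relative change rather than the absolute one matters here: a name falling from 0.40 to 0.10 lost three quarters of its share, which is a much steeper fall than a more popular name losing the same number of percentage points.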
And with all of the above, you should be on your way to 100% for week 1 of The Data Scientist’s Toolbox. Remember that anything listed here is what I learned from the course myself and you should take the course yourself for the actual material.
The Data Scientist’s Toolbox course on Coursera is also a part of the Data Science specialization from Johns Hopkins University, which can be found on Coursera as well.
If you have any questions on anything I talked about here, please leave a comment in the comment section below. As always, go learn something new and have a great day!