Data Science Tutorial for Beginners
In this tutorial, we will learn about data science in an easy and versatile way. By the end of the course, we will understand what data analysis means, the information and tools involved, data analysis models, and the different business model instruments.
We will contrast data analysis with Big Data and see how Big Data is revolutionizing data analytics. Big Data is as simple as it sounds: the huge volume of data that is produced every day. With this much data, basic Microsoft tools could no longer hold and process everything. Tools had to get as “BIG” as the data they handle, so new and better tools became necessary to manage Big Data. This prompted the adoption of tools such as Python, Hadoop, and SAS.
You must have heard of Flipkart’s “Big Billion Days” and Amazon’s “The Great Indian Festival”. Processing such a huge volume of orders can be problematic. Data science is useful for examining what users prefer and which sets of orders are more popular than others. Such questions can be addressed with standard regression techniques and algorithms. One can apply data science methods to the order data and discover trends in a particular industry. These regressions are useful for forecasting business sales and predicting long-term growth. We will not get too mathematical. Let us unpack the complexity.
Let’s take the example of Flipkart and Amazon again. Suppose we want to:
- predict trends in upcoming demand
- optimize cart prices
- manage inventory
- categorize loyal customers
- provide a systematic shopping experience
There are 5 major types of regression, but the list is not limited to these, as one can devise one's own regression methods. You will also be able to do that once you master data science. Of all these, linear regression is the most widely used.
Simple linear regression assumes a linear relationship between a single predictor X and a quantitative response Y:
Y = β0 + β1X + ε
where ε is a random error term that is independent of X and has mean 0.
- β0 and β1 are the model coefficients, or parameters.
- Example: X is a specific item, e.g. the iPhone 8, and Y is sales:
sales = β0 + β1 × iPhone + ε
- We use data to obtain the estimates β̂0 and β̂1. The prediction for X = x is then given by
ŷ = β̂0 + β̂1x
Estimating the Coefficients
- Data used: (x1, y1), …, (xn, yn)
- In the Advertising data set, n = 200 different sellers.
- Goal: find β̂0 and β̂1 such that yi ≈ β̂0 + β̂1xi for all i = 1, …, n
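A common way to reach this goal is ordinary least squares, which picks the estimates that minimize the sum of squared differences between each yi and its prediction. For a single predictor this has a closed form: β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and β̂0 = ȳ − β̂1x̄, where x̄ and ȳ are the sample means. The sketch below is a minimal illustration in plain Python. The function names and the five data points are made up for the example and are not taken from the Advertising data set.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1), the least-squares intercept and slope."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx               # slope estimate
    b0 = y_mean - b1 * x_mean    # intercept estimate
    return b0, b1


def predict(b0, b1, x):
    """Prediction for a new x using the fitted line."""
    return b0 + b1 * x


# Hypothetical data that lies exactly on the line y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

b0, b1 = fit_simple_linear_regression(xs, ys)
print(b0, b1)                 # intercept and slope
print(predict(b0, b1, 6.0))   # prediction for x = 6
```

Because the sample points sit exactly on a line, the fit recovers an intercept of 1 and a slope of 2. With real, noisy data the estimates would only approximate the underlying relationship.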
Regression Analysis by Simple/Multiple Linear Regression
- A relationship between Dependent (Output) variables and Independent (Input) variables
- Multiple Regression: More variables, or transformations/high order extensions of the same input
- Example: sales as a function of population density and number of customers
For multiple regression, the number of independent variables must be greater than 1.
With x as the independent variable and y as the dependent variable, we can fit a straight line, also known as the regression line.
It is characterized by the equation:
Y = a + b*X + e,
where a denotes the intercept, b denotes the slope of the line, and e denotes the error term.
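To illustrate the multiple regression case, the sketch below fits sales against two inputs, population density and number of customers, using NumPy's least-squares solver (numpy.linalg.lstsq). All numbers are invented for the example: the sales column is generated from known coefficients so that the result can be checked, and the variable names are assumptions rather than fields of any real data set.

```python
import numpy as np

# Hypothetical inputs, one entry per region
population_density = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
num_customers = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

# Sales generated from known coefficients:
# sales = 10 + 2 * density + 3 * customers
sales = 10.0 + 2.0 * population_density + 3.0 * num_customers

# Design matrix: a column of ones for the intercept, then one column per input
X = np.column_stack([
    np.ones_like(population_density),
    population_density,
    num_customers,
])

# Least-squares solution of X @ coef ≈ sales
coef, residuals, rank, singular_values = np.linalg.lstsq(X, sales, rcond=None)
intercept, b_density, b_customers = coef
print(intercept, b_density, b_customers)
```

The column of ones plays the role of the intercept a, and each remaining column gets its own slope, which is the multi-input version of Y = a + b*X + e.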
Methods used in Data Science
Data Science with Python
Python is one of the most popular programming languages and is widely used in data analysis. It is an open-source, object-oriented language. A common myth is that to become a data scientist you need an excellent command of Python, which is not true. Most of the bulky books on Python are written for general readers who want to sharpen their Python programming skills, but here your aim is to use Python for data science.
Several libraries make data analysis in Python simple, since many of the functions and procedures you need are built in. The most widely used are pandas and SciPy.
These are used for general-purpose data analysis. You can load your data into data frames, select columns, filter for specific values, group by values, run functions (sum, mean, median, min, max, etc.), merge data frames, and so on. You can also create multi-dimensional data tables. Fun fact: you can't do this as easily with plain Python alone.
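The operations listed above can be sketched with pandas roughly as follows. The tables, column names, and values are made up for illustration.

```python
import pandas as pd

# Hypothetical order data loaded into a data frame
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "category": ["phone", "phone", "book", "book", "toy"],
    "price": [700.0, 650.0, 15.0, 25.0, 40.0],
})

# Select columns
prices = orders[["category", "price"]]

# Filter for specific values
expensive = orders[orders["price"] > 30.0]

# Group by a column and run functions on each group
summary = orders.groupby("category")["price"].agg(["sum", "mean", "max"])

# Merge with a second data frame on a shared column
categories = pd.DataFrame({
    "category": ["phone", "book", "toy"],
    "department": ["electronics", "media", "kids"],
})
merged = orders.merge(categories, on="category", how="left")

print(summary)
print(merged)
```

A left merge keeps every row of the orders table and attaches the matching department, which is usually what you want when enriching transaction data with lookup information.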
Doing calculations in Python with pandas is easy. Let’s try it out.
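Here is a small sketch of such calculations. The same arithmetic operators work on a plain Python number and, element by element, on a pandas Series. The value 7 and the contents of the series are arbitrary choices for illustration.

```python
import pandas as pd

x = 7  # arbitrary value chosen for the example

print(x + 3)    # addition
print(x - 3)    # subtraction
print(x * 3)    # multiplication
print(x ** 2)   # exponentiation
print(x % 3)    # remainder
print(x / 2)    # division

# The same operators apply element-wise to a pandas Series
s = pd.Series([7, 10, 15])
doubled = s * 2       # each element multiplied by 2
remainders = s % 3    # remainder of each element divided by 3
print(doubled.tolist())
print(remainders.tolist())
```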
We have seen through the above example how, by assigning a value to x, we can easily perform addition, subtraction, multiplication, exponentiation, remainder, and division.
Data Science with Hadoop and SAS:
Hadoop is another open-source framework, used for storing data and running applications, that aims to provide a complete solution from processing to visualization of the data. It offers massive storage for any kind of data and powerful processing capacity, and it speeds up data processing by handling a very large number of concurrent tasks.
SAS, on the other hand, is used for statistical analysis. It reads data from common spreadsheets and databases and produces output in the form of tables, graphs, and RTF, HTML, and PDF documents. Be warned that SAS is proprietary, so it can only be used with SAS products. Hadoop, being open source, has no such restriction.
Hence, we can conclude that data science is the combination of statistics and computation. Analyzing huge volumes of data requires statistical methods backed by computational tools. The digital revolution has produced large and varied sets of data, and data science methods make it possible to process them.