What is Skewness?
Skewness measures the asymmetry, or imbalance, in a probability distribution. It tells us how far the data tilts towards one side compared with a perfectly symmetrical distribution such as the Normal distribution. Skewness can occur on either side.
A distribution is called a Normal distribution when the data is perfectly symmetrical and forms the familiar “bell-shaped curve”, with the mean, median, and mode all taking the same value. The tails on either side mirror each other, which means the skewness is exactly zero. On a lighter note, it is almost impossible to find perfectly normally distributed data in the real world. Before discussing skewness, we need to know the basic measures of central tendency and the standard deviation, which are used to understand the data:
Mean is the average of the data: the sum of all values divided by the number of values.
Median is the central value of the data. It is found by arranging the data in ascending order and taking the value at the middle position, (N+1)/2.
If the data has an even number of points, the mean of the two middle values is taken as the median.
Mode is the most frequently occurring value; it is mostly used with categorical variables.
Standard deviation is the square root of the average squared distance of the data from the mean. It describes how consistently the data clusters around the mean.
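The four measures above can be computed with Python's standard `statistics` module. This is a minimal sketch; the sample values are made up for illustration, and the population standard deviation (`pstdev`) is an assumption, since the text does not say whether a sample or population formula is meant.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
data = [2, 3, 3, 4, 5, 5, 5, 6, 9]

mean = statistics.mean(data)        # sum of values / number of values
median = statistics.median(data)    # middle value of the sorted data
mode = statistics.mode(data)        # most frequent value
std_dev = statistics.pstdev(data)   # population standard deviation (assumed)

# With an even number of points, the median is the mean of the two middle values.
even_median = statistics.median([1, 2, 3, 4])

print(mean, median, mode, std_dev, even_median)
```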
If the data is skewed, the three measures of central tendency mentioned above differ from each other, depending on the type of skewness. There are two kinds of skewness:
Positively skewed Distribution:
In a positively skewed distribution, the mean is strongly affected by the extreme values in the right tail, so it typically ends up larger than the median and the mode. This distribution is also called right-skewed, and it usually satisfies Mean > Median > Mode.
Negatively skewed Distribution:
In contrast to a positively skewed distribution, this one has most of its values clustered on the right side, with a few extremely small values in the left tail that pull the mean down to the left of the peak. This distribution is also called left-skewed, and its measures of central tendency usually follow the order Mean < Median < Mode.
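The two orderings can be checked on small samples. The sketch below uses hypothetical data and a helper named `summarize` that is introduced only for this illustration. The ordering is a common pattern for unimodal data, not a guarantee for every distribution.

```python
import statistics

def summarize(values):
    """Return (mean, median, mode) for a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.mode(values),
    )

# Hypothetical right-skewed sample: most values are small, a few are large.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]

# Mirroring the sample around a constant moves the long tail to the left.
left_skewed = [16 - x for x in right_skewed]

r_mean, r_median, r_mode = summarize(right_skewed)
l_mean, l_median, l_mode = summarize(left_skewed)

print("right-skewed:", r_mean, r_median, r_mode)  # mean > median > mode
print("left-skewed: ", l_mean, l_median, l_mode)  # mean < median < mode
```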
An example to understand skewness:
Suppose your company pays you a quarterly performance bonus that is nominally Rs. 10,000 but varies in practice. Assume you have worked at the company for the past 15 years and you plot the distribution of all the variable pay you have received.
If the distribution turns out to be positively skewed, it means that most of the time you received less variable pay than the mean, but on a few occasions you received very high payouts. This still works in your favour, because the values in the right tail can help you earn more, on average, than the company promised.
Note: the reverse of this example applies if your distribution turns out to be negatively skewed.
Why should we care about skewness?
We have already said that real-life data is rarely perfect and is usually skewed to one side. Before discussing how much skewness is harmful to our results, let us look at why skewness causes problems in the first place.
Skewed data often signals the presence of extreme values (outliers), which can strongly affect a model's performance. This is especially true for regression-based models, because they rely heavily on the mean and standard deviation to make predictions. Some tree-based models are far less sensitive to outliers and can still give good results on skewed data, but if we leave the skewness untreated we lose the chance to try other kinds of models for prediction.
The formula for measuring Skewness:
There are several ways to calculate the skewness of a distribution: we can use the mean, the mode, or the median.
The usual formula, based on the mean of the distribution, uses the following quantities:
- Xi = ith value of the distribution
- X̄ = mean of the distribution
- N = number of values in the data set
- σ = standard deviation of the distribution
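With these quantities, a common way to write the moment-based coefficient is skewness = Σ(Xi − X̄)³ / (N · σ³). Some textbooks and libraries divide by N − 1 or apply other small-sample corrections, so this form is an assumption rather than the only definition. The helper name `moment_skewness` and the sample lists are hypothetical.

```python
import statistics

def moment_skewness(values):
    """Third central moment divided by the cube of the population std. dev."""
    n = len(values)
    mean = sum(values) / n
    sigma = statistics.pstdev(values)  # population standard deviation (assumed)
    if sigma == 0:
        raise ValueError("skewness is undefined when all values are equal")
    third_moment = sum((x - mean) ** 3 for x in values) / n
    return third_moment / sigma ** 3

symmetric = [1, 2, 3, 4, 5]            # no tilt to either side
right_tail = [1, 2, 2, 3, 10]          # one large value on the right
left_tail = [-x for x in right_tail]   # mirror image of right_tail

print(moment_skewness(symmetric))   # zero for a symmetric sample
print(moment_skewness(right_tail))  # positive
print(moment_skewness(left_tail))   # negative, same magnitude
```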
Pearson developed two coefficients that measure skewness using the mode and the median, called SK1 and SK2 respectively. SK1 is preferred when the distribution has a strong mode, and SK2 is preferred when the mode is weak.
µ = mean of the distribution, σ = standard deviation of the distribution, MD = median, MO = mode
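In their standard textbook form, Pearson's coefficients are SK1 = (µ − MO) / σ and SK2 = 3(µ − MD) / σ. The sketch below computes both. The function names and sample data are hypothetical, and using the population standard deviation is an assumption.

```python
import statistics

def pearson_sk1(values):
    """Pearson's first coefficient: (mean - mode) / standard deviation."""
    sigma = statistics.pstdev(values)
    return (statistics.mean(values) - statistics.mode(values)) / sigma

def pearson_sk2(values):
    """Pearson's second coefficient: 3 * (mean - median) / standard deviation."""
    sigma = statistics.pstdev(values)
    return 3 * (statistics.mean(values) - statistics.median(values)) / sigma

# Hypothetical samples for illustration.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]
symmetric = [1, 2, 3, 3, 3, 4, 5]

print(pearson_sk1(right_skewed), pearson_sk2(right_skewed))  # both positive
print(pearson_sk1(symmetric), pearson_sk2(symmetric))        # both zero
```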
Interpretation from results:
- The sign determines the type of skewness, i.e. whether it is positively or negatively skewed
- The coefficient shows how far the sample distribution departs from the symmetry of a normal distribution. A value of zero means there is no asymmetry, although a symmetric distribution is not necessarily normal.
- The magnitude of the coefficient indicates how strongly the distribution is skewed.
To what extent can a model tolerate skewness?
There are some rules of thumb in statistics for skewness:
- If the skewness lies between -0.5 and 0.5, the data is nearly symmetrical.
- If the skewness lies between -1 and -0.5 (negatively skewed) or between 0.5 and 1 (positively skewed), the data is considered moderately skewed.
- If the skewness is less than -1 or greater than 1, the data is considered highly skewed.
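These rules of thumb translate directly into a small helper. The function name `describe_skewness` is hypothetical, and the handling of the exact boundary values (0.5 and 1) is an assumption, since the rules do not say which bucket they fall into.

```python
def describe_skewness(skew):
    """Map a skewness coefficient to the rule-of-thumb label."""
    magnitude = abs(skew)
    if magnitude <= 0.5:  # boundary assigned to "nearly symmetrical" (assumed)
        return "nearly symmetrical"
    direction = "positive" if skew > 0 else "negative"
    if magnitude <= 1:    # boundary assigned to "moderately skewed" (assumed)
        return f"moderately skewed ({direction})"
    return f"highly skewed ({direction})"

for value in (0.2, 0.8, -1.7):
    print(value, "->", describe_skewness(value))
```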
Transformation of data to reduce skewness:
Transforming data means reshaping it so that it looks as close to normal as possible, since many classical statistical models assume approximately normally distributed data.
Transformations are sometimes needed when the data is moderately skewed, and they are essential when it is highly skewed if you want to try classic regression models and expect their results to be unbiased.
Suppose we have salary data for the staff of Besant Technologies, with the salaries of Ambani and Adani included. The mean will be strongly affected by those two salaries: it moves to the higher side and the data becomes positively skewed. We therefore need to transform the data so that those two outliers move towards the middle of the distribution and it becomes almost normal.
Many transformations are available, with different techniques suited to different levels and types of skewness.
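The effect of the two extreme salaries on the mean can be seen with a few made-up numbers. All figures below are hypothetical and serve only to show how the mean moves while the median barely changes.

```python
import statistics

# Hypothetical monthly staff salaries (illustrative values only).
staff_salaries = [30_000, 32_000, 35_000, 38_000, 40_000, 42_000, 45_000]

# The same data with two hypothetical extreme earners added.
with_outliers = staff_salaries + [50_000_000, 60_000_000]

staff_mean = statistics.mean(staff_salaries)
staff_median = statistics.median(staff_salaries)
outlier_mean = statistics.mean(with_outliers)
outlier_median = statistics.median(with_outliers)

print("staff only:   ", staff_mean, staff_median)
print("with outliers:", outlier_mean, outlier_median)
```

The median shifts by one position in the sorted list, while the mean is dragged far into the right tail.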
Transformations for left skewness:
To reduce left skewness, we can use a square (convert X to X^2), cube (convert X to X^3), or higher power (convert X to X^N) transformation, depending on the level of skewness.
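A minimal sketch of the power transformations on a hypothetical left-skewed sample. It assumes non-negative data (powers are not order-preserving for mixed-sign values) and uses a moment-based `moment_skewness` helper defined here for illustration.

```python
import statistics

def moment_skewness(values):
    """Third central moment divided by the cube of the population std. dev."""
    n = len(values)
    mean = sum(values) / n
    sigma = statistics.pstdev(values)
    return sum((x - mean) ** 3 for x in values) / n / sigma ** 3

# Hypothetical left-skewed sample: most values are high, one is very low.
data = [1, 7, 8, 9, 9, 10, 10, 10]

squared = [x ** 2 for x in data]
cubed = [x ** 3 for x in data]

skew_original = moment_skewness(data)
skew_squared = moment_skewness(squared)
skew_cubed = moment_skewness(cubed)

# For this sample, each higher power moves the skewness closer to zero.
print(skew_original, skew_squared, skew_cubed)
```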
Transformations for right skewness:
To reduce right skewness, we can use a square root (convert X to X^(1/2)), cube root (convert X to X^(1/3)), or logarithm transformation, depending on the level of skewness.
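The same comparison can be sketched for right-skewed data. The sample is hypothetical and strictly positive, which the logarithm requires. The `moment_skewness` helper is again an illustrative moment-based definition.

```python
import math
import statistics

def moment_skewness(values):
    """Third central moment divided by the cube of the population std. dev."""
    n = len(values)
    mean = sum(values) / n
    sigma = statistics.pstdev(values)
    return sum((x - mean) ** 3 for x in values) / n / sigma ** 3

# Hypothetical right-skewed sample with strictly positive values.
data = [1, 1, 2, 2, 3, 4, 6, 10, 25, 100]

sqrt_data = [math.sqrt(x) for x in data]   # X^(1/2)
cbrt_data = [x ** (1 / 3) for x in data]   # X^(1/3), valid for positive x
log_data = [math.log(x) for x in data]     # natural logarithm

skew_original = moment_skewness(data)
skew_sqrt = moment_skewness(sqrt_data)
skew_cbrt = moment_skewness(cbrt_data)
skew_log = moment_skewness(log_data)

# For this sample, the stronger the transformation, the lower the skewness.
print(skew_original, skew_sqrt, skew_cbrt, skew_log)
```

In this sample the logarithm reduces the skewness the most, which is why it is often tried first on heavily right-skewed data.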