Data Science, Machine learning, and AI are just glorified Statistics!

I came across this meme while scrolling through social media, and it could not be further from the truth: Data Science, ML, and AI are far more than “glorified statistics.” They are some of the most hyped topics lately, and yet, if you pursue them, you’ll realize that statistics sits at their very core.

We study statistics in secondary education, and again in college if we take up engineering, but we don’t stress its importance enough. Statistics is a diverse field in its own right. So here are some of the very basic terms and measures that will get you started.

Measures of location (or as we call it — measures of central tendency)

  • Mean
  • Median
  • Mode
  • Percentile
  • Quartile

But how much of their real-life application or purpose do we understand? What do we infer when we apply each of these formulas? Let’s learn that here…

First, the mean. What do we mean by mean? We add up all the data values and divide by the total number of values.

So here, every value is taken into consideration, including extreme values and outliers. The mean, sometimes also called the average, is the central value around which the data clusters, which is exactly why a single outlier can drag it sharply in one direction.
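A quick sanity check, using Python’s standard-library `statistics` module (my choice for all the sketches in this post):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

print(statistics.mean(data))         # 40 / 8 = 5

# A single extreme value drags the mean towards it:
print(statistics.mean(data + [95]))  # 135 / 9 = 15
```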

The median, on the other hand, is literally the middle term, irrespective of the values on either side. That is, when the values are arranged in ascending or descending order, the middle term is the median of the data set.

In other words, half of the values in the data set are higher than (or equal to) the median, and half are lower than (or equal to) it.

This value does not vary when the extreme values vary. It changes only when the order or the number of values changes.

For example : 1, 2, 3, 4, 5, 6, 7 — the median is 4

Now consider 1, 2, 3, 4, 7, 7, 7: the median is still 4, even though the upper values have changed.

That is because, as previously mentioned, the median is simply the term that divides the entire data set into two halves: one in which the values are greater than or equal to it, and one in which they are less than or equal to it.
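The same examples, checked in Python (again with the standard-library `statistics` module):

```python
import statistics

print(statistics.median([1, 2, 3, 4, 5, 6, 7]))  # 4
print(statistics.median([1, 2, 3, 4, 7, 7, 7]))  # still 4

# With an even number of values there is no single middle term,
# so the median is the average of the two middle terms:
print(statistics.median([1, 2, 3, 4]))           # 2.5
```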

The mode is the easiest concept to grasp. It is the most frequently occurring term in the data set. Sometimes a data set has no mode, whereas other times it can have multiple modes.
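In Python, `statistics.mode` picks a single mode, and `statistics.multimode` (Python 3.8+) handles the multiple-mode case:

```python
import statistics

print(statistics.mode([1, 2, 2, 3]))          # 2
print(statistics.multimode([1, 1, 2, 2, 3]))  # [1, 2] -- two modes
print(statistics.multimode([4, 5, 6]))        # [4, 5, 6] -- nothing repeats
```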

The next measure of central tendency is the percentile. Here we divide the whole sorted data set into 100 equal partitions; the pth percentile is the value below which p percent of the observations fall. So when we say someone is in the top 1 percentile, we mean their value sits above the cut that separates the top partition from the other 99.

The last measure of central tendency is the quartile. Now imagine a jar filled with various colours per unit length. Quartiles, as the name suggests, divide the entirety into quarters (1/4's). The cut after the first 1/4 of the whole is the first quartile, Q1 (the 25th percentile); add another 1/4 and that makes it 1/2, which is the second quartile, Q2 (the 50th percentile, i.e. the median); and adding another 1/4 makes it 3/4, giving the third quartile, Q3 (the 75th percentile).
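Both quartiles and percentiles can be computed with `statistics.quantiles` (Python 3.8+); note that different tools interpolate the cut points slightly differently, so exact values can vary by method:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7]

# n=4 returns the three quartile cut points Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(data, n=4)
print(q1, q2, q3)  # 2.0 4.0 6.0 -- and Q2 is just the median

# n=100 returns 99 percentile cut points; index p-1 is the pth percentile
pct = statistics.quantiles(range(1, 101), n=100)
print(pct[89])     # the 90th percentile of 1..100
```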

Measures of variability

  • Range
  • Interquartile Range
  • Semi Quartile Range
  • Variance
  • Standard Deviation
  • Coefficient Of Variation

Next we have the measures of variability. Here we measure how different the values are from one another, that is, how spread out they are.

So let’s start with the range. What is the range? It is the difference between the highest and the lowest values of the data set, and it is the simplest measure of how spread out the values are.

Next, the IQR, or Inter Quartile Range. This is a measure that uses quartiles (a measure of central tendency). It is the difference between Q3 and Q1, and it represents the spread of the elements that make up the middle half of the data.
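A small sketch of how the two behave when there is an outlier in the data:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 100]   # 100 is an outlier

# Range: a single extreme value inflates it badly
print(max(data) - min(data))     # 99

# IQR: spread of the middle half, untouched by the outlier
q1, _, q3 = statistics.quantiles(data, n=4)
print(q3 - q1)                   # 4.0
```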

Now, if you notice, the range and the IQR do not consider all the elements in the data; they use only a few of the terms. As useful as they can be, they cannot be precise when we want every term to count. The next measure provides a solution for this issue.

Variance is the measure that takes every single term in a given data set into consideration. Variance measures the deviation of the terms, and in order to calculate a deviation we first need a reference value to deviate from. That reference is the mean of the data. Each value is compared to the mean, the differences are squared, and the average of those squared differences is the variance.

The next measure is the Standard Deviation. It is the square root of the variance, which brings the measure back to the same units as the terms of the data we are working with. Standard deviation and variance together are important measures of variability.
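A sketch of both, using the population versions (`pvariance`/`pstdev`; the sample versions `variance`/`stdev` divide by n - 1 instead of n):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
mu = statistics.mean(data)  # 5

# Population variance: the mean of the squared deviations from the mean
var = statistics.pvariance(data)
assert var == sum((x - mu) ** 2 for x in data) / len(data)  # == 4

# Standard deviation: the square root of the variance,
# back in the same units as the data itself
print(statistics.pstdev(data))  # 2.0
```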

The Coefficient of Variation (CV) is a unit-free measure of relative variation. It is the ratio of the standard deviation to the mean of the data, usually expressed as a percentage. It tells us how varied the data in a data set is, and because it is unit-free, it can also be used to compare the spread of different datasets.
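A sketch with made-up numbers: heights in cm and weights in kg have different units, so comparing their standard deviations directly is meaningless, but their CVs are comparable:

```python
import statistics

def cv_percent(values):
    """Coefficient of variation: std dev divided by mean, as a percentage."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

heights_cm = [160, 165, 170, 175, 180]   # toy data
weights_kg = [55, 60, 70, 80, 95]        # toy data

print(round(cv_percent(heights_cm), 1))  # 4.2
print(round(cv_percent(weights_kg), 1))  # 19.9 -- weights vary far more, relatively
```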

All of the above measures are also called UNIVARIATE measures, as they deal with one variable at a time.

But in real life we hardly ever work with data that has only one variable. When we work with data with two variables, it is called bivariate analysis, and when we work with more than two variables, it is called multivariate analysis.

Let us check out an important bivariate measure: correlation. Correlation measures the extent to which two variables are related to each other. There are various methods to find the correlation; the most used ones are Pearson’s correlation and Spearman’s correlation.

There is another measure called covariance. It measures the variation of one variable with respect to the other. Correlation is generally preferred over covariance, because covariance depends on the units and scales of the variables, while correlation is normalized to lie between -1 and 1.
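A minimal sketch of both (Pearson’s version, computed by hand here so it runs on any Python 3; the `statistics` module itself only gained `covariance`/`correlation` in 3.10):

```python
import statistics

def covariance(xs, ys):
    """Sample covariance: the average co-deviation of two variables."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def pearson(xs, ys):
    """Pearson correlation: covariance normalized to lie in [-1, 1]."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]                        # ys is exactly 2 * xs

print(covariance(xs, ys))                    # 5.0
print(covariance(xs, [10 * y for y in ys]))  # 50.0 -- covariance is scale-dependent
print(pearson(xs, ys))                       # ~1.0
print(pearson(xs, [10 * y for y in ys]))     # ~1.0 -- correlation is not
```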
