Basic Fundamentals Of Statistics Every Data Scientist Should Know.

Essential statistical principles to get started on your Data Science journey.

Prakhar Patel
5 min read · Mar 10, 2021

We are all aware of the enormous impact statistics has on data science in this data-driven world. Finding patterns, spotting trends, and making predictions are among the most significant steps in data science. Here we are going to discuss some basic terms from the statistical world that play a crucial role in statistical data analysis.

Data types and individuals in data

Individuals: Individuals are the people or objects included in the study. An individual is what the data describes. In a data table, each individual is typically represented by one row. Sometimes they are also called identifiers.

Variable: A variable is a characteristic of an individual that is recorded or measured, such as length, time, diameter, strength, weight, temperature, density, thickness, pressure, or height. In a data table, each variable is typically one column. Comparing variables across individuals lets you pick up trends for a particular individual or group of individuals.

There are mainly two data types:

Numeric data: This type of data is expressed with numbers and can be measured or counted. It is further divided into two subcategories: discrete (countable values, e.g., number of sales) and continuous (any value in a range, e.g., height). Examples: height, speed, age, weight, sales, cost, etc.

Categorical data: This is qualitative data in which each value belongs to one of a set of groups or categories. Examples: gender, age group, product category, educational level, etc.

Measures of Central Tendency

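Mean: It is the arithmetic average of a dataset: the sum of all values divided by the number of values.

Median: It is the middle number in a sorted list of numbers (ascending or descending). With an even number of values, it is the average of the two middle numbers. It can be more descriptive of a dataset than the mean when the data is skewed or contains outliers.

Mode: It is the most frequently observed value in a set of data.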
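As a minimal sketch, the three measures above can be computed with Python's built-in `statistics` module. The `sales` list is a made-up example, not data from this article.

```python
import statistics

# Hypothetical sample: daily sales counts (made up for illustration)
sales = [12, 15, 15, 18, 20, 22, 95]

mean_sales = statistics.mean(sales)      # sum of the values / number of values
median_sales = statistics.median(sales)  # middle value of the sorted data
mode_sales = statistics.mode(sales)      # most frequently observed value

print("mean:", mean_sales)
print("median:", median_sales)
print("mode:", mode_sales)
```

Note how the single large value (95) lifts the mean well above the median, which is why the median is often the better summary for skewed data.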

Measures of Variability

Range: The difference between the highest and lowest value in a dataset.

$$\mathrm{Range}(X) = \max(X) - \min(X)$$

where X = dataset.

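Variance (σ^2): It is a measure of how spread out a dataset is. It is the average of the squared deviations of each value from the mean.

Standard Deviation (σ): It is a measure of the dispersion of a set of data from its mean. It is the square root of the variance, so it is expressed in the same units as the data.

Z-score: A Z-score is a numerical measurement that describes a value's relationship to the mean of a group of values. It is measured in standard deviations from the mean: a Z-score of 0 means the value equals the mean, and a Z-score of 2 means it lies two standard deviations above it.

IQR (interquartile range): The interquartile range measures the spread of the "middle fifty" percent of a dataset. It is the difference between the third quartile (Q3) and the first quartile (Q1).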
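The measures of variability above can be sketched in Python as follows. This is an illustration under assumptions: the `data` list is made up, the population (not sample) formulas are used for variance and standard deviation, and quartiles are taken from `statistics.quantiles` with its default method. Other tools may use a different quartile convention and report a slightly different IQR.

```python
import statistics

# Hypothetical dataset (made up for illustration)
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: highest value minus lowest value
data_range = max(data) - min(data)

# Population variance: average squared deviation from the mean
pop_variance = statistics.pvariance(data)

# Population standard deviation: square root of the variance
pop_std = statistics.pstdev(data)

# Z-score of each value: distance from the mean in standard deviations
mean = statistics.mean(data)
z_scores = [(x - mean) / pop_std for x in data]

# Quartiles and IQR (default quartile method of statistics.quantiles)
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print("range:", data_range)
print("variance:", pop_variance)
print("standard deviation:", pop_std)
print("z-scores:", z_scores)
print("IQR:", iqr)
```

For sample data, `statistics.variance` and `statistics.stdev` divide by n - 1 instead of n.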

Mean Absolute Deviation (MAD)

The mean absolute deviation of a dataset is the average distance between each data point and the mean. It gives us an idea about the variability in a dataset.

$$\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} \left| x_i - m(x) \right|$$

where,

n = number of data values

x_i = i-th data value in the dataset

m(x) = mean or average of the dataset
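A minimal sketch of the mean absolute deviation in plain Python. The function name `mean_absolute_deviation` and the sample data are chosen here for illustration only.

```python
def mean_absolute_deviation(values):
    """Average absolute distance between each value and the mean."""
    n = len(values)
    m = sum(values) / n  # m(x): mean of the dataset
    return sum(abs(x - m) for x in values) / n


# Hypothetical dataset (made up for illustration)
data = [2, 4, 4, 4, 5, 5, 7, 9]
mad = mean_absolute_deviation(data)
print("MAD:", mad)
```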

Kurtosis and Skewness

Kurtosis: It is the characteristic of being flat or peaked. It measures whether data is heavy-tailed or light-tailed relative to a normal distribution.

  • Mesokurtic: Distributions that are moderate in breadth, with a medium peak height, similar to the normal distribution.
  • Platykurtic: Fewer values in the tails and fewer values close to the mean (i.e., the curve has a flat peak and light tails).
  • Leptokurtic: More values in the distribution tails and more values close to the mean (i.e., sharply peaked with heavy tails).

Skewness: Skewness is a measure of the asymmetry of a distribution. A distribution is skewed if the tail on one side of the mode is fatter or longer than on the other.

  • Positively skewed: The tail on the right side is longer than on the left side.
  • Negatively skewed: The tail on the left side is longer than on the right side.

Other features to look for in the shape of a distribution:

  • Cluster: A group of values that sit close together, away from other groups.
  • Outliers: A minority of values that lie far away from the majority.
  • Peaks: The values that occur most frequently, i.e., the highest points of the distribution.
  • Gaps: "Large" open spaces between some data points.

Outliers pull the mean toward them, while the median and mode are usually affected little or not at all.
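The effect of an outlier on the mean versus the median, mentioned above, can be sketched with a small example. The `incomes` values are hypothetical.

```python
import statistics

# Hypothetical incomes in thousands (made up for illustration)
incomes = [30, 32, 35, 38, 40]
with_outlier = incomes + [400]  # add one extreme value

mean_before = statistics.mean(incomes)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(incomes)
median_after = statistics.median(with_outlier)

# The mean jumps sharply; the median barely moves
print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```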

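Measurements of Relationships between Variables

Covariance:

Covariance describes how two random variables or samples change together. A positive covariance means the variables tend to increase together, and a negative covariance means one tends to decrease as the other increases. Its magnitude depends on the units of the variables, so it is hard to compare across datasets.

Covariance can be calculated as,

  1. Population Covariance Formula:

$$\mathrm{Cov}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)$$

  2. Sample Covariance Formula:

$$\mathrm{Cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$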
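A minimal sketch of both covariance variants in plain Python. The function name `covariance`, its `sample` flag, and the `hours`/`scores` data are assumptions made for this illustration.

```python
def covariance(xs, ys, sample=True):
    """Covariance of two equal-length sequences.

    sample=True divides by n - 1 (sample covariance),
    sample=False divides by n (population covariance).
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of paired deviations from the means
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


# Hypothetical data: hours studied vs. exam score (made up)
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 68, 74]

sample_cov = covariance(hours, scores, sample=True)
population_cov = covariance(hours, scores, sample=False)
print("sample covariance:", sample_cov)
print("population covariance:", population_cov)
```

A positive result here indicates that scores tend to rise as hours rise.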

Correlation:

Correlation is a statistical measure that indicates how strongly two variables are linearly related. It is the covariance rescaled by the standard deviations of the two variables, so it always lies between -1 and +1 and has no units.

Correlation can be calculated as,

$$r = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

I have already covered this topic in detail in a separate article. Please go and check it out.
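As a minimal sketch, the Pearson correlation coefficient can be computed in plain Python like this. The function name `pearson_correlation` and the sample data are assumptions for illustration. The 1/n (or 1/(n - 1)) factors cancel between numerator and denominator, so they are omitted.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of std devs."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    co_dev = sum(a * b for a, b in zip(dx, dy))  # proportional to covariance
    sq_dev_x = sum(a * a for a in dx)            # proportional to variance of x
    sq_dev_y = sum(b * b for b in dy)            # proportional to variance of y
    return co_dev / math.sqrt(sq_dev_x * sq_dev_y)


# Hypothetical data: hours studied vs. exam score (made up)
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 68, 74]

r = pearson_correlation(hours, scores)
print("correlation:", r)
```

A value close to +1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little or no linear relationship.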

Moments

Moments describe different aspects of the nature and shape of a distribution. The first moment is the mean, the second central moment is the variance, the third standardized moment is the skewness, and the fourth standardized moment is the kurtosis.
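A minimal sketch of how skewness and kurtosis follow from moments, using population (divide by n) formulas. The function names and the sample data are assumptions for illustration. Statistical libraries often apply bias corrections or report excess kurtosis (kurtosis minus 3), so their numbers may differ.

```python
def central_moment(values, k):
    """k-th central moment: average of (x - mean) ** k."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** k for x in values) / n


def skewness(values):
    """Third standardized moment (0 for a symmetric distribution)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth standardized moment (3 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


# Hypothetical dataset (made up for illustration)
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = central_moment(data, 2)  # second central moment
skew = skewness(data)               # positive: longer right tail
kurt = kurtosis(data)
print("variance:", variance)
print("skewness:", skew)
print("kurtosis:", kurt, "excess kurtosis:", kurt - 3)
```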

Thanks for Reading


Prakhar Patel

Hello, I’m a computer student passionate about data science. I believe the best way to broaden our knowledge is to share it with people.