# Basic Fundamentals Of Statistics Every Data Scientist Should Know.

## Essential statistical principles to get started on your Data Science journey.

We are all aware of the enormous **impact of statistics** on data science in today's data-driven world. Finding **patterns and trends** and **making predictions** are among the most significant steps in **data science**. Here we will discuss some basic statistical terms that play a crucial role in statistical data analysis.

# Data types and individuals in data

**Individuals:** Individuals are the people or objects included in the **study**. An individual is what the **data** describes. In a table, each **individual** is represented by one row. Sometimes they are also called identifiers.

**Variable:** A variable is a piece of **information about individuals** that is acquired through measurement, such as length, time, diameter, strength, weight, temperature, density, thickness, pressure, or height. With variables, you can pick up trends for a particular individual or a group of individuals.

There are mainly two data types:

**Numeric data:** This type of data is expressed with numbers and can be measured. It is further divided into two subcategories: **discrete** (countable values, such as a number of sales) and **continuous** (values on a continuous scale, such as height). Examples: height, speed, age, weight, sales, cost.

**Categorical data:** This is qualitative data that is divided into groups or categories. Examples: gender, age group, product category, educational level.

# Measures of Central Tendency

**Mean:** It is the arithmetic average of a dataset: the sum of the values divided by the number of values.

**Median:** It is the middle number in a list of numbers sorted in ascending or descending order. When the data contains extreme values, it can be more descriptive of the dataset than the mean.

**Mode:** It is the most frequently observed value in a set of data.
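As a quick illustration of the three measures, here is a minimal sketch using Python's built-in `statistics` module. The `sales` list is made-up data, used only for this example.

```python
import statistics

# Hypothetical sample: daily sales counts (made-up numbers for illustration).
sales = [12, 15, 11, 15, 18, 15, 40]

mean_value = statistics.mean(sales)      # sum of values / number of values
median_value = statistics.median(sales)  # middle value of the sorted data
mode_value = statistics.mode(sales)      # most frequently observed value

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
```

Notice that the single large value (40) pulls the mean above the median, which is why the median is often preferred for data with extreme values.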

# Measures of Variability

**Range:** The difference between the highest and lowest values in a dataset.

$$\text{Range}(X) = \max(X) - \min(X)$$

where X is the dataset.

**Variance (σ^2):** It is a measure of how spread out a dataset is, computed as the average squared deviation from the mean.

**Standard Deviation (σ):** It is a measure of the dispersion of a dataset from its mean, computed as the square root of the variance.
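The range, variance, and standard deviation can be sketched with the standard library as follows. The `data` list is hypothetical. This example assumes the population versions (dividing by n) for σ^2 and σ, and also shows the sample variance (dividing by n - 1) for comparison.

```python
import statistics

# Hypothetical dataset, chosen so the results come out as round numbers.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)           # highest value minus lowest value
pop_variance = statistics.pvariance(data)    # population variance (divide by n)
pop_std = statistics.pstdev(data)            # population standard deviation
sample_variance = statistics.variance(data)  # sample variance (divide by n - 1)

print("range:", data_range)
print("population variance:", pop_variance)
print("population standard deviation:", pop_std)
print("sample variance:", sample_variance)
```

Which version to use depends on whether your data is the whole population or only a sample drawn from it.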

**Z-score:** A **Z-score** is a numerical measurement that describes a value’s relationship to the mean of a group of **values**. It is measured in terms of standard deviations from the mean.
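In formula terms, the Z-score of a value x is z = (x - μ) / σ. Below is a minimal sketch of that calculation. The helper name `z_scores` and the data are hypothetical, and the population standard deviation is assumed.

```python
import statistics

def z_scores(values):
    """Return the Z-score of every value: (x - mean) / standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    return [(x - mu) / sigma for x in values]

# Hypothetical dataset with mean 5 and standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]
scores = z_scores(data)

for value, z in zip(data, scores):
    print(value, "->", round(z, 2))
```

A Z-score of 0 means the value equals the mean, while a Z-score of 2 means the value lies two standard deviations above it.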

**IQR (interquartile range):** The **interquartile range** measures the spread of the “middle fifty” percent of a dataset. It is the difference between the third quartile (Q3) and the first quartile (Q1).
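Here is a small sketch of the IQR calculation using `statistics.quantiles` (available in Python 3.8+). The data is made up, and the `"inclusive"` method is just one of several quartile conventions, so other tools may report slightly different quartiles for the same data.

```python
import statistics

# Hypothetical sorted dataset.
data = [1, 3, 5, 7, 9, 11, 13, 15, 17]

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1  # spread of the middle 50% of the data

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```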

# Mean Absolute Deviation (MAD)

The mean absolute deviation of a dataset is the average distance between each data point and the mean. It gives us an idea about the variability in a dataset.

$$\text{MAD} = \frac{1}{n} \sum_{i=1}^{n} \left| x_i - m(x) \right|$$

**where**,

**n** = number of data values

**x_i** = the i-th data value in the dataset

**m(x)** = mean (average) of the dataset
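The definition above translates almost directly into code. This is a minimal sketch; the function name `mean_absolute_deviation` and the data are hypothetical.

```python
import statistics

def mean_absolute_deviation(values):
    """Average absolute distance between each data point and the mean."""
    m = statistics.mean(values)                            # m(x)
    return sum(abs(x - m) for x in values) / len(values)   # (1/n) * sum |x_i - m(x)|

# Hypothetical dataset with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
mad = mean_absolute_deviation(data)

print("MAD:", mad)
```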

# Kurtosis and Skewness

**Kurtosis:** It is often described as how flat or peaked a distribution is. More precisely, it is a measure of whether the data is heavy-tailed or light-tailed relative to a normal distribution.

**Mesokurtic:** Distributions that are moderate in breadth, with curves of medium peak height.

**Platykurtic:** Fewer values in the tails and fewer values close to the mean (i.e., the curve has a flat peak).

**Leptokurtic:** More values in the distribution tails and more values close to the mean (i.e., sharply peaked with heavy tails).
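One common way to put a number on this is excess kurtosis: the fourth central moment divided by the squared variance, minus 3 (the kurtosis of a normal distribution). Values near 0 suggest a mesokurtic shape, negative values a platykurtic shape, and positive values a leptokurtic shape. The sketch below assumes the population formula, and the function name and datasets are hypothetical.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment (variance)
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

# Hypothetical datasets for illustration.
flat = [1, 2, 3, 4, 5]                          # evenly spread, light tails
heavy_tailed = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]  # most values at the mean, two far out

print("flat:", excess_kurtosis(flat))                  # negative -> platykurtic
print("heavy-tailed:", excess_kurtosis(heavy_tailed))  # positive -> leptokurtic
```

Statistical libraries often apply small-sample corrections, so their results may differ slightly from this plain formula.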

**Skewness:** Skewness is a measure of the asymmetry of a distribution. A distribution is skewed if the tail on one side of the mode is fatter or longer than on the other.

**Positively skewed:** The tail on the right side is longer than on the left side.

**Negatively skewed:** The tail on the left side is longer than on the right side.
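The sign of the skewness coefficient tells you the direction of the longer tail. Here is a minimal sketch that assumes the population formula (third central moment divided by the cube of the standard deviation). The function name and datasets are hypothetical.

```python
def skewness(values):
    """Population skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment (variance)
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical datasets for illustration.
right_tail = [1, 2, 2, 3, 3, 3, 10]      # one large value stretches the right tail
left_tail = [-x for x in right_tail]     # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

print("right tail:", skewness(right_tail))  # positive
print("left tail:", skewness(left_tail))    # negative
print("symmetric:", skewness(symmetric))    # zero
```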

**Cluster:** A group of values that sit close together, away from other groups.

**Outliers:** A minority of values that lie far away from the majority of the data.

Outliers have little or no effect on the median and mode, but they can shift the mean of a distribution substantially.
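To see this in action, the sketch below adds a single extreme value to a small dataset and compares how far the mean and the median move. The salary figures are made up for illustration.

```python
import statistics

# Hypothetical salaries in thousands (made-up numbers).
salaries = [30, 32, 35, 36, 38]
with_outlier = salaries + [400]  # add one extreme value

mean_shift = statistics.mean(with_outlier) - statistics.mean(salaries)
median_shift = statistics.median(with_outlier) - statistics.median(salaries)

print("mean moved by:", round(mean_shift, 2))
print("median moved by:", median_shift)
```

The mean jumps by roughly 61, while the median moves by only 0.5.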

**Peaks:** The highest points of the distribution curve, i.e. the most frequent values.

**Gaps:** The “large” open spaces between some data points.

# Measurements of Relationships between Variables

## Covariance

Covariance describes the relationship between two random variables or samples: how they change together. In other words, covariance is a measure of how much two random variables fluctuate together. A positive value means they tend to move in the same direction, and a negative value means they tend to move in opposite directions.

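**Covariance can be calculated as:**

- Population covariance formula:

  $$\text{Cov}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)$$

- Sample covariance formula:

  $$\text{Cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$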
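Both versions of covariance can be sketched in a few lines of plain Python. The function name `covariance`, its `sample` flag, and the data are hypothetical and only serve this example.

```python
def covariance(xs, ys, sample=True):
    """Covariance of two equal-length sequences.

    sample=True divides by n - 1 (sample covariance);
    sample=False divides by n (population covariance).
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

# Hypothetical data: hours studied and quiz scores.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

print("sample covariance:", covariance(hours, scores))
print("population covariance:", covariance(hours, scores, sample=False))
```

The positive result indicates that the two variables tend to increase together. Its size depends on the units of the data, which is one reason correlation is often easier to interpret.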

## Correlation

Correlation is a statistical measure that indicates how strongly two variables are linearly related, that is, the extent to which they fluctuate together. Its value ranges from -1 to +1.

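**Correlation can be calculated as:**

$$r = \frac{\text{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

where σ_X and σ_Y are the standard deviations of X and Y. This is the Pearson correlation coefficient, the most commonly used measure of linear correlation.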
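Here is a minimal sketch of the Pearson correlation coefficient, assuming that is the measure intended here. The function name `pearson_correlation` and the data are hypothetical.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n (or 1/(n-1)) factors cancel between numerator and denominator,
    # so plain sums of deviations are enough.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Hypothetical data: hours studied and quiz scores.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

r = pearson_correlation(hours, scores)
print("correlation:", round(r, 3))
```

Unlike covariance, the result has no units, so values from different datasets can be compared directly.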

I have already published this topic in detail. Please go and check it out.

# Moments

**Moments** describe different aspects of the nature and shape of a distribution. The first moment is the **mean**, the second (central) moment is the **variance**, the third (standardized) moment is the **skewness**, and the fourth (standardized) moment is the **kurtosis**.
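For reference, the four moments can be written out as follows. This is the usual textbook convention (the mean as a raw moment, the variance as a central moment, and skewness and kurtosis as standardized moments), stated here as an assumption about which definitions are meant. E[·] denotes the expected value.

$$
\begin{aligned}
\text{Mean:}\quad & \mu = E[X] \\
\text{Variance:}\quad & \sigma^2 = E\left[(X - \mu)^2\right] \\
\text{Skewness:}\quad & E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right] \\
\text{Kurtosis:}\quad & E\left[\left(\frac{X - \mu}{\sigma}\right)^4\right]
\end{aligned}
$$

Under this convention a normal distribution has a skewness of 0 and a kurtosis of 3.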
