Unleashing the Power of Sampling: Exploring the Central Limit Theorem

Prakhar Patel
7 min read · Jun 25, 2023

“Facts are stubborn things, but statistics are pliable”

We all know that mathematics and statistics play an essential role in the fields of Data Science and Machine Learning. Here we are going to learn some intermediate statistical concepts that data professionals use to solve real-life problems, often with remarkable results. Before reading this one, I would highly recommend visiting my following article for a better understanding:

Normal Distribution

In probability theory and statistics, the Normal Distribution, also called the Gaussian Distribution, is the most important continuous probability distribution. It is often described as a bell curve.

  • The Normal Distribution represents how data are shaped or spread when they are plotted on a frequency diagram or a histogram.
  • The arithmetic mean, median, and mode are all equal and located at the middle of the distribution, its highest point or peak.
  • The Normal Distribution has two primary parameters: the mean (μ) and the standard deviation (σ). The distribution falls symmetrically around the mean, and its width is determined by the standard deviation.
  • It is denoted by X ~ N(μ, σ²), where N stands for Normal Distribution.

Normal Distribution Formula

The probability density function of the normal (Gaussian) distribution is given by:

f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}

Where,

μ = Mean
σ = Standard Deviation
f(x) = Probability Density Function
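To see the formula in action, here is a minimal sketch (assuming SciPy is available alongside NumPy) that evaluates the density by hand and checks it against scipy.stats.norm.pdf:

import numpy as np
from scipy.stats import norm

mu, sigma = 5.0, 2.0   # example parameters, chosen arbitrarily
x = 6.5                # point at which to evaluate the density

# Evaluate the Gaussian PDF manually, term by term
manual = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# SciPy's built-in implementation for comparison
builtin = norm.pdf(x, loc=mu, scale=sigma)

print(manual, builtin)  # the two values should agree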

Standard Normal Distribution

The standard normal distribution, also called the z-distribution, is a special normal distribution where the mean is 0 and the standard deviation is 1.
It is also denoted by: X ~ N(μ, σ²) → Z ~ N(0, 1)
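As a quick illustration (a sketch, again assuming SciPy), the z-distribution lets us read probabilities directly from z-scores:

from scipy.stats import norm

# Probability that a standard normal variable falls below z = 1.96
print(norm.cdf(1.96))              # about 0.975

# Probability of landing within one standard deviation of the mean
print(norm.cdf(1) - norm.cdf(-1))  # about 0.683, the familiar 68% rule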


Standardization

Standardization means putting different variables on the same scale so that they can be compared. To standardize an observed value (x), subtract the mean (μ) and divide the difference by the standard deviation (σ). This gives a standard score (z-score). A standard score tells you how many standard deviations an observation is away from the mean.
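In symbols:

z = \frac{x - \mu}{\sigma}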

  • A positive z-score means that your x-value is greater than the mean.
  • A negative z-score means that your x-value is less than the mean.
  • A z-score of zero means that your x-value is equal to the mean.

Why do we need standardization?

  • When we need to compare data that come from various sources, the comparison can be misleading because not all data follow the same distribution: some data are normally distributed, some binomially, some uniformly. Standardization is therefore used to rescale the features onto a common scale.
  • Rescaling gives the data a mean of 0 and a standard deviation of 1; when the underlying data are normally distributed, the result is the standard normal distribution.

For example,
Suppose two students, A and B, take math exams organized by two different organizations. Student A scores 100/150 and Student B scores 500/700. We cannot directly compare 100 and 500 marks because the exams use different scales (different maximum marks). This is where standardization helps: after standardizing, the two scores can be compared directly.
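Here is a minimal sketch of that comparison in Python. Note that the class means and standard deviations below are invented purely for illustration; the example above does not specify them.

# Hypothetical exam statistics -- these numbers are assumptions for illustration
mean_a, std_a = 80, 15    # exam taken by Student A (out of 150)
mean_b, std_b = 420, 50   # exam taken by Student B (out of 700)

score_a, score_b = 100, 500

# Standardize both scores onto the same scale
z_a = (score_a - mean_a) / std_a   # (100 - 80) / 15  = 1.33
z_b = (score_b - mean_b) / std_b   # (500 - 420) / 50 = 1.60

print(f"Student A: z = {z_a:.2f}")
print(f"Student B: z = {z_b:.2f}")
# The higher z-score means Student B did better relative to their peers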

Here are a few additional points to further elaborate on the need for standardization:

  1. Eliminating scale differences: Standardization eliminates scale differences, ensuring meaningful and accurate comparisons across datasets. It rescales variables to a common scale, mitigating bias and enabling unbiased interpretations.
  2. Addressing variable units: Standardization takes into account varying units, enabling comparisons based on relative positions as opposed to absolute numbers. Even when comparing variables with various unit measurements, it maintains fair comparisons. As a result, data can be analyzed and interpreted in a more meaningful way.
  3. Mitigating the impact of outliers: Outliers have a substantial impact on data processing and interpretation. By putting values in perspective relative to the mean and standard deviation, outliers become easier to recognize, since an extreme value shows up as a large |z| (see the sketch after this list). This makes it easier to handle them appropriately and keeps statistical analyses more reliable.
  4. Enhancing model performance: Standardizing data aligns with the assumptions of statistical and machine learning methods, enhancing performance and prediction accuracy.
  5. Interpreting coefficients and the significance of characteristics: Standardized coefficients in regression models enable a fair assessment of the influence of various features on the outcome variable. Regardless of their starting scales, we can evaluate the relative importance of each variable by standardizing the predictors.
  6. Dimensionality reduction: Standardization is important for efficient dimensionality reduction in various data analysis techniques, including principal component analysis (PCA) and clustering algorithms. By standardizing the variables, we make sure that each characteristic makes a balanced contribution to the study and avoid having one dominant variable overshadow other variables.
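The sketch below (plain NumPy, with made-up data) shows two of these ideas at once: rescaling a feature to mean 0 and standard deviation 1, and flagging outliers by their z-scores, here using |z| > 2 as a simple rule of thumb.

import numpy as np

# Made-up feature values containing one obvious outlier
values = np.array([12.0, 15.0, 14.0, 13.5, 16.0, 14.5, 95.0])

# Standardize: subtract the mean, divide by the standard deviation
z_scores = (values - values.mean()) / values.std()

print(np.round(z_scores, 2))          # rescaled to mean 0, std 1
print(values[np.abs(z_scores) > 2])   # flags 95.0 as an outlier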

Central Limit Theorem

Introduction:
In statistics, the Central Limit Theorem (CLT) is a crucial concept that helps us understand how sample means behave. It states that, under fairly general conditions, when we add together many independent random variables, their suitably normalized sum tends to follow a normal distribution. In this blog post, we will dive into the important implications of the Central Limit Theorem, its relevance in everyday life, and how it unlocks the potential of sampling for statistical inference.

In plain terms, the Central Limit Theorem states that no matter what the distribution of the population is, if you repeatedly sample batches of data from that distribution (with replacement) and take the mean of each batch, the resulting batch means will be approximately normally distributed.

Exploring the Significance of the Central Limit Theorem:

  1. Making Estimates: The Central Limit Theorem enables us to estimate population parameters, such as the average or proportion, using sample statistics. By calculating the mean and standard deviation of a sample, we can make predictions about the larger population.
  2. Hypothesis Testing: The Central Limit Theorem forms the basis for hypothesis testing. It helps us compare sample means to population parameters and determine if observed differences are statistically significant or due to chance.
  3. Everyday Surveys: When conducting surveys or opinion polls, the Central Limit Theorem allows us to draw meaningful conclusions about a population by sampling a smaller group. By ensuring randomness and independence in the sampling process, we can trust the reliability of the survey results.
  4. Quality Control: In quality control processes, the Central Limit Theorem is essential for monitoring continuous measurements. By examining sample means and analyzing control charts, organizations can detect deviations from the expected normal distribution, which could indicate potential quality issues.
Figure (source: Wikipedia): whatever the form of the population distribution, the sampling distribution of the mean tends to a Gaussian, and its dispersion is given by the central limit theorem.

As the sample size increases, the Central Limit Theorem states that the distribution of sample means will approach a normal distribution, regardless of the shape of the original distribution. This convergence occurs because larger samples provide more information about the underlying population, smoothing out the effects of individual extreme observations. Therefore, when dealing with strongly non-normal distributions, a larger sample size is required to achieve a reliable approximation to the normal distribution.

Demonstration of Central Limit Theorem

1. First, generate a dataset (the population) of 10,000 random values following an exponential distribution, which is strongly skewed and clearly non-normal.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Generate random data from a non-normal distribution
population = np.random.exponential(scale=2, size=10000)

2. Now let’s look at what the distribution of the population looks like.

# Plot the original population distribution
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.hist(population, bins=30, density=True, alpha=0.7)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Original Population Distribution')
sns.kdeplot(population)

3. Generate sample means from random samples

# Number of samples and sample size
num_samples = 1000
sample_size = 50

# Initialize an array to store the sample means
sample_means = []

# Generate sample means from random samples (drawn with replacement)
for _ in range(num_samples):
    sample = np.random.choice(population, size=sample_size, replace=True)
    sample_mean = np.mean(sample)
    sample_means.append(sample_mean)

4. Now let’s look at what the distribution of sample means looks like.

# Plot the distribution of sample means
plt.subplot(1, 2, 2)
plt.hist(sample_means, bins=30, density=True, alpha=0.7)
plt.xlabel('Sample Mean')
plt.ylabel('Frequency')
plt.title('Distribution of Sample Means')
sns.kdeplot(sample_means)

plt.tight_layout()
plt.show()

Notice that the mean values of the samples we took from the population are approximately normally distributed, which is exactly what the central limit theorem predicts.

And as the sample size (n) increases, the distribution of the sample means moves even closer to a normal distribution.
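As one more check, the CLT also tells us how wide the distribution of sample means should be: its standard deviation (the standard error) is approximately the population standard deviation divided by √n. The sketch below reuses the population, sample_size, and sample_means variables from the code above.

import numpy as np

# Standard error predicted by the CLT: sigma / sqrt(n)
predicted_se = population.std() / np.sqrt(sample_size)

# Observed spread of the simulated sample means
observed_se = np.std(sample_means)

print(f"Predicted standard error: {predicted_se:.3f}")
print(f"Observed standard error:  {observed_se:.3f}")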

Conclusion

In conclusion, the central limit theorem and the normal distribution are essential statistical ideas that offer a strong framework for understanding data and drawing conclusions. Many statistical and machine learning models include assumptions about the normal distribution, which is characterized by its mean and standard deviation. According to the central limit theorem, the average (or sum) of many independent, identically distributed random variables tends to follow a normal distribution. This theorem is central to statistical inference, since it allows us to estimate population means, compute confidence intervals, and run hypothesis tests. By embracing these ideas, we can analyze data more effectively, make precise predictions, and reach meaningful conclusions about the world around us.

Thanks for Reading

If you like my work and want to support me…

  1. The BEST way to support me is by following me on Medium here.
  2. Be one of the FIRST to follow me on Instagram here.
  3. Follow me on LinkedIn here.
  4. Follow me on GitHub here.


Prakhar Patel

Hello, I’m a computer student passionate about data science. I believe the best way to broaden our knowledge is to share it with people.