
A Simple Guide to Data Distribution in Statistics and Data Science

A data distribution is described by a mathematical function that lets us calculate the probability of any observation in the data space. Distributions have many uses in statistics and data science; for example, a distribution can describe how the observations in a dataset are grouped. It is one of the core topics in statistics and helps us understand data better. In this article, we will discuss the basics of statistical data distributions through the following points.

Table of Contents

  • What is distribution?
  • What is Density Function?
  • Types of Distribution in Statistics
  1. Gaussian Distribution
  2. Student T-distribution
  3. Chi-squared Distribution
  4. Bernoulli Distribution
  5. Binomial Distribution
  6. Poisson Distribution
  7. Exponential Distribution
  8. Gamma Distribution

What is Distribution?

We can think of distribution as a function that can be used to describe the relationship between the data points in their sample space.

We can use a continuous variable like age to understand this term: the age of an individual is an observation in the sample space, and 0 to 50 is the extent of the data space. The distribution is a mathematical function that tells us how observations of different ages relate to one another, for example how likely each age is.

Generated data often follows a well-known mathematical function such as the Gaussian distribution. Such functions can usually be fitted to the data by adjusting their parameters. Once fitted, a distribution function can be used to describe and predict related quantities and the relationship between the domain and the observations.
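To make the idea of fitting a function by adjusting its parameters concrete, here is a minimal sketch. The sample of ages is hypothetical (generated for illustration, not taken from this article), and the fit uses scipy's maximum-likelihood estimate for the Gaussian.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 5,000 ages drawn from a Gaussian with mean 25 and std 5.
rng = np.random.default_rng(seed=0)
ages = rng.normal(loc=25.0, scale=5.0, size=5000)

# norm.fit returns maximum-likelihood estimates of the
# location (mean) and scale (standard deviation).
fitted_mean, fitted_std = stats.norm.fit(ages)
print(f"fitted mean = {fitted_mean:.2f}, fitted std = {fitted_std:.2f}")
```

The fitted parameters should land close to the values used to generate the sample, which is what "fitting the distribution to the data" means in practice.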

What is Density Function?

The distribution of data points can be described by their density or density function. Using this function, we describe how the proportion of data changes over the range of the distribution. There are mainly two types of density functions:

  • Probability Density Function (PDF): Using this function, we can calculate the likelihood of a given value or observation in its distribution space. Often we summarise this for all observations across the distribution space. When we plot this function, we get the shape of the distribution, and from the plot we can often tell which type of distribution the data follows.
  • Cumulative Density Function (CDF), more commonly called the cumulative distribution function: Using this function, we can calculate the cumulative likelihood of a given value or observation in the sample space, that is, the probability of observing a value less than or equal to it. We obtain it by accumulating the probability density function over all prior values in the sample space. By plotting this function, we can see how much of the data lies before and after a given value. Its values always range from 0 to 1.

Note that both of these functions are defined for continuous data. For discrete data, the probability mass function (PMF) plays the role of the probability density function.
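The relationship between the two functions can be checked numerically. The sketch below is only an illustration: it approximates the CDF of a Gaussian by accumulating its PDF on a fine grid, and does the same with a PMF for a discrete case. The grid, the Gaussian parameters and the choice of the binomial as the discrete example are assumptions made for this example.

```python
import numpy as np
from scipy import stats

# Continuous case: the CDF is the running integral of the PDF.
grid = np.linspace(-10.0, 10.0, 20001)
step = grid[1] - grid[0]
pdf = stats.norm.pdf(grid, loc=0.0, scale=2.0)
cdf_from_pdf = np.cumsum(pdf) * step  # simple Riemann-sum approximation
cdf_exact = stats.norm.cdf(grid, loc=0.0, scale=2.0)
max_gap = np.max(np.abs(cdf_from_pdf - cdf_exact))
print(f"largest gap between accumulated PDF and CDF: {max_gap:.6f}")

# Discrete case: the CDF is the running sum of the PMF.
counts = np.arange(0, 11)
pmf = stats.binom.pmf(counts, 10, 0.5)
cdf_from_pmf = np.cumsum(pmf)
print(np.allclose(cdf_from_pmf, stats.binom.cdf(counts, 10, 0.5)))
```

The small gap in the continuous case comes from the step size of the grid; a finer grid shrinks it further.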

Now that we have covered distributions and density functions, let’s take a look at the different types of distribution.

Types of Distribution

This article covers the following eight types of distribution:

Gaussian Distribution

This is the distribution most commonly found in real-world data, which is why it is also called the normal distribution. It is named after Carl Friedrich Gauss and is central to the field of statistics.

The following two parameters define the Gaussian distribution:

  • Mean: the central value of the distribution, around which the observations are centred.
  • Variance: a quantity that measures the spread of the observations around the mean.

In practice, we often describe the spread using the standard deviation, the square root of the variance, which is expressed in the same units as the observations.

Using the code below, we can evaluate the density of a normal distribution over a range of values; plotting it gives us a perfect example of the normal, or Gaussian, bell curve.

#importing libraries

import numpy as np

import matplotlib.pyplot as plt

from scipy import stats

#making data

array = np.arange(-10,10,0.001)

data = stats.norm.pdf(array, 0.0, 2.0 )

#plotting probability density function

plt.plot(array, data)

Output:

Here the mean of the data is zero, and the standard deviation is two. For the same distribution, we can also plot the cumulative density function.

CDF = stats.norm.cdf(array, 0.0, 2.0 )

plt.plot(array, CDF)

Output:

As the plot shows, 50% of the data lies below the mean (0), which is what we expect given the parameters set in the code.
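The 50% figure can also be read directly from the CDF rather than from the plot. The following sketch reuses the mean and standard deviation from the example above; the one- and two-standard-deviation checks are an extra illustration, not something computed in the original code.

```python
from scipy import stats

mean, std = 0.0, 2.0

# Share of the distribution that lies below the mean.
below_mean = stats.norm.cdf(mean, mean, std)

# Share within one and within two standard deviations of the mean.
within_one_std = stats.norm.cdf(mean + std, mean, std) - stats.norm.cdf(mean - std, mean, std)
within_two_std = stats.norm.cdf(mean + 2 * std, mean, std) - stats.norm.cdf(mean - 2 * std, mean, std)

print(f"below the mean: {below_mean:.3f}")
print(f"within 1 std:   {within_one_std:.3f}")
print(f"within 2 std:   {within_two_std:.3f}")
```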

Student T-distribution

This distribution arises when we estimate the mean of a normal distribution from samples of different sizes. It is also called the t-distribution. It is helpful for describing the error in estimating population statistics for data drawn from a Gaussian distribution when the sample size must be taken into account.

The t-distribution is described by a single parameter, the number of degrees of freedom, which is the number of independent pieces of information used to estimate a population quantity.

For example, when n observations are used to calculate the mean of the data, the t-distribution for that estimate has n - 1 degrees of freedom, because one degree of freedom is used up by the sample mean when the standard deviation is estimated.

To calculate an observation in a t-distribution, we start from observations in the Gaussian distribution, so that we can define an interval for the population mean of the normal distribution. Observations in the t-distribution can be calculated using the formula below:

t = (mean(X) - mu) / (S / sqrt(N))

Where,

X = observations from normal or Gaussian distributed data.

mu = population mean being estimated.

S = sample standard deviation of X.

N = total number of observations.
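As a rough sketch of how such a value is computed in practice, the example below assumes the usual one-sample form of the statistic, t = (mean(X) - mu) / (S / sqrt(N)), and compares a manual calculation with scipy's one-sample t-test. The sample and the hypothesised mean are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical small sample from a Gaussian with mean 0 and std 2.
rng = np.random.default_rng(seed=1)
sample = rng.normal(loc=0.0, scale=2.0, size=20)
hypothesised_mean = 0.0  # assumed population mean (mu)

n = len(sample)
s = sample.std(ddof=1)  # sample standard deviation
t_manual = (sample.mean() - hypothesised_mean) / (s / np.sqrt(n))

# The same statistic from scipy's one-sample t-test.
t_scipy = stats.ttest_1samp(sample, hypothesised_mean).statistic

dof = n - 1  # degrees of freedom of the matching t-distribution
print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}, dof = {dof}")
```

With only 20 observations the degrees of freedom are small, which is the situation where the t-distribution differs visibly from the Gaussian.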

We can calculate and plot the PDF and CDF for this distribution using the following lines of code.

Calculating Probability Density Function

DOF = len(array) - 1

PDF = stats.t.pdf(array, DOF)

plt.plot(array, PDF)

Output:

 

Calculating Cumulative Density Function

CDF = stats.t.cdf(array, DOF)

plt.plot(array, CDF)

Output:

Chi-Squared Distribution

This distribution helps describe the uncertainty of data drawn from a Gaussian distribution. The chi-squared test is one of the best-known statistical methods that uses it. The chi-squared distribution is also used in the derivation of the t-distribution. Like the t-distribution, it is described by its degrees of freedom.

An observation in this distribution can be calculated as the sum of the squares of k observations drawn from a standard Gaussian distribution. Mathematically,

Q = Z1^2 + Z2^2 + … + Zk^2

Where,

Z1, …, Zk are independent samples that are standard Gaussian distributed, and with the degree of freedom k, we can denote the chi-squared distribution as

Q ~ χ^2(k)

Again, as with the t-distribution, real-world data usually does not follow this distribution directly. Instead, observations are drawn from the chi-squared distribution as part of statistical calculations on Gaussian distributed data.
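The construction as a sum of squared Gaussian draws can be checked with a short simulation. This is only a sketch: the number of simulated rows and the seed are arbitrary choices, and the comparison uses the known mean (k) and variance (2k) of a chi-squared distribution with k degrees of freedom.

```python
import numpy as np
from scipy import stats

k = 10  # degrees of freedom
rng = np.random.default_rng(seed=2)

# Each row holds k standard Gaussian draws; squaring and summing a row
# gives one chi-squared observation.
z = rng.standard_normal(size=(100_000, k))
q = (z ** 2).sum(axis=1)

print(f"simulated mean = {q.mean():.3f}, theoretical mean = {stats.chi2(k).mean():.3f}")
print(f"simulated var  = {q.var():.3f}, theoretical var  = {stats.chi2(k).var():.3f}")
```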

We can calculate and plot the PDF and CDF for this distribution using the following lines of code.

array = np.arange(0,50,0.1)

DOF = 10

PDF = stats.chi2.pdf(array,DOF)

plt.plot(array,PDF)

Output:

Here we can see that the shape of the distribution is determined by the degrees of freedom (10 in this case), because each observation is the sum of that many squared random draws from the normal distribution. The curve has a single peak like a bell curve, but it is not symmetric: it is skewed to the right.

Calculating the cumulative density function:

CDF = stats.chi2.cdf(array,DOF)

plt.plot(array,CDF)

Output:

Here we can see that the curve approaches 1 slowly on the right, which reflects the long right tail of the distribution.

Bernoulli Distribution

This distribution comes into the picture when there are only two possible outcomes; it describes the probability of each outcome of an experiment that is performed only once.

Examples of such outcomes are success and failure, 0 and 1, or yes and no. To describe this distribution, we use only one parameter, the probability of success. Using the lines of code below, we can create a Bernoulli distribution.

p = 0.5

variable = stats.bernoulli(p)

fig, ax = plt.subplots(1, 1)

x = [0, 1]

ax.vlines(x, 0, variable.pmf(x), label='probability')

ax.legend(loc='best', frameon=False)

plt.show()

Output:

We can think of the above example as a coin flip, where either side can be treated as a success and the probability of success is 0.5.
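The coin-flip reading can be illustrated by simulating flips and comparing the observed share of successes with the parameter p. The number of flips and the random seed below are arbitrary assumptions made for this sketch.

```python
import numpy as np
from scipy import stats

p = 0.5  # probability of success (heads)

# Simulate 10,000 independent coin flips; each result is 0 or 1.
flips = stats.bernoulli(p).rvs(size=10_000, random_state=0)

share_of_heads = flips.mean()
print(f"share of heads = {share_of_heads:.3f}")
print(f"theoretical mean = {stats.bernoulli(p).mean()}, variance = {stats.bernoulli(p).var()}")
```

For a Bernoulli variable the mean is p and the variance is p(1 - p), so a fair coin has the largest possible variance, 0.25.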

Binomial Distribution

The Bernoulli distribution describes an experiment performed only once, whereas the binomial distribution models the number of successes in a repeated Bernoulli experiment. It focuses on the success count rather than on the outcome of a single trial.

This distribution can be described using two parameters:

  • Number of experiments
  • Probability of success

We can use the lines of code below to get an idea of the binomial distribution.

n = 10 #number of trials

p = 0.5 #probability of success

array = np.linspace(0, 10, 11)

fig, ax = plt.subplots(1, 1)

variable = stats.binom(n, p)

ax.vlines(array, 0, variable.pmf(array), label='probability')

ax.legend(loc='best', frameon=False)

plt.show()


Output:

We can compare the above example with flipping a coin ten times; the graph gives the probability of each possible number of successes out of 10.
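The probabilities in such a plot come from the standard binomial formula, P(k) = C(n, k) * p^k * (1 - p)^(n - k). The sketch below writes that formula out by hand (the helper name binomial_pmf is made up for this example) and compares it with scipy's values.

```python
from math import comb

import numpy as np
from scipy import stats

n, p = 10, 0.5  # ten coin flips, fair coin


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


counts = np.arange(0, n + 1)
manual = np.array([binomial_pmf(int(k), n, p) for k in counts])
library = stats.binom.pmf(counts, n, p)

print(np.allclose(manual, library))
print(manual[5])  # exactly five heads: 252 / 1024 = 0.24609375
```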

Poisson Distribution

This distribution brings time into the picture. So far, we have seen distributions for a single event and for a number of events; the Poisson distribution describes the number of events that occur in a fixed time period. A simple example is the number of vehicles passing through a toll booth in 1 hour.

Extending the example, there is an average number of vehicles passing through the toll booth per unit of time. To describe this distribution, we only need this one parameter, the average rate of events per unit time (called time in the code below). Let’s take a look at the code below:

time = 4

array = np.linspace(0, 15, 16)

fig, ax = plt.subplots(1, 1)

variable = stats.poisson(time)

ax.vlines(array, 0, variable.pmf(array), label='probability')

ax.legend(loc='best', frameon=False)

plt.show()

Output:

The example can be compared to a toll booth where, on average, four cars pass in every one-hour interval; the graph gives the probability of each possible count.
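The Poisson probabilities follow the standard formula P(k) = rate^k * exp(-rate) / k!. As a small sketch, the helper below (poisson_pmf is a name made up for this example) evaluates it for an average of four cars per hour and compares the result with scipy.

```python
from math import exp, factorial

import numpy as np
from scipy import stats

rate = 4  # average number of cars per hour


def poisson_pmf(k, rate):
    """Probability of exactly k events in one unit of time."""
    return rate**k * exp(-rate) / factorial(k)


counts = np.arange(0, 16)
manual = np.array([poisson_pmf(int(k), rate) for k in counts])
library = stats.poisson.pmf(counts, rate)

print(np.allclose(manual, library))
print(f"probability of at most 15 cars in an hour: {manual.sum():.6f}")
```

Unlike the binomial, the Poisson has no upper limit on the count, so the probabilities for 0 to 15 sum to slightly less than 1.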

Exponential Distribution

If, in the above distribution, we replace the number of events per unit time with the waiting time between events, we get the exponential distribution. We can describe it simply by inverting the rate parameter of the Poisson distribution.

import pandas as pd

time = 4

n_simulated = 10000000

random_waiting_times = stats.expon(scale = 1 / time).rvs(n_simulated)

pd.Series(random_waiting_times).hist(bins = 40)


Output:

Here the x-axis should be read as a fraction of the unit time. With 4 events expected per hour, the average waiting time between events is 1/4 of an hour.
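The link between the two distributions can be checked with a simulation: if waiting times between events are exponential with mean 1 / rate, then the number of events falling in each unit of time should behave like a Poisson count with mean equal to the rate. The sample size and seed below are arbitrary assumptions for this sketch.

```python
import numpy as np

rate = 4  # expected events per unit time
rng = np.random.default_rng(seed=3)

# Exponential waiting times between consecutive events.
waits = rng.exponential(scale=1 / rate, size=200_000)
arrival_times = np.cumsum(waits)

# Count events in each complete unit-length interval [0, 1), [1, 2), ...
n_intervals = int(arrival_times[-1])
events_per_interval = np.bincount(
    arrival_times.astype(int), minlength=n_intervals
)[:n_intervals]

print(f"mean waiting time = {waits.mean():.4f} (expected {1 / rate})")
print(f"mean events per interval = {events_per_interval.mean():.3f} (expected {rate})")
print(f"variance of events per interval = {events_per_interval.var():.3f}")
```

For a Poisson count the mean and the variance are equal, so both printed statistics for the interval counts should be close to the rate.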

Gamma Distribution

This distribution describes the waiting time until a given number of events has occurred. We can think of it as a generalisation of the exponential distribution, because in addition to the lambda (rate) parameter of the exponential distribution, it takes a parameter for the number of events to wait for.

As an example, consider a bus that waits to depart until it has filled up with passengers. Let’s take a look at the code below.

time = 4 # event rate of customers coming in per unit time

n_simulated = 10000000

waiting_times = stats.gamma(10, scale = 1 / time).rvs(n_simulated)

pd.Series(waiting_times).hist(bins = 20)

Output:

In the above graph, we can see that the peak is around 2.5, which means the waiting time for ten passengers to board the bus is about 2.5 times the unit time (2 minutes in this example).
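One way to see where this waiting time comes from is to build it from exponential waits: the time until the tenth passenger arrives is the sum of ten independent waiting times. The sketch below is an illustration under that assumption, with an arbitrary sample size and seed; it compares the simulated total with the gamma distribution used above.

```python
import numpy as np
from scipy import stats

rate = 4  # passengers arriving per unit time
k = 10    # number of passengers to wait for
rng = np.random.default_rng(seed=4)

# Each row holds k exponential waiting times; their sum is the total wait.
waits = rng.exponential(scale=1 / rate, size=(100_000, k))
total_wait = waits.sum(axis=1)

gamma_dist = stats.gamma(k, scale=1 / rate)
theoretical_mean = gamma_dist.mean()  # equals k / rate
theoretical_mode = (k - 1) / rate     # location of the density peak

print(f"simulated mean wait = {total_wait.mean():.3f}")
print(f"theoretical mean = {theoretical_mean:.3f}, mode = {theoretical_mode:.3f}")
```

With these parameters the mean is 2.5 units of time and the density peaks slightly earlier, at 2.25, which is hard to tell apart on a histogram with wide bins.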

Final words

In this article, we have discussed data distributions, density functions and the types of distribution found in data. In statistical and data science processes, these concepts come up in the early stages and help us learn more about the data we have. Since understanding the data domain and its statistics is an important task for a data scientist, the knowledge covered above helps us understand the statistics of our data.

About DSW

Data Science Wizards (DSW) is an Artificial Intelligence and Data Science start-up that primarily offers platforms, solutions, and services for making use of data as a strategy through AI and data analytics solutions and consulting services to help enterprises in data-driven decisions.

DSW’s flagship platform UnifyAI is an end-to-end AI-enabled platform for enterprise customers to build, deploy, manage, and publish their AI models. UnifyAI helps you to build your business use case by leveraging AI capabilities and improving analytics outcomes.

Connect with us at contact@datasciencewizards.ai and visit us at www.datasciencewizards.ai