Updated: Apr 6
In machine learning we work with sample data, which is a subset of a larger collection called the population. Depending on how the sample was drawn, a model built on it can carry sampling error. Because only a sample is observed, probability and statistics are used to estimate the actual population parameters and to keep the sampling error small. The mathematical description of how likely each value of a random variable is — which lets a sample stand in for the population — is called a probability distribution. By the end of this article you will have a good conceptual understanding of probability distributions.
While there are many important probability distributions, they can be broadly categorized into two types based on the random variable that generates them.
Discrete probability distributions if the random variable is discrete.
Continuous probability distributions if the random variable is continuous.
This article will discuss the most important and prevalent probability distributions used to build Machine Learning models.
One widely used probability distribution of a discrete random variable is the binomial distribution. The binomial distribution describes discrete (not continuous) data resulting from an experiment known as a Bernoulli process. A Bernoulli random variable has only two possible outcomes: success or failure (1 or 0). A binomial distribution is the sum of independent and identically distributed Bernoulli random variables.
For instance, tossing a fair coin a fixed number of times is a Bernoulli process, and the outcomes of such tosses can be represented by the binomial probability distribution. Let's assume the probability of getting heads on a coin toss is 'p'. Then the probability of getting tails is '1 - p'. For a single coin toss, the random variable that represents your win is a Bernoulli random variable. Now, what is the probability that you get exactly three heads in five tosses? That would require tossing the coin five times and getting exactly three heads and two tails, which occurs with probability:
C(5,3) * p^3 * (1−p)^2, where C(5,3) is the number of ways to choose which 3 of the 5 tosses come up heads
In general, if there are 'n' Bernoulli trials, then the sum of those trials is binomially distributed with parameters n and p. For the probability of 'r' successes out of 'n' trials, with success probability 'p' and failure probability 'q = 1 - p', the binomial formula says -
[n! / (r! (n-r)!)] * p^r * q^(n-r)
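The formula above can be checked with a short Python sketch using only the standard library (the function name binomial_pmf is my own, not from any particular package):

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n Bernoulli trials,
    each with success probability p: C(n, r) * p^r * (1-p)^(n-r)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# Worked example from the text: exactly 3 heads in 5 fair-coin tosses.
print(binomial_pmf(3, 5, 0.5))  # C(5,3) * 0.5^3 * 0.5^2 = 10/32 = 0.3125
```

Summing the PMF over r = 0..n gives 1, as it must for any probability distribution.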
Assumptions of a Bernoulli process are -
Each trial has only two possible outcomes
The probability of the outcome of any trial remains fixed over time. Every Bernoulli process has its own characteristic probability (represented as 'p').
Trials are statistically independent.
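Under these assumptions, the earlier claim that a binomial random variable is a sum of i.i.d. Bernoulli trials can be verified by simulation. A minimal sketch (the helper name simulate and its defaults are illustrative, and the random seed is fixed for reproducibility):

```python
import random

random.seed(42)

def simulate(n_tosses=5, p=0.5, trials=100_000):
    """Empirical probability of exactly 3 heads in n_tosses fair tosses,
    estimated by summing independent Bernoulli trials many times."""
    hits = 0
    for _ in range(trials):
        heads = sum(random.random() < p for _ in range(n_tosses))
        if heads == 3:
            hits += 1
    return hits / trials

est = simulate()
print(est)  # should land close to the analytic value C(5,3) / 2^5 = 0.3125
```

With 100,000 repetitions the empirical frequency agrees with the binomial formula to within about a tenth of a percent.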
A Poisson process is a model for a series of discrete events where the average time between events is known, but the exact timing of events is random. In other words, these events are randomly spaced. Examples that follow this kind of probability distribution are calls received by a customer care center, traffic movement, movements in a stock price, etc. While Poisson processes are often associated with time, they don't have to be: a classic example is trees per acre, which counts events per unit area rather than events per unit time, yet still follows a Poisson process.
We need the Poisson distribution to do interesting things like finding the probability of a given number of events in a time period, or the probability of waiting some time until the next event. The Poisson probability mass function gives the probability of observing 'x' events in a time period, given the length of the period and the average number of events per unit time:
P(x events in a time period) = (λ^x * e^(−λ)) / x!

where λ (the rate parameter) = events per time interval * length of the time period
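This PMF translates directly into Python (the function name poisson_pmf is mine, and the call-center numbers below are purely illustrative):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Probability of observing exactly x events when the expected
    number of events in the interval is lam: lam^x * e^(-lam) / x!."""
    return (lam ** x) * exp(-lam) / factorial(x)

# Illustrative example: a call center averaging 4 calls per hour —
# probability of receiving exactly 2 calls in a given hour.
print(poisson_pmf(2, 4))  # ≈ 0.1465
```

As with the binomial PMF, the probabilities over all non-negative counts x sum to 1.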
Assumptions of a Poisson process are -
Events are independent of each other. The occurrence of one event does not affect the probability that another event will occur.
The average rate (events per time period) is constant.
Two events cannot occur at the same time, i.e. the probability of getting more than one success in a sufficiently small interval is 0.
The last assumption means we can think of each small sub-interval of a Poisson process as a Bernoulli trial — either a success or a failure.
As we change the rate parameter, λ, we change the probability of seeing different numbers of events in one interval. The most likely number of events in the interval for each curve is the rate parameter. This makes sense because the rate parameter is the expected number of events in the interval.
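The claim that the curve peaks at the rate parameter can be checked numerically. More precisely, for a non-integer λ the most likely count is floor(λ); a small sketch (reusing a hand-written poisson_pmf, with arbitrarily chosen rates):

```python
from math import exp, factorial, floor

def poisson_pmf(x, lam):
    # lam^x * e^(-lam) / x!
    return (lam ** x) * exp(-lam) / factorial(x)

# For each rate, find the count x that maximizes the PMF.
# The mode equals floor(lam): the rate parameter is, up to rounding,
# the most likely number of events in the interval.
for lam in (1.5, 4.5, 7.5):
    mode = max(range(40), key=lambda x: poisson_pmf(x, lam))
    print(lam, mode, floor(lam))
```

(Non-integer rates are used here deliberately: for integer λ the PMF takes the same value at λ-1 and λ, so the "mode" is tied between the two.)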
Relationship between Binomial and Poisson
To compare the above two discrete probability distributions, let's consider the example of a car manufacturer. For any such manufacturing plant, the number of defective cars is a critical KPI, and it relates to both distributions discussed above: the number of defective cars among 'n' units manufactured is modeled by a Bernoulli process (and hence a binomial distribution), while the number of defective cars per unit of time is modeled by a Poisson process.
Mathematically, another relation between the two distributions is that when 'n' is large and 'p' is small, the Poisson distribution with λ = np closely approximates the binomial distribution with parameters n and p.
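This approximation is easy to see numerically. A sketch comparing the two PMFs term by term (the defect numbers n = 1000 and p = 0.002 are illustrative, chosen so that λ = np = 2):

```python
from math import comb, exp, factorial

def binomial_pmf(r, n, p):
    # C(n, r) * p^r * (1-p)^(n-r)
    return comb(n, r) * p**r * (1 - p)**(n - r)

def poisson_pmf(x, lam):
    # lam^x * e^(-lam) / x!
    return (lam ** x) * exp(-lam) / factorial(x)

# Large n, small p: e.g. 1000 cars with defect probability 0.002 each.
n, p = 1000, 0.002
lam = n * p  # = 2
for k in range(5):
    print(k, round(binomial_pmf(k, n, p), 5), round(poisson_pmf(k, lam), 5))
```

For these parameters the two columns agree to roughly three decimal places at every count k.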