In grade 9, I was introduced to Bayes' Theorem through the classic example of medical tests. I was intrigued by how deceptive probability can be, but at the time I didn't fully appreciate the theorem's applications. While the medical-test example may be especially relevant with the ongoing pandemic, the theorem itself didn't amaze me that much. What is amazing, though, is how it can be used in the context of machine learning.
Note: If you are fully versed in Bayes' Theorem, you can skip this section and the subsequent example with cancer tests and dive right into the Naive Bayes section.
First, recall the definition of conditional probability: $$P(Y|X)=\frac{P(Y \cap X)}{P(X)}$$ $P(Y|X)$ means the probability of Y given X. Alternatively, you can think of it as the probability of output Y given the input X. Now, notice that we can simply swap the variables: $$P(X|Y)=\frac{P(X \cap Y)}{P(Y)}$$ But $P(Y \cap X)$ and $P(X \cap Y)$ are exactly the same probability - just the probability of both X and Y occurring. This means that we can take the second equation and rewrite it as $$P(X \cap Y)=P(X|Y)P(Y)$$ Now, we can substitute this into the first equation to get $$\bbox[yellow]{P(Y|X)=\frac{P(X|Y)P(Y)}{P(X)}}$$ For terminology purposes, $P(Y)$ is known as the Prior, $P(X|Y)$ is the Likelihood and $P(Y|X)$ is the Posterior.
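As a quick sanity check, the identity can be verified numerically for any pair of events. Here's a minimal sketch with made-up numbers (all three probabilities below are hypothetical):

```python
# Hypothetical probabilities for two events X and Y
p_x_and_y = 0.12   # P(X ∩ Y), the joint probability
p_x = 0.30         # P(X)
p_y = 0.40         # P(Y)

# Conditional probabilities straight from the definition
p_y_given_x = p_x_and_y / p_x
p_x_given_y = p_x_and_y / p_y

# Bayes' Theorem: P(Y|X) = P(X|Y) P(Y) / P(X)
bayes = p_x_given_y * p_y / p_x

print(p_y_given_x)  # → 0.4
print(bayes)        # → 0.4 (the two routes agree)
```

Both routes give the same answer, because the theorem is just the definition of conditional probability written twice and combined.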
Suppose there is a cancer that occurs in 1% of the population, or $P(C)=0.01$. Suppose also that there is a test for it which correctly returns positive 90% of the time when the cancer is present, and correctly returns negative 95% of the time when it is absent (so it gives a false positive 5% of the time).
In the jargon of medical diagnosis, the first probability (90%) is known as the sensitivity and the second (95%) is known as the specificity.
Now the question is: suppose you take the test, and the test returns positive. What is the probability of you having the cancer?
In other words, we are seeking $P(C|pos)$, the probability that we have cancer given the test returned positive. Let's translate the rest of the information into mathematical notation as well:
To calculate $P(pos)$, we apply the law of total probability, summing the joint probabilities over both cases: $$P(pos)=P(C, pos)+P(\neg C, pos)=(0.01)(0.9)+(0.99)(0.05)=0.0585$$ Therefore, $$P(C|pos)=\frac{P(pos|C)P(C)}{P(pos)}$$ $$=\frac{(0.9)(0.01)}{0.0585}=2/13\approx0.154$$
This means that even if the test returned positive, you only have a 15% chance of actually having cancer.
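The whole calculation above fits in a few lines of Python, using the numbers from the example:

```python
# Numbers from the example above
p_c = 0.01          # prior: P(C), prevalence of the cancer
p_pos_c = 0.90      # sensitivity: P(pos | C)
p_pos_not_c = 0.05  # false positive rate: P(pos | ¬C) = 1 - specificity

# Total probability of a positive test (law of total probability)
p_pos = p_c * p_pos_c + (1 - p_c) * p_pos_not_c

# Posterior via Bayes' Theorem
p_c_pos = p_pos_c * p_c / p_pos

print(round(p_c_pos, 3))  # → 0.154
```

Playing with the prior here is instructive: raising the prevalence to 10% pushes the posterior above 66%, which shows how much the answer depends on the base rate.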
Naive Bayes is a supervised classification algorithm that makes a naive assumption when applying Bayes' Theorem. But first, let's just apply Bayes' Theorem: $$P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$$ where $x_1, x_2, \dots, x_n$ represent the features of the data.
Now, the naive assumption is that the features $x_i$ and $x_j$ are conditionally independent given $y$, for all $i \neq j$. In other words, it is assumed that, within each class, the features of the data are all independent of one another.
Mathematically, we can write $$\begin{aligned} P(x_1, x_2, \dots, x_n \mid y)&=P(x_1|y) \cdot P(x_2|y) \cdot P(x_3|y) \cdot \dots \cdot P(x_n|y) \\ &= \prod_{i=1}^{n}P(x_i \mid y) \end{aligned}$$ To gain an intuition for this assumption, I highly recommend you check out this article on Towards Data Science.
But since $P(x_1, x_2, \dots, x_n)$ is constant for a given input, we can introduce a proportionality: $$P(y \mid x_1, \dots, x_n) \propto P(y)\prod_{i=1}^{n}P(x_i \mid y)$$
Finally, to classify a sample, we pick the class $\hat{y}$ that maximizes this posterior - the maximum a posteriori (MAP) decision rule: $$\hat{y} = \arg\max_y \left[P(y) \prod_{i=1}^{n} P(x_i \mid y)\right]$$
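To make the argmax concrete, here is a minimal sketch with two classes and three binary features. All the class names and probability values are hypothetical, chosen only to illustrate the decision rule:

```python
# Hypothetical priors P(y) and per-feature likelihoods P(x_i = 1 | y)
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {
    "spam": [0.8, 0.6, 0.1],
    "ham":  [0.1, 0.4, 0.5],
}

def posterior_score(y, x):
    # Unnormalized posterior: P(y) * prod_i P(x_i | y).
    # For a feature that is 0, use the complement 1 - P(x_i = 1 | y).
    score = priors[y]
    for xi, p in zip(x, likelihoods[y]):
        score *= p if xi == 1 else (1 - p)
    return score

x = [1, 1, 0]  # observed features
y_hat = max(priors, key=lambda y: posterior_score(y, x))
print(y_hat)  # → spam
```

Note that the scores are never normalized by $P(x_1, \dots, x_n)$; since that denominator is the same for every class, it cannot change which class wins the argmax.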
There are three main types of naive Bayes depending on the distribution of $P(x_i \mid y)$: Gaussian, Multinomial and Bernoulli.
For example, this is the Gaussian distribution that I'm sure you have come across before - it's literally just the bell curve: $$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)$$ Here, to find the parameters $\mu_y$ and $\sigma^2_y$ (the per-class mean and variance), we use a method called maximum likelihood estimation that I briefly mentioned in my other post, The International Baccalaureate Examinations: Item Response Theory (go check it out if you haven't already!)
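For a Gaussian, the maximum likelihood estimates turn out to be just the per-class sample mean and variance. A quick sketch (the function name and sample values are my own, for illustration):

```python
import numpy as np

def gaussian_likelihood(x, mu, var):
    # P(x_i | y) under a Gaussian with class mean mu and variance var
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hypothetical 1-D feature values observed for one class
samples = np.array([4.9, 5.1, 5.0, 4.8, 5.2])

mu = samples.mean()   # MLE of the mean
var = samples.var()   # MLE of the variance (the 1/n, "biased" estimator)

print(gaussian_likelihood(5.0, mu, var))
```

A Gaussian naive Bayes classifier simply repeats this fit once per feature per class, then plugs the resulting likelihoods into the MAP rule.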
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the iris dataset and split it in half for training and testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Fit a Gaussian naive Bayes classifier and predict the held-out labels
classifier = GaussianNB()
classifier.fit(X_train, y_train)
prediction = classifier.predict(X_test)

print(prediction)
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0], (y_test != prediction).sum()))

accuracy = classifier.score(X_test, y_test)
print(accuracy)
```
```
[2 1 0 2 0 2 0 1 1 1 1 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0 1 1 1 2 0 2 0 0 1 2 2 1 2 1 2 1 1 2 1 1 2 1 2 1 0 2 1 1 1 1 2 0 0 2 1 0 0 1]
Number of mislabeled points out of a total 75 points : 4
0.9466666666666667
```
This is example code from the sklearn documentation, but it shows just how powerful the naive Bayes algorithm is despite its very simple nature - an accuracy of 94.7% is quite good!
Machine learning is an extremely fast-growing field that opens a whole new door of possibilities in countless domains. It is an amalgamation of various mathematical topics including linear algebra, calculus, statistics and probability. There's so much to cover that it's difficult to do an algorithm justice in a single post. I will probably be writing similar articles on other ML algorithms, so stay tuned!