Random variable: a mapping from outcomes to real values that eases probability calculations.
Independence: the occurrence of an event \(A_1\) does not affect the occurrence of another event \(A_2\): \(p(A_1 \cap A_2) = p(A_1) \cdot p(A_2)\)
MLE finds the parameters that best explain the data, i.e. those that achieve the highest likelihood under the assumed distribution (chosen through expert knowledge or distribution tests):
\begin{equation} \theta^\star = \arg \max_\theta (\mathcal{L} (\mathcal{D}, \theta)) \end{equation}
In some cases (see Deep Generative Models) the assumed functional form of the likelihood is so complex that this optimization cannot be done analytically. We then resort to numerical optimization techniques, such as stochastic gradient ascent on the likelihood.
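As a minimal sketch (not from the notes above), assuming NumPy and an illustrative exponential model with rate \(\lambda\), plain gradient ascent on the log-likelihood recovers the same estimate as the closed-form MLE \(1 / \bar{x}\):

```python
import numpy as np

# Sketch: gradient ascent on the log-likelihood of an Exponential(lam) model.
# (Illustrative only; here a closed form exists, lam = 1 / mean, used as a sanity check.)
rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=10_000)   # true rate = 0.5

lam, lr = 1.0, 1e-5
for _ in range(2_000):
    grad = len(data) / lam - data.sum()          # d/dlam of sum(log lam - lam * x)
    lam += lr * grad                             # ascend the log-likelihood

print(lam, 1.0 / data.mean())                    # both should be ~0.5
```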
Toss of a coin with probability of heads \(\lambda \in [0, 1]\).
\begin{equation} X \in \{0, 1\} \end{equation}
\begin{equation} p(X = x \mid \lambda) = \begin{cases} \lambda & \textrm{if } x = 1 \newline 1-\lambda & \textrm{if } x = 0 \end{cases} \end{equation}
\[E [X] = \lambda \cdot 1 + (1 - \lambda) \cdot 0 = \lambda\]\[V [X] = E [(X - E [X])^2] = E[X^2] - E[X]^2 = \lambda - \lambda^2 = \lambda (1 - \lambda)\]
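A quick empirical check of these formulas (assuming NumPy; a Bernoulli is sampled as a Binomial with \(n = 1\)):

```python
import numpy as np

# Empirical check of E[X] = lam and V[X] = lam * (1 - lam).
lam = 0.3
rng = np.random.default_rng(0)
samples = rng.binomial(n=1, p=lam, size=100_000)   # Bernoulli = Binomial with n = 1

print(samples.mean(), lam)               # ~0.3
print(samples.var(), lam * (1 - lam))    # ~0.21
```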
Toss of a die with \(k\) faces and probabilities \(\lambda_1, ..., \lambda_k\), where \(\sum_i \lambda_i = 1\) and \(\lambda_i \in [0, 1] \; \forall i\).
\begin{equation} X \in \{1:k\} \end{equation}
\begin{equation} p(X=x \mid \lambda_{i=1:k}) = \begin{cases} \lambda_1 & \textrm{if } x = 1 \newline \lambda_2 & \textrm{if } x = 2 \newline … \end{cases} \end{equation}
Categorical Cross-Entropy is often used as a Loss function for ML models. Remember that the cross-entropy between the data distribution and the approximate one is:
\[\mathcal{H} (P, Q) = E_P [I_Q] = - \sum_i p_i \log q_i = - \sum_i \lambda_i \log (\hat \lambda_i)\]Usually in classification tasks \(\lambda_i\) is a hard label, so the loss associated with a datapoint becomes \(- \log (\hat \lambda_i)\) (the negative log of the predicted probability of the true class). Remember that when the data distribution is fixed, the KL divergence \(D_{KL} (P \Vert Q) = E_P [I_Q - I_P]\) is just the cross-entropy \(\mathcal{H} (P, Q)\) shifted by a constant (the entropy of \(P\)), so minimizing one minimizes the other.
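A minimal sketch of this loss for one datapoint, assuming NumPy and hypothetical softmax-like predicted probabilities:

```python
import numpy as np

# Cross-entropy loss for one datapoint with a hard label.
probs = np.array([0.1, 0.7, 0.2])   # hypothetical predicted class probabilities
true_class = 1                      # one-hot label [0, 1, 0]

loss = -np.log(probs[true_class])   # reduces to -log(predicted prob of the true class)
print(loss)                         # ~0.357
```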
Number of successes in \(n\) independent Bernoulli trials (each with success probability \(\lambda \in [0, 1]\)).
\begin{equation} X \in \{0:n\} \end{equation}
\begin{equation} p(X = x \mid \lambda) = {n \choose x} \lambda^x (1 - \lambda)^{n - x} \end{equation}
For large \(n\) and \(\lambda \simeq \frac{1}{2}\) it behaves as a discretization of the normal distribution.
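A rough check of this normal approximation, assuming SciPy's `binom` and `norm`:

```python
import numpy as np
from scipy.stats import binom, norm

# For large n, the Binomial pmf is close to a Normal with mean n*lam and variance n*lam*(1-lam).
n, lam = 1_000, 0.5
xs = np.arange(450, 551)

pmf = binom.pmf(xs, n, lam)
approx = norm.pdf(xs, loc=n * lam, scale=np.sqrt(n * lam * (1 - lam)))
print(np.max(np.abs(pmf - approx)))   # small
```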
It models the counts of a \(k\)-sided die rolled \(n\) times. I.e. \(x_i\) counts the number of times side \(i\) appeared when rolling the die \(n\) times, thus:
\begin{equation} x_{i \in 1:k} \in \{0:n\} \end{equation}
Where: \(\sum_{i=1}^k x_i = n\)
\begin{equation} p(X=\{x_1, …, x_k\} \mid n, \lambda_{i=1:k}) = {n \choose {x_1, …, x_k}} \lambda_1^{x_1} … \lambda_k^{x_k} \end{equation}
Notice this is a generalization of the binomial distribution to categorical variables.
For large \(n\) it behaves as a discretization of a multivariate normal distribution.
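A small illustration, assuming SciPy's `multinomial` and NumPy's multinomial sampler with a hypothetical loaded 3-sided die:

```python
import numpy as np
from scipy.stats import multinomial

# Counts of each face after rolling a loaded 3-sided die n times.
n, lams = 10, [0.2, 0.3, 0.5]

print(multinomial.pmf([2, 3, 5], n=n, p=lams))        # probability of this exact count vector
print(np.random.default_rng(0).multinomial(n, lams))  # one sampled count vector, sums to n
```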
Counts the number of Bernoulli trials needed to obtain the first success (with success probability \(\lambda\)).
\begin{equation} x \in \{1, 2, 3, …\} \end{equation}
\begin{equation} p(X = x \mid \lambda) = (1 - \lambda)^{x-1} \lambda \end{equation}
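A quick check against SciPy's `geom` (which, like the formula above, counts trials until the first success):

```python
from scipy.stats import geom

# Probability that the first success arrives exactly on trial x.
lam = 0.25
for x in range(1, 5):
    print(x, geom.pmf(x, lam), (1 - lam) ** (x - 1) * lam)   # both columns match
```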
Counts the number of independent random events happening in a fixed period of time (or space). Imagine that on average \(\lambda \in \mathbb{R}_{> 0}\) events happen within a time period; then we can get the probability of \(x\) events happening by evaluating the Poisson distribution.
\begin{equation} x \in \mathbb{N} \end{equation}
\begin{equation} p(X = x \mid \lambda) = \frac{\lambda^x e^{- \lambda}}{x!} \end{equation}
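A small illustration, assuming SciPy's `poisson` and a hypothetical rate of 4 events per period:

```python
from scipy.stats import poisson

# If on average lam = 4 events happen per period, probability of observing x of them.
lam = 4.0
for x in range(7):
    print(x, poisson.pmf(x, lam))   # peaks around x = 4
```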
Assigns the same density to all values within a range.
\begin{equation} x \in [a, b] \end{equation}
\begin{equation} PDF(x, a, b) = \frac{1}{b-a} \end{equation}
MLE: Given a dataset \(\mathcal{D} = \{ x_i \}_{i=1:n}\), the MLE estimate of its parameters \(a, b\) is: \(a = \min_i {x_i}\), \(b = \max_i {x_i}\). We can easily see this by expressing the likelihood of the dataset:
\begin{equation} \mathcal{L} (\mathcal{D}, a, b) = \left( \frac{1}{b - a} \right)^n \end{equation}
To maximize the likelihood we need to minimize \(b - a\), subject to \(x_i \in [a, b] \; \forall i\). Thus, the maximum is achieved by taking \(b\) as the largest datapoint and \(a\) as the smallest.
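A minimal check of this, assuming NumPy and a hypothetical \(U(2, 5)\) dataset:

```python
import numpy as np

# The MLE of Uniform(a, b) parameters is simply the sample min and max.
rng = np.random.default_rng(0)
data = rng.uniform(2.0, 5.0, size=1_000)

a_hat, b_hat = data.min(), data.max()
print(a_hat, b_hat)   # close to (2, 5), always inside the true interval
```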
\begin{equation} x \in \mathbb{R} \end{equation}
\begin{equation} PDF(x, \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{1}{2} \frac{(x - \mu)^2}{\sigma^2}} \end{equation}
Central limit theorem: the sum (or mean) of independent samples from any distribution with finite variance tends to follow a normal distribution as the number of samples grows.
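A quick illustration, assuming NumPy and using uniform samples as the arbitrary source distribution:

```python
import numpy as np

# Means of 30 uniform samples concentrate around a normal shape.
rng = np.random.default_rng(0)
means = rng.uniform(0, 1, size=(50_000, 30)).mean(axis=1)

# CLT prediction: mean 0.5, std sqrt((1/12) / 30)
print(means.mean(), means.std(), np.sqrt((1 / 12) / 30))
```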
Models the sum of \(k\) squared standard normal variables.
\begin{equation} x \in (0, \infty) \end{equation}
\begin{equation} PDF(x, k) = \frac{1}{\Gamma (\frac{k}{2}) 2^{\frac{k}{2}}} x^{\frac{k}{2} - 1} e^{- \frac{x}{2}} \end{equation}
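A small simulation check of this definition, assuming NumPy and SciPy's `chi2`:

```python
import numpy as np
from scipy.stats import chi2

# Sum of k squared standard normals vs. a chi-squared with k degrees of freedom.
k = 5
rng = np.random.default_rng(0)
samples = (rng.standard_normal((100_000, k)) ** 2).sum(axis=1)

print(samples.mean(), chi2.mean(k))   # both ~k
print(samples.var(), chi2.var(k))     # both ~2k
```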
Measures the probability of the amount of time between two Poisson-type events. \(\lambda\) again is the expected number of events within the time period.
\begin{equation} x \in [0, \infty) \end{equation}
\begin{equation} PDF(x, \lambda) = \lambda e^{- \lambda x} \end{equation}
Can be thought of as a continuous version of a Geometric distribution. “Number of failures until one success” is analogous to “time until event”.
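A minimal sketch, assuming SciPy's `expon` (note SciPy parameterizes by scale \(= 1 / \lambda\)):

```python
import numpy as np
from scipy.stats import expon

# Waiting time between Poisson events with rate lam follows an Exponential(lam).
lam = 2.0                            # expected events per time period
x = np.linspace(0, 3, 4)
print(expon.pdf(x, scale=1 / lam))   # lam * exp(-lam * x)
print(expon.mean(scale=1 / lam))     # expected waiting time = 1 / lam = 0.5
```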
In the same way that the exponential distribution predicts the amount of time until the first Poisson event, the gamma distribution predicts the time until the \(k\)-th Poisson event.
\begin{equation} x \in (0, \infty) \end{equation}
It presents two parameterizations. One with a shape parameter (the "number" of events \(k\)) and a scale parameter (\(\theta\)):
\begin{equation} PDF(x, k, \theta) = \frac{1}{\Gamma (k) \theta^k} x^{k - 1} e^{- \frac{x}{\theta}} \end{equation}
And one with a shape parameter (\(\alpha = k\), the "number" of events) and a rate parameter (\(\beta = \frac{1}{\theta}\), the rate \(\lambda\) at which Poisson events occur):
\begin{equation} PDF(x, \alpha, \beta) = \frac{\beta^\alpha}{\Gamma (\alpha)} x^{\alpha - 1} e^{- \beta x} \end{equation}
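A quick simulation check of the "time until the \(k\)-th event" interpretation, assuming NumPy and SciPy's `gamma`:

```python
import numpy as np
from scipy.stats import gamma

# Waiting time until the k-th Poisson event (rate lam) = sum of k exponential gaps ~ Gamma(k, theta = 1/lam).
k, lam = 3, 2.0
rng = np.random.default_rng(0)
waits = rng.exponential(scale=1 / lam, size=(100_000, k)).sum(axis=1)

print(waits.mean(), gamma.mean(k, scale=1 / lam))   # both ~ k / lam = 1.5
```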
Statistical testing is used to determine whether a hypothesis is statistically significant. The main steps are as follows:
Common tests: z-test, t-test, ANOVA, \(\chi^2\)-test…
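As an illustrative sketch only, a two-sample t-test on hypothetical data with SciPy's `ttest_ind`:

```python
import numpy as np
from scipy.stats import ttest_ind

# Two-sample t-test on hypothetical data; a small p-value suggests the means differ.
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=200)
group_b = rng.normal(0.3, 1.0, size=200)

stat, p_value = ttest_ind(group_a, group_b)
print(stat, p_value)   # p_value below e.g. 0.05 would reject "equal means"
```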