Basic Machine Learning Models

Supervised Learning

k-Nearest Neighbor (kNN)

Solves: Classification, Regression

Method: kNN estimates the label of a new point by aggregating the labels of its \(k\) closest points in the dataset (majority vote for classification, average for regression).

To reduce the chance of ties in the vote, it is a good idea to choose a \(k\) that is not a multiple of the number of classes (e.g. an odd \(k\) for binary classification).
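
As an illustration, here is a minimal NumPy sketch of the classification case (majority vote among the \(k\) nearest neighbours); the toy data and the `knn_predict` helper are made up for this example:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among the k closest training points."""
    # Euclidean distance from x_new to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k nearest neighbours
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy dataset: two 2-D classes
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))  # -> 1
```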

Pros/Cons:

  • Simplicity.
  • If the dataset is very large, finding the closest points can be slow (each prediction compares against the whole training set).

Characteristics: discriminative, non-parametric

Decision Trees

Solves: Classification, Regression

Method: Learn decision rules based on the data features. These rules are encoded by a tree where each node represents a condition and its branches the possible outcomes. The conditions are chosen greedily to maximize information gain (the split of the data which results in the lowest possible entropy).

Pros/Cons:

  • Intuitive and interpretable.
  • No need to pre-process the data.
  • Low bias (few prior assumptions about the model are made).
  • Very effective; tree-based methods often appear in winning Kaggle solutions.
  • High variance (results depend heavily on the particular training data).

This high-variance issue is often addressed with variance-reduction ensemble methods. In fact, there is a technique specially tailored to trees called random forest: a set of different trees is trained on different subsets of the data (bagging) and on different random subsets of the features.
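
For illustration, a short scikit-learn sketch with made-up toy data: `criterion="entropy"` makes the single tree choose splits by information gain, while the random forest trains many trees on bootstrap samples and random feature subsets:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Toy dataset: 2 features, binary label
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3]])
y = np.array([0, 0, 0, 1, 1, 1])

# Single tree: splits chosen by information gain (entropy criterion)
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Random forest: many trees on bootstrap samples + random feature subsets
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt").fit(X, y)

print(tree.predict([[1, 2]]), forest.predict([[1, 2]]))
```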

Characteristics: discriminative

Naive Bayes

Solves: Classification

Method: Given a dataset of labelled points in a \(d\)-dimensional space, a naive Bayes classifier approximates the probability of each class for a new point \(\vec{x} = (x^1, \dots, x^d)\) by assuming independence between its features (hence the name naive):

\begin{equation} p(\vec{x}) = p(x^1, \dots, x^d) \simeq p(x^1) \cdots p(x^d) \end{equation}

By applying the Bayes theorem:

\begin{equation} p(c_i \mid \vec{x}) = \frac{p(\vec{x} \mid c_i)\, p(c_i)}{p(\vec{x})} \simeq \frac{p(x^1 \mid c_i) \cdots p(x^d \mid c_i)\, p(c_i)}{p(x^1) \cdots p(x^d)} \end{equation}

We can then estimate:

  • \(p(x^j \mid c_i)\) as the fraction of points of class \(c_i\) whose \(j\)-th feature takes the value \(x^j\).
  • \(p(c_i)\) as the fraction of points labelled \(c_i\) in the dataset.

Notice that we don’t actually need to compute the evidence (the denominator): it only acts as a normalization factor and is constant across all classes.
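
A minimal counting-based sketch of these estimates for categorical features (toy data, no smoothing; the `naive_bayes_predict` helper is made up for illustration):

```python
import numpy as np

def naive_bayes_predict(X, y, x_new):
    """Pick the class maximizing p(c_i) * prod_j p(x^j | c_i), estimated by counting.
    Assumes categorical features; purely illustrative, no smoothing."""
    classes = np.unique(y)
    scores = {}
    for c in classes:
        X_c = X[y == c]
        prior = len(X_c) / len(X)             # p(c_i)
        likelihood = 1.0
        for j, value in enumerate(x_new):     # p(x^j | c_i) by counting
            likelihood *= np.mean(X_c[:, j] == value)
        scores[c] = prior * likelihood        # evidence omitted: same for all classes
    return max(scores, key=scores.get)

# Toy categorical dataset: features = (weather, wind), label = play / don't play
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [1, 0]])
y = np.array([1, 0, 1, 0, 1])
print(naive_bayes_predict(X, y, np.array([1, 0])))  # -> 1
```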

Pros/Cons:

  • Simplicity, easy implementation, speed…
  • Works well in high dimensions.
  • Features in the real world are usually not independent, so the naive assumption rarely holds exactly.

Characteristics: generative, non-parametric

Support Vector Machines (SVMs)

Solves: Binary classification

Method: Learn the parameters of the hyperplane which “best” splits the data. This is framed as a convex optimization problem whose objective is to find the largest margin between the two classes of points. A subset of points from each class determines the hyperplane position; these points are known as support vectors.

  • The Maximum Margin Classifier attempts to find a “hard” boundary between the two data groups. It is, however, very susceptible to outliers in the training data (a high-variance problem).

  • The Soft Margin approach allows some points to be misclassified; it is more biased but generally performs better. Cross-validation is often used to choose how much misclassification to allow (and hence which support vectors are used).

Schematically, SVM has this form:

\begin{equation} \vec{x} \rightarrow \text{Linear function } \langle \vec{x}, \vec{w} \rangle \rightarrow \begin{cases} \text{if } \geq 1 \rightarrow \text{class } 1 \newline \text{if } \leq -1 \rightarrow \text{class } {-1} \end{cases} \end{equation}

Kernel

Often, the data is not linearly separable. We can apply a non-linear transformation into a (possibly higher-dimensional) space and perform the linear separation there. We call this non-linear transformation the kernel.

Pros/Cons:

  • Works well in high-dimensional spaces.
  • Works well with small datasets.
  • Need to choose a kernel.

Kernel trick: One can frame the problem in such a way that the only information needed is the dot product between points. For some kernels we can calculate the dot product of the transformation of two points without actually needing to transform them, which makes the overall computation much more efficient. We already saw this idea in dimensionality reduction algorithms.
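
As a sketch of the kernel idea with scikit-learn (the radial toy data is made up for illustration): a linear kernel struggles on data that is not linearly separable, while an RBF kernel separates it in a transformed space; the parameter `C` controls how soft the margin is:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: the class depends on the radius, so it is not linearly separable
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

# Linear kernel vs. RBF kernel; C controls the softness of the margin
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0).fit(X, y)

print("linear accuracy:", linear_svm.score(X, y))
print("rbf accuracy:   ", rbf_svm.score(X, y))
# The fitted support vectors are available as rbf_svm.support_vectors_
```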

Characteristics: discriminative

Logistic Regression

Solves: Binary classification

Method: The model has two steps:

  1. Apply a linear function to the input \(\vec{x}\) (parametrized by \(\vec{w}\)).
  2. Apply the sigmoid function to this output: \(f(x) = \frac{e^x}{e^x + 1}\).

Therefore:

\begin{equation} \hat p (\vec{x}) = \frac{e^{\langle \vec{w}, \vec{x} \rangle}}{e^{\langle \vec{w}, \vec{x} \rangle} + 1} \end{equation}

The parameters \(\vec{w}\) are learned through maximum likelihood estimation (MLE). Notice the similarity with SVMs:

\begin{equation} \vec{x} \rightarrow \text{Linear function } \langle \vec{x}, \vec{w} \rangle \rightarrow \text{sigmoid} \rightarrow \begin{cases} \text{if } \geq 0.5 \rightarrow \text{class } 1 \newline \text{if } < 0.5 \rightarrow \text{class } 0 \end{cases} \end{equation}

We can therefore also think of logistic regression as a single-neuron ANN with a sigmoid activation function, whose parameters \(\vec{w}\) are found through MLE.
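
A minimal NumPy sketch of the MLE step via gradient ascent on the log-likelihood (the toy data and the `fit_logistic` helper are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=1000):
    """Learn w by gradient ascent on the log-likelihood (MLE).
    A bias column is appended so the boundary need not pass through the origin."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = sigmoid(Xb @ w)          # predicted probabilities
        w += lr * Xb.T @ (y - p)     # gradient of the log-likelihood
    return w

# Toy 1-D data
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w = fit_logistic(X, y)
print(sigmoid(np.array([2.5, 1.0]) @ w))  # probability of class 1 for x = 2.5
```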

Pros/Cons:

  • Simplicity.
  • Susceptible to outliers.
  • Negatively affected by high correlations in the predictors.
  • Limited to linear decision boundaries.
  • Generalized (and easily outperformed) by ANNs.

Characteristics: discriminative

Linear Regression

Solves: Regression

Method: Learn the parameters of a linear transformation of the inputs by minimizing the squared error (least squares).

Ridge regression: adds an L2 penalty on the weights to the least-squares objective, shrinking the coefficients and reducing variance.

Lasso regression: adds an L1 penalty on the weights instead, which pushes some coefficients to exactly zero and therefore performs feature selection.
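
A quick scikit-learn sketch of plain least squares next to its regularized variants (the toy data is made up; `alpha` controls the penalty strength):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Toy data: y ~ 3*x + 1 with some noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X[:, 0] + 1 + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X, y)    # plain least squares
ridge = Ridge(alpha=1.0).fit(X, y)    # L2-penalized least squares
lasso = Lasso(alpha=0.1).fit(X, y)    # L1-penalized least squares (sparse weights)

print(ols.coef_, ridge.coef_, lasso.coef_)
```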

Pros/Cons:

  • Works well with small datasets.
  • Susceptible to outliers and to high correlations between predictors.

Characteristics: discriminative

Unsupervised Learning

Clustering

k-means

Method: Randomly place \(k\) centroids and iteratively repeat these two steps until convergence:

  • Assign each point of the dataset to the closest centroid.
  • Move the centroid to the center of mass of its assigned points.
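
A minimal NumPy sketch of these two steps (Lloyd's algorithm; the toy blobs are made up, and the sketch assumes every centroid keeps at least one assigned point):

```python
import numpy as np

def kmeans(X, k, steps=100, seed=0):
    """Alternate assignment and centroid-update steps until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(steps):
        # 1. Assign each point to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # 2. Move each centroid to the center of mass of its assigned points
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 5])
print(kmeans(X, k=2)[0])
```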

Pros/Cons:

  • Simple.
  • The number of clusters needs to be set in advance.
  • Isotropic: only works well for roughly spherical clusters.
  • Convergence to the global optimum is not guaranteed.

Expectation Maximization