Random thoughts

Share to Learn, Learn to Share

Regression and Learning Models

In this series of posts, I will informally discuss some basic topics in machine learning theory which I've learnt through my own research experience, lectures, etc. Feel free to leave your comments and share your thoughts in the comment section below.

In this very first post, let's get started with one of the most standard problems in the supervised learning setting, called regression, and some common learning models.

Regression as function approximation

We consider the problem of estimating an unknown function $f(x)$ from a set of samples $\{(x_i, y_i)\}_{i=1}^{n}$, where $y_i = f(x_i) + \epsilon_i$ and the $\epsilon_i$ are i.i.d. Gaussian noise. This is a common formulation of the regression problem in the standard supervised learning setup, where $\{(x_i, y_i)\}_{i=1}^{n}$ are often referred to as training samples, $x_i \in \mathbb{R}^d$ is a d-dimensional input vector, and $y_i$ is its corresponding output. For simplicity, we consider only the case where $y_i$ is a scalar.
One way to solve the problem above is to search for the true function within a set of functions $\{f_\theta(x)\}$ parameterized by a parameter vector $\theta$ (a.k.a. a parametric model indexed by the parameter $\theta$). Obviously, if we don't have any prior knowledge (about the form of $f(x)$) other than the training samples, it's almost impossible to recover the exact true function $f(x)$. Instead, we try to find a function in a given model which best approximates $f(x)$. That's why we can see regression as a function approximation problem. We often "learn" the parameter $\theta$ from training samples by casting our problem into an optimization problem, e.g., minimizing the approximation error. An estimate of $f(x)$, denoted $\hat{f}(x)$, can be obtained by substituting the optimized parameter $\hat{\theta}$ into the model formulation. I will return to this point in later posts.
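To make the workflow above concrete, here is a minimal sketch in Python/NumPy of generating training samples $y_i = f(x_i) + \epsilon_i$ and learning $\theta$ by minimizing the squared approximation error. The sinusoidal true function, the noise level, and the degree-3 polynomial model are all illustrative assumptions, not choices made in this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# True (unknown) function f(x) -- an illustrative assumption.
def f(x):
    return np.sin(2 * np.pi * x)

# Training samples: y_i = f(x_i) + eps_i, with i.i.d. Gaussian noise.
n = 50
x = rng.uniform(0.0, 1.0, size=n)
y = f(x) + rng.normal(0.0, 0.1, size=n)

# Parametric model: degree-3 polynomial, f_theta(x) = sum_j theta_j x^j.
# "Learning" theta = solving a least-squares optimization problem.
Psi = np.vander(x, N=4, increasing=True)   # design matrix [1, x, x^2, x^3]
theta_hat, *_ = np.linalg.lstsq(Psi, y, rcond=None)

# f_hat(x): plug the optimized parameter back into the model formulation.
def f_hat(x_new):
    return np.vander(np.atleast_1d(x_new), N=4, increasing=True) @ theta_hat

print(theta_hat)        # learned parameters
print(f_hat(0.25))      # prediction at a new input
```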
Below, we discuss some commonly used parametric models, with the general form $f_\theta(x)$.

Learning models

1. Linear model
Instead of the linear-in-input-feature model, which is often introduced in introductory stats/ML books and courses, we consider a more general linear-in-parameter model:

$f_\theta(x) = \sum_{j=1}^{b} \theta_j \psi_j(x) = \theta^\top \psi(x)$, where $\psi(x) = (\psi_1(x), \ldots, \psi_b(x))^\top$ is a vector of basis functions and $\theta = (\theta_1, \ldots, \theta_b)^\top$ is the parameter vector.

This linear-in-parameter model includes the former as a special case. We might think this type of model is quite limited due to its linearity, but it's actually quite flexible. In particular, we can customize the basis functions $\{\psi_j(x)\}$ as freely as we want based on the specific problem. For example, polynomial basis functions or trigonometric basis functions are common choices when $d = 1$. For the high-dimensional case, some powerful linear models can be used (a fitting sketch follows the list below):
+ Multiplicative model: it can model very complicated functions in high dimensions, but due to its very large number of parameters (exponential order w.r.t. the dimensionality d), it can only be applied to not-so-high-dimensional cases.

$f_\theta(x) = \sum_{j_1=1}^{b'} \cdots \sum_{j_d=1}^{b'} \theta_{j_1, \ldots, j_d} \psi_{j_1}(x^{(1)}) \cdots \psi_{j_d}(x^{(d)})$, where $x^{(k)}$ is the k-th element of $x$ and $\{\psi_j\}_{j=1}^{b'}$ are one-dimensional basis functions, giving $(b')^d$ parameters.

+ Additive model: much simpler, with a smaller number of parameters (linear order w.r.t. the dimensionality d) than the multiplicative model. Obviously, its expressiveness is more limited than that of the multiplicative model.

$f_\theta(x) = \sum_{k=1}^{d} \sum_{j=1}^{b'} \theta_{k,j} \psi_j(x^{(k)})$, giving $b' \times d$ parameters.
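As a sketch of how such a model is fit in practice, the snippet below builds the design matrix of an additive model with Gaussian basis functions applied to each coordinate and solves for $\theta$ by least squares. The Gaussian basis, its centers, the bandwidth $h$, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def gaussian_basis(u, centers, h=0.3):
    """One-dimensional Gaussian basis functions psi_j(u) = exp(-(u - c_j)^2 / (2 h^2))."""
    return np.exp(-(u[:, None] - centers[None, :]) ** 2 / (2 * h ** 2))

def additive_design_matrix(X, centers):
    """Additive model f_theta(x) = sum_k sum_j theta_{k,j} psi_j(x^(k)):
    b' basis functions applied to each of the d coordinates, b' * d columns."""
    n, d = X.shape
    return np.hstack([gaussian_basis(X[:, k], centers) for k in range(d)])

rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.uniform(-1.0, 1.0, size=(n, d))
y = np.sum(np.sin(np.pi * X), axis=1) + rng.normal(0.0, 0.1, size=n)

centers = np.linspace(-1.0, 1.0, 10)        # b' = 10 centers per coordinate
Psi = additive_design_matrix(X, centers)    # shape (n, b' * d) = (200, 50)
theta_hat, *_ = np.linalg.lstsq(Psi, y, rcond=None)
print(Psi.shape, theta_hat.shape)
```

For comparison, a multiplicative model with the same $b' = 10$ one-dimensional basis functions over $d = 5$ coordinates would already need $10^5$ parameters, which is the exponential growth mentioned above.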

2. Kernel model

$f_\theta(x) = \sum_{i=1}^{n} \theta_i K(x, x_i)$, where $K(\cdot, \cdot)$ is a kernel function, e.g., the Gaussian kernel $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2h^2}\right)$ with bandwidth $h > 0$.

It is linear in parameters, but unlike the linear model discussed above, its basis functions depend on the training samples $\{x_i\}_{i=1}^{n}$ (a fitting sketch follows the list below).
+ The number of parameters is generally independent of the dimensionality d of the input.
+ It can be seen as a linear-in-parameter model.
+ It depends on the training samples, and thus its properties are a bit different from those of the ordinary linear model. That's why it's also known as a non-parametric model. A discussion of non-parametric models is beyond the scope of this post. Here we do not go into the detailed and complicated analysis, so unless otherwise stated, we consider the kernel model a specific case of the linear model.
+ It can capture and incorporate characteristics of the training samples. This might be useful, but on the other hand, it might be sensitive to noisy training samples.
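As a sketch of fitting the kernel model above (the Gaussian kernel, its bandwidth, and the small ridge term added for numerical stability are illustrative assumptions, not prescriptions of this post):

```python
import numpy as np

def gaussian_kernel(A, B, h=0.5):
    """K(x, x') = exp(-||x - x'||^2 / (2 h^2)) for all pairs of rows of A and B."""
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * h ** 2))

rng = np.random.default_rng(2)
n, d = 100, 2
X = rng.uniform(-1.0, 1.0, size=(n, d))
y = np.sin(np.pi * X[:, 0]) * np.cos(np.pi * X[:, 1]) + rng.normal(0.0, 0.1, size=n)

# Kernel model: f_theta(x) = sum_i theta_i K(x, x_i) -- one parameter per
# training sample, independent of the input dimensionality d.
K = gaussian_kernel(X, X)                    # (n, n) design matrix
lam = 1e-3                                   # small ridge term (an assumption)
theta_hat = np.linalg.solve(K + lam * np.eye(n), y)

def f_hat(X_new):
    return gaussian_kernel(np.atleast_2d(X_new), X) @ theta_hat

print(f_hat(np.array([0.2, -0.4])))          # prediction at a new input
```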

3. Non-linear model
Simply put, every model that is non-linear w.r.t. its parameters is called a non-linear model. For example, the hierarchical model (a multi-layer model such as the multi-layer perceptron or a neural network) is well known, given the popularity of deep learning.
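As a minimal sketch of a model that is non-linear in its parameters, here is a one-hidden-layer network fit by plain gradient descent on the squared error. The network size, learning rate, iteration count, and target function are all arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.uniform(-1.0, 1.0, size=(n, 1))
y = np.sin(np.pi * x)                        # noiseless target, for simplicity

# One-hidden-layer model: f_theta(x) = W2^T tanh(W1 x + b1) + b2.
# The parameters enter non-linearly through tanh, so least squares has
# no closed form here and we use gradient descent instead.
m = 20                                       # hidden units (arbitrary choice)
W1 = rng.normal(0.0, 1.0, (1, m)); b1 = np.zeros(m)
W2 = rng.normal(0.0, 1.0, (m, 1)); b2 = np.zeros(1)

lr = 0.05
for _ in range(2000):
    H = np.tanh(x @ W1 + b1)                 # (n, m) hidden activations
    pred = H @ W2 + b2                       # (n, 1) model outputs
    err = pred - y
    # Gradients of the mean squared error, via the chain rule.
    gW2 = H.T @ err / n; gb2 = err.mean(0)
    dH = (err @ W2.T) * (1.0 - H ** 2)
    gW1 = x.T @ dH / n; gb1 = dH.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print(float(np.mean(err ** 2)))              # final training error
```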

To be continued and updated later!