Bayesian multivariate linear regression

Bayesian approach to multivariate linear regression

Regression analysis
Part of a series on
Models
Linear regression Simple regression Polynomial regression General linear model
Generalized linear model Vector generalized linear model Discrete choice Binomial regression Binary regression Logistic regression Multinomial logistic regression Mixed logit Probit Multinomial probit Ordered logit Ordered probit Poisson
Multilevel model Fixed effects Random effects Linear mixed-effects model Nonlinear mixed-effects model
Nonlinear regression Nonparametric Semiparametric Robust Quantile Isotonic Principal components Least angle Local Segmented
Errors-in-variables
Estimation
Least squares Linear Non-linear
Ordinary Weighted Generalized Generalized estimating equation
Partial Total Non-negative Ridge regression Regularized
Least absolute deviations Iteratively reweighted Bayesian Bayesian multivariate Least-squares spectral analysis
Background
Regression validation Mean and predicted response Errors and residuals Goodness of fit Studentized residual Gauss–Markov theorem
Mathematics portal
v t e

In statistics, Bayesian multivariate linear regression is a Bayesian approach to multivariate linear regression, i.e. linear regression where the predicted outcome is a vector of correlated random variables rather than a single scalar random variable. A more general treatment of this approach can be found in the article MMSE estimator.

Details

Consider a regression problem where the dependent variable to be predicted is not a single real-valued scalar but an m-length vector of correlated real numbers. As in the standard regression setup, there are n observations, where each observation i consists of k−1 explanatory variables, grouped into a vector $\mathbf {x} _{i}$ of length k (where a dummy variable with a value of 1 has been added to allow for an intercept coefficient). This can be viewed as a set of m related regression problems for each observation i: ${\begin{aligned}y_{i,1}&=\mathbf {x} _{i}^{\mathsf {T}}{\boldsymbol {\beta }}_{1}+\epsilon _{i,1}\\&\;\;\vdots \\y_{i,m}&=\mathbf {x} _{i}^{\mathsf {T}}{\boldsymbol {\beta }}_{m}+\epsilon _{i,m}\end{aligned}}$ where the set of errors $\{\epsilon _{i,1},\ldots ,\epsilon _{i,m}\}$ are all correlated. Equivalently, it can be viewed as a single regression problem where the outcome is a row vector $\mathbf {y} _{i}^{\mathsf {T}}$ and the regression coefficient vectors are stacked next to each other, as follows: $\mathbf {y} _{i}^{\mathsf {T}}=\mathbf {x} _{i}^{\mathsf {T}}\mathbf {B} +{\boldsymbol {\epsilon }}_{i}^{\mathsf {T}}.$

The coefficient matrix B is a $k\times m$ matrix where the coefficient vectors ${\boldsymbol {\beta }}_{1},\ldots ,{\boldsymbol {\beta }}_{m}$ for each regression problem are stacked horizontally: $\mathbf {B} ={\begin{bmatrix}{\begin{pmatrix}\\{\boldsymbol {\beta }}_{1}\\\\\end{pmatrix}}\cdots {\begin{pmatrix}\\{\boldsymbol {\beta }}_{m}\\\\\end{pmatrix}}\end{bmatrix}}={\begin{bmatrix}{\begin{pmatrix}\beta _{1,1}\\\vdots \\\beta _{k,1}\end{pmatrix}}\cdots {\begin{pmatrix}\beta _{1,m}\\\vdots \\\beta _{k,m}\end{pmatrix}}\end{bmatrix}}.$

The noise vector ${\boldsymbol {\epsilon }}_{i}$ for each observation i is jointly normal, so that the outcomes for a given observation are correlated: ${\boldsymbol {\epsilon }}_{i}\sim N(0,{\boldsymbol {\Sigma }}_{\epsilon }).$

We can write the entire regression problem in matrix form as: $\mathbf {Y} =\mathbf {X} \mathbf {B} +\mathbf {E} ,$ where Y and E are $n\times m$ matrices. The design matrix X is an $n\times k$ matrix with the observations stacked vertically, as in the standard linear regression setup: $\mathbf {X} ={\begin{bmatrix}\mathbf {x} _{1}^{\mathsf {T}}\\\mathbf {x} _{2}^{\mathsf {T}}\\\vdots \\\mathbf {x} _{n}^{\mathsf {T}}\end{bmatrix}}={\begin{bmatrix}x_{1,1}&\cdots &x_{1,k}\\x_{2,1}&\cdots &x_{2,k}\\\vdots &\ddots &\vdots \\x_{n,1}&\cdots &x_{n,k}\end{bmatrix}}.$

The classical, frequentists linear least squares solution is to simply estimate the matrix of regression coefficients ${\hat {\mathbf {B} }}$ using the Moore-Penrose pseudoinverse: ${\hat {\mathbf {B} }}=(\mathbf {X} ^{\mathsf {T}}\mathbf {X} )^{-1}\mathbf {X} ^{\mathsf {T}}\mathbf {Y} .$

To obtain the Bayesian solution, we need to specify the conditional likelihood and then find the appropriate conjugate prior. As with the univariate case of linear Bayesian regression, we will find that we can specify a natural conditional conjugate prior (which is scale dependent).

Let us write our conditional likelihood as $\rho (\mathbf {E} |{\boldsymbol {\Sigma }}_{\epsilon })\propto |{\boldsymbol {\Sigma }}_{\epsilon }|^{-n/2}\exp \left(-{\tfrac {1}{2}}\operatorname {tr} \left(\mathbf {E} ^{\mathsf {T}}\mathbf {E} {\boldsymbol {\Sigma }}_{\epsilon }^{-1}\right)\right),$ writing the error $\mathbf {E}$ in terms of $\mathbf {Y} ,\mathbf {X} ,$ and $\mathbf {B}$ yields $\rho (\mathbf {Y} |\mathbf {X} ,\mathbf {B} ,{\boldsymbol {\Sigma }}_{\epsilon })\propto |{\boldsymbol {\Sigma }}_{\epsilon }|^{-n/2}\exp(-{\tfrac {1}{2}}\operatorname {tr} ((\mathbf {Y} -\mathbf {X} \mathbf {B} )^{\mathsf {T}}(\mathbf {Y} -\mathbf {X} \mathbf {B} ){\boldsymbol {\Sigma }}_{\epsilon }^{-1})),$

We seek a natural conjugate prior—a joint density $\rho (\mathbf {B} ,\Sigma _{\epsilon })$ which is of the same functional form as the likelihood. Since the likelihood is quadratic in $\mathbf {B}$ , we re-write the likelihood so it is normal in $(\mathbf {B} -{\hat {\mathbf {B} }})$ (the deviation from classical sample estimate).

Using the same technique as with Bayesian linear regression, we decompose the exponential term using a matrix-form of the sum-of-squares technique. Here, however, we will also need to use the Matrix Differential Calculus (Kronecker product and vectorization transformations).

First, let us apply sum-of-squares to obtain new expression for the likelihood: $\rho (\mathbf {Y} |\mathbf {X} ,\mathbf {B} ,{\boldsymbol {\Sigma }}_{\epsilon })\propto |{\boldsymbol {\Sigma }}_{\epsilon }|^{-(n-k)/2}\exp(-\operatorname {tr} ({\tfrac {1}{2}}\mathbf {S} ^{\mathsf {T}}\mathbf {S} {\boldsymbol {\Sigma }}_{\epsilon }^{-1}))|{\boldsymbol {\Sigma }}_{\epsilon }|^{-k/2}\exp(-{\tfrac {1}{2}}\operatorname {tr} ((\mathbf {B} -{\hat {\mathbf {B} }})^{\mathsf {T}}\mathbf {X} ^{\mathsf {T}}\mathbf {X} (\mathbf {B} -{\hat {\mathbf {B} }}){\boldsymbol {\Sigma }}_{\epsilon }^{-1})),$ $\mathbf {S} =\mathbf {Y} -\mathbf {X} {\hat {\mathbf {B} }}$

We would like to develop a conditional form for the priors: $\rho (\mathbf {B} ,{\boldsymbol {\Sigma }}_{\epsilon })=\rho ({\boldsymbol {\Sigma }}_{\epsilon })\rho (\mathbf {B} |{\boldsymbol {\Sigma }}_{\epsilon }),$ where $\rho ({\boldsymbol {\Sigma }}_{\epsilon })$ is an inverse-Wishart distribution and $\rho (\mathbf {B} |{\boldsymbol {\Sigma }}_{\epsilon })$ is some form of normal distribution in the matrix $\mathbf {B}$ . This is accomplished using the vectorization transformation, which converts the likelihood from a function of the matrices $\mathbf {B} ,{\hat {\mathbf {B} }}$ to a function of the vectors ${\boldsymbol {\beta }}=\operatorname {vec} (\mathbf {B} ),{\hat {\boldsymbol {\beta }}}=\operatorname {vec} ({\hat {\mathbf {B} }})$ .

Write $\operatorname {tr} ((\mathbf {B} -{\hat {\mathbf {B} }})^{\mathsf {T}}\mathbf {X} ^{\mathsf {T}}\mathbf {X} (\mathbf {B} -{\hat {\mathbf {B} }}){\boldsymbol {\Sigma }}_{\epsilon }^{-1})=\operatorname {vec} (\mathbf {B} -{\hat {\mathbf {B} }})^{\mathsf {T}}\operatorname {vec} (\mathbf {X} ^{\mathsf {T}}\mathbf {X} (\mathbf {B} -{\hat {\mathbf {B} }}){\boldsymbol {\Sigma }}_{\epsilon }^{-1})$

Let $\operatorname {vec} (\mathbf {X} ^{\mathsf {T}}\mathbf {X} (\mathbf {B} -{\hat {\mathbf {B} }}){\boldsymbol {\Sigma }}_{\epsilon }^{-1})=({\boldsymbol {\Sigma }}_{\epsilon }^{-1}\otimes \mathbf {X} ^{\mathsf {T}}\mathbf {X} )\operatorname {vec} (\mathbf {B} -{\hat {\mathbf {B} }}),$ where $\mathbf {A} \otimes \mathbf {B}$ denotes the Kronecker product of matrices A and B, a generalization of the outer product which multiplies an $m\times n$ matrix by a $p\times q$ matrix to generate an $mp\times nq$ matrix, consisting of every combination of products of elements from the two matrices.

Then ${\begin{aligned}&\operatorname {vec} (\mathbf {B} -{\hat {\mathbf {B} }})^{\mathsf {T}}({\boldsymbol {\Sigma }}_{\epsilon }^{-1}\otimes \mathbf {X} ^{\mathsf {T}}\mathbf {X} )\operatorname {vec} (\mathbf {B} -{\hat {\mathbf {B} }})\\&=({\boldsymbol {\beta }}-{\hat {\boldsymbol {\beta }}})^{\mathsf {T}}({\boldsymbol {\Sigma }}_{\epsilon }^{-1}\otimes \mathbf {X} ^{\mathsf {T}}\mathbf {X} )({\boldsymbol {\beta }}-{\hat {\boldsymbol {\beta }}})\end{aligned}}$ which will lead to a likelihood which is normal in $({\boldsymbol {\beta }}-{\hat {\boldsymbol {\beta }}})$ .

With the likelihood in a more tractable form, we can now find a natural (conditional) conjugate prior.

Conjugate prior distribution

The natural conjugate prior using the vectorized variable ${\boldsymbol {\beta }}$ is of the form: $\rho ({\boldsymbol {\beta }},{\boldsymbol {\Sigma }}_{\epsilon })=\rho ({\boldsymbol {\Sigma }}_{\epsilon })\rho ({\boldsymbol {\beta }}|{\boldsymbol {\Sigma }}_{\epsilon }),$ where $\rho ({\boldsymbol {\Sigma }}_{\epsilon })\sim {\mathcal {W}}^{-1}(\mathbf {V} _{0},{\boldsymbol {\nu }}_{0})$ and $\rho ({\boldsymbol {\beta }}|{\boldsymbol {\Sigma }}_{\epsilon })\sim N({\boldsymbol {\beta }}_{0},{\boldsymbol {\Sigma }}_{\epsilon }\otimes {\boldsymbol {\Lambda }}_{0}^{-1}).$

Posterior distribution

Using the above prior and likelihood, the posterior distribution can be expressed as: ${\begin{aligned}\rho ({\boldsymbol {\beta }},{\boldsymbol {\Sigma }}_{\epsilon }|\mathbf {Y} ,\mathbf {X} )\propto {}&|{\boldsymbol {\Sigma }}_{\epsilon }|^{-({\boldsymbol {\nu }}_{0}+m+1)/2}\exp {(-{\tfrac {1}{2}}\operatorname {tr} (\mathbf {V} _{0}{\boldsymbol {\Sigma }}_{\epsilon }^{-1}))}\\&\times |{\boldsymbol {\Sigma }}_{\epsilon }|^{-k/2}\exp {(-{\tfrac {1}{2}}\operatorname {tr} ((\mathbf {B} -\mathbf {B} _{0})^{\mathsf {T}}{\boldsymbol {\Lambda }}_{0}(\mathbf {B} -\mathbf {B} _{0}){\boldsymbol {\Sigma }}_{\epsilon }^{-1}))}\\&\times |{\boldsymbol {\Sigma }}_{\epsilon }|^{-n/2}\exp {(-{\tfrac {1}{2}}\operatorname {tr} ((\mathbf {Y} -\mathbf {XB} )^{\mathsf {T}}(\mathbf {Y} -\mathbf {XB} ){\boldsymbol {\Sigma }}_{\epsilon }^{-1}))},\end{aligned}}$ where $\operatorname {vec} (\mathbf {B} _{0})={\boldsymbol {\beta }}_{0}$ . The terms involving $\mathbf {B}$ can be grouped (with ${\boldsymbol {\Lambda }}_{0}=\mathbf {U} ^{\mathsf {T}}\mathbf {U}$ ) using: ${\begin{aligned}&\left(\mathbf {B} -\mathbf {B} _{0}\right)^{\mathsf {T}}{\boldsymbol {\Lambda }}_{0}\left(\mathbf {B} -\mathbf {B} _{0}\right)+\left(\mathbf {Y} -\mathbf {XB} \right)^{\mathsf {T}}\left(\mathbf {Y} -\mathbf {XB} \right)\\={}&\left({\begin{bmatrix}\mathbf {Y} \\\mathbf {U} \mathbf {B} _{0}\end{bmatrix}}-{\begin{bmatrix}\mathbf {X} \\\mathbf {U} \end{bmatrix}}\mathbf {B} \right)^{\mathsf {T}}\left({\begin{bmatrix}\mathbf {Y} \\\mathbf {U} \mathbf {B} _{0}\end{bmatrix}}-{\begin{bmatrix}\mathbf {X} \\\mathbf {U} \end{bmatrix}}\mathbf {B} \right)\\={}&\left({\begin{bmatrix}\mathbf {Y} \\\mathbf {U} \mathbf {B} _{0}\end{bmatrix}}-{\begin{bmatrix}\mathbf {X} \\\mathbf {U} \end{bmatrix}}\mathbf {B} _{n}\right)^{\mathsf {T}}\left({\begin{bmatrix}\mathbf {Y} \\\mathbf {U} \mathbf {B} _{0}\end{bmatrix}}-{\begin{bmatrix}\mathbf {X} \\\mathbf {U} \end{bmatrix}}\mathbf {B} _{n}\right)+\left(\mathbf {B} -\mathbf {B} _{n}\right)^{\mathsf {T}}\left(\mathbf {X} ^{\mathsf {T}}\mathbf {X} +{\boldsymbol {\Lambda }}_{0}\right)\left(\mathbf {B} -\mathbf {B} _{n}\right)\\={}&\left(\mathbf {Y} -\mathbf {X} \mathbf {B} _{n}\right)^{\mathsf {T}}\left(\mathbf {Y} -\mathbf {X} \mathbf {B} _{n}\right)+\left(\mathbf {B} _{0}-\mathbf {B} _{n}\right)^{\mathsf {T}}{\boldsymbol {\Lambda }}_{0}\left(\mathbf {B} _{0}-\mathbf {B} _{n}\right)+\left(\mathbf {B} -\mathbf {B} _{n}\right)^{\mathsf {T}}\left(\mathbf {X} ^{\mathsf {T}}\mathbf {X} +{\boldsymbol {\Lambda }}_{0}\right)\left(\mathbf {B} -\mathbf {B} _{n}\right),\end{aligned}}$ with $\mathbf {B} _{n}=\left(\mathbf {X} ^{\mathsf {T}}\mathbf {X} +{\boldsymbol {\Lambda }}_{0}\right)^{-1}\left(\mathbf {X} ^{\mathsf {T}}\mathbf {X} {\hat {\mathbf {B} }}+{\boldsymbol {\Lambda }}_{0}\mathbf {B} _{0}\right)=\left(\mathbf {X} ^{\mathsf {T}}\mathbf {X} +{\boldsymbol {\Lambda }}_{0}\right)^{-1}\left(\mathbf {X} ^{\mathsf {T}}\mathbf {Y} +{\boldsymbol {\Lambda }}_{0}\mathbf {B} _{0}\right).$

This now allows us to write the posterior in a more useful form: ${\begin{aligned}\rho ({\boldsymbol {\beta }},{\boldsymbol {\Sigma }}_{\epsilon }|\mathbf {Y} ,\mathbf {X} )\propto {}&|{\boldsymbol {\Sigma }}_{\epsilon }|^{-({\boldsymbol {\nu }}_{0}+m+n+1)/2}\exp {(-{\tfrac {1}{2}}\operatorname {tr} ((\mathbf {V} _{0}+(\mathbf {Y} -\mathbf {XB_{n}} )^{\mathsf {T}}(\mathbf {Y} -\mathbf {XB_{n}} )+(\mathbf {B} _{n}-\mathbf {B} _{0})^{\mathsf {T}}{\boldsymbol {\Lambda }}_{0}(\mathbf {B} _{n}-\mathbf {B} _{0})){\boldsymbol {\Sigma }}_{\epsilon }^{-1}))}\\&\times |{\boldsymbol {\Sigma }}_{\epsilon }|^{-k/2}\exp {(-{\tfrac {1}{2}}\operatorname {tr} ((\mathbf {B} -\mathbf {B} _{n})^{\mathsf {T}}(\mathbf {X} ^{T}\mathbf {X} +{\boldsymbol {\Lambda }}_{0})(\mathbf {B} -\mathbf {B} _{n}){\boldsymbol {\Sigma }}_{\epsilon }^{-1}))}.\end{aligned}}$

This takes the form of an inverse-Wishart distribution times a Matrix normal distribution: $\rho ({\boldsymbol {\Sigma }}_{\epsilon }|\mathbf {Y} ,\mathbf {X} )\sim {\mathcal {W}}^{-1}(\mathbf {V} _{n},{\boldsymbol {\nu }}_{n})$ and $\rho (\mathbf {B} |\mathbf {Y} ,\mathbf {X} ,{\boldsymbol {\Sigma }}_{\epsilon })\sim {\mathcal {MN}}_{k,m}(\mathbf {B} _{n},{\boldsymbol {\Lambda }}_{n}^{-1},{\boldsymbol {\Sigma }}_{\epsilon }).$

The parameters of this posterior are given by: $\mathbf {V} _{n}=\mathbf {V} _{0}+(\mathbf {Y} -\mathbf {XB_{n}} )^{\mathsf {T}}(\mathbf {Y} -\mathbf {XB_{n}} )+(\mathbf {B} _{n}-\mathbf {B} _{0})^{\mathsf {T}}{\boldsymbol {\Lambda }}_{0}(\mathbf {B} _{n}-\mathbf {B} _{0})$ ${\boldsymbol {\nu }}_{n}={\boldsymbol {\nu }}_{0}+n$ $\mathbf {B} _{n}=(\mathbf {X} ^{\mathsf {T}}\mathbf {X} +{\boldsymbol {\Lambda }}_{0})^{-1}(\mathbf {X} ^{\mathsf {T}}\mathbf {Y} +{\boldsymbol {\Lambda }}_{0}\mathbf {B} _{0})$ ${\boldsymbol {\Lambda }}_{n}=\mathbf {X} ^{\mathsf {T}}\mathbf {X} +{\boldsymbol {\Lambda }}_{0}$

References

^ Peter E. Rossi, Greg M. Allenby, Rob McCulloch. Bayesian Statistics and Marketing. John Wiley & Sons, 2012, p. 32.

This article includes a list of general references, but it lacks sufficient corresponding inline citations. Please help to improve this article by introducing more precise citations. (November 2010) (Learn how and when to remove this message)

Box, G. E. P.; Tiao, G. C. (1973). "8". Bayesian Inference in Statistical Analysis. Wiley. ISBN 0-471-57428-7.
Geisser, S. (1965). "Bayesian Estimation in Multivariate Analysis". The Annals of Mathematical Statistics. 36 (1): 150–159. JSTOR 2238083.
Tiao, G. C.; Zellner, A. (1964). "On the Bayesian Estimation of Multivariate Regression". Journal of the Royal Statistical Society. Series B (Methodological). 26 (2): 277–285. JSTOR 2984424.

Categories:

Misplaced Pages