Sunday, July 4, 2010

Machine Learning: A Probabilistic Approach


- A new textbook by Kevin Murphy (in preparation, to be published by MIT Press, c. December 2011)


- Introduction to Machine Learning
- Publications by Kevin Murphy
- UBC Machine Learning Reading Group (MLRG)
- Kevin Murphy @ UBC (great resource for students)
- Kevin Murphy / Teaching
- Graduate, CS 540 (Machine learning) Fall 2008
- Graduate, Stat 406 (Algorithms for Classification and Prediction)

- Causal models as conditional density models
- Product Partition Models for Modelling Changing Dependency Structure in Time Series



1 Introduction
1.1 Introduction
1.2 Classification
1.2.1 Example
1.2.2 Probabilistic output
1.2.3 Decision trees
1.2.4 K-nearest neighbor classifier
1.2.5 Logistic regression
1.2.6 Linear separability
1.2.7 Transforming the input variables
1.2.8 Kernel functions
1.2.9 Multinomial logistic regression
1.2.10 No free lunch theorem
1.2.11 Real world applications
1.3 Regression
1.3.1 Linear regression
1.3.2 Maximum likelihood and least squares
1.3.3 Real world applications
1.4 Unsupervised learning
1.4.1 Discovering clusters
1.4.2 Discovering latent factors
1.4.3 Discovering graph structure
1.4.4 Discovering relevant variables
1.4.5 Data compression
1.4.6 Data imputation
1.4.7 Anomaly detection
1.5 Other kinds of machine learning
1.5.1 Relational learning
1.5.2 Semi-supervised learning
1.5.3 Reinforcement learning
1.6 Overfitting
1.6.1 Examples of overfitting
1.6.2 The benefits of more data
1.6.3 Regularization
1.7 A probabilistic approach: what and why?
1.7.1 Degrees of Bayesianism
1.7.2 Representing uncertainty
1.7.3 Predicting the future
1.7.4 The small N, large D problem
1.7.5 Model selection
1.7.6 Discussion and further reading
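
To give a flavour of the material in Chapter 1, here is a minimal sketch of the K-nearest neighbor classifier mentioned in Section 1.2.4. This is my own illustrative code, not from the book; the toy data and the choice K=3 are made up.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, K=3):
    """Classify each test point by majority vote among its K nearest
    training points (Euclidean distance). Purely illustrative."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
        nearest = np.argsort(dists)[:K]               # indices of the K closest
        votes = y_train[nearest]
        preds.append(np.bincount(votes).argmax())     # majority class label
    return np.array(preds)

# Tiny made-up 2D dataset with two classes, then classify two new points.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
X_test  = np.array([[0.05, 0.1], [0.95, 1.05]])
print(knn_predict(X_train, y_train, X_test, K=3))     # expect [0 1]
```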

==============
I Probability
==============

2 Probability I
2.1 Introduction
2.2 What is probability?
2.3 Discrete random variables
2.4 Continuous random variables
2.5 Summaries of distributions
2.6 Some useful continuous distributions

3 Probability II
3.1 Introduction
3.2 Joint probability distributions
3.3 Functions of random variables
3.4 The exponential family *
3.5 Exchangeability *

4 The multivariate Gaussian and friends
4.1 Introduction
4.2 Definition and basic properties
4.3 Mahalanobis distance
4.4 Eigenanalysis of covariance matrices
4.5 Probabilistic inference
4.6 Information form *
4.7 Linear Gaussian systems
4.8 The multivariate Student T distribution *
4.9 Wishart distribution

5 Generative models for classification
5.1 Introduction
5.2 Discriminant analysis
5.3 Robust discriminant analysis
5.4 Naive Bayes classifiers
5.5 Log-sum-exp trick
5.6 Generative vs discriminative classifiers

6 Markov models
6.1 Introduction
6.2 Transition matrix
6.3 Language modeling
6.4 Stationary distribution
6.5 Google's PageRank algorithm
6.6 Conditions under which a stationary distribution exists: finite-state case
6.7 Conditions under which a stationary distribution exists: general case *
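
Chapter 6 covers stationary distributions and Google's PageRank. As a taster, here is a tiny power-iteration sketch of my own (the 3-state transition matrix is made up) that finds the stationary distribution of a Markov chain.

```python
import numpy as np

def stationary_distribution(T, n_iters=1000, tol=1e-10):
    """Power iteration: repeatedly apply pi <- pi @ T until convergence.
    T is a row-stochastic transition matrix; the result satisfies pi @ T = pi."""
    n = T.shape[0]
    pi = np.ones(n) / n                  # start from the uniform distribution
    for _ in range(n_iters):
        pi_new = pi @ T
        if np.max(np.abs(pi_new - pi)) < tol:
            break
        pi = pi_new
    return pi

# Made-up 3-state chain (rows sum to 1).
T = np.array([[0.9, 0.05, 0.05],
              [0.1, 0.8,  0.1 ],
              [0.2, 0.2,  0.6 ]])
print(stationary_distribution(T))
```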

7 Directed graphical models (Unfinished)
7.1 Introduction
7.2 Example: medical diagnosis
7.3 Plate notation
7.4 Inference
7.5 Using DGMs to make generative classifiers
7.6 Conditional independence properties of DAGs
7.7 Causal interpretation of DAGs *

8 Information theory
8.1 Introduction
8.2 Entropy
8.3 Mutual information
8.4 Relative entropy (KL divergence)
8.5 Jensen's inequality and its consequences
8.6 Maximum entropy principle
8.7 Data compression, probability and entropy
8.8 Minimum description length (MDL) principle
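
The central quantities of Chapter 8, entropy and relative entropy (KL divergence), take only a few lines to compute for discrete distributions. A quick sketch of my own, with made-up example distributions:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i log2 p_i, in bits (0 log 0 := 0)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def kl_divergence(p, q):
    """Relative entropy KL(p || q) = sum_i p_i log2(p_i / q_i), in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

print(entropy([0.5, 0.5]))                    # 1 bit: a fair coin
print(entropy([0.9, 0.1]))                    # about 0.47 bits: a biased coin
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # about 0.74 bits
```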

==============
II Statistics
==============

9 Maximum likelihood estimation
9.1 Introduction
9.2 MLE for linear regression
9.3 MLE for univariate Gaussian distribution
9.4 MLE for binomial and Bernoulli distributions
9.5 MLE for multinomial and multinoulli distributions
9.6 MLE for a naive Bayes classifier
9.7 MLE for Markov models
9.8 MLE for DGMs
9.9 MLE for multivariate Gaussian
9.10 MLE for discriminant analysis
9.11 Sufficient statistics
9.12 MLE for the exponential family *
9.13 Some problems with maximum likelihood estimation
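
As a trivial illustration of the kind of results derived in Chapter 9 (again my own sketch, not the book's code): the MLE for a univariate Gaussian is just the sample mean and the biased sample variance.

```python
import numpy as np

def gaussian_mle(x):
    """MLE for N(mu, sigma^2): mu_hat is the sample mean, sigma2_hat is the
    average squared deviation (note: divides by N, not N-1)."""
    x = np.asarray(x, dtype=float)
    mu_hat = x.mean()
    sigma2_hat = np.mean((x - mu_hat) ** 2)
    return mu_hat, sigma2_hat

# Made-up data: check that we roughly recover the generating parameters.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)
print(gaussian_mle(x))   # approximately (2.0, 9.0)
```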

10 Bayesian inference for single parameter models
10.1 Statistical inference
10.2 Bayesian model of human concept learning (inference of a discrete parameter)
10.3 Tossing coins (inference for the binomial rate parameter)
10.4 Summarizing posterior distributions
10.5 Why not just use MAP estimation? *
10.6 Priors
10.7 The marginal likelihood (evidence)
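
The coin-tossing example of Section 10.3 is the classic beta-binomial model. Here is a sketch of the conjugate update; the prior hyperparameters and the data are my own made-up choices, purely for illustration.

```python
from scipy.stats import beta

# Beta(a, b) prior on the coin's heads probability theta.
a_prior, b_prior = 2.0, 2.0
n_heads, n_tails = 7, 3        # observed coin tosses (made up)

# Conjugacy: Beta prior + binomial likelihood => Beta posterior.
a_post = a_prior + n_heads
b_post = b_prior + n_tails

posterior = beta(a_post, b_post)
print("posterior mean:", posterior.mean())             # 9/14, about 0.64
print("95% credible interval:", posterior.interval(0.95))
```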

11 Bayesian inference for multi parameter models
11.1 Bayesian fitting of discrete variable models
11.2 Bayesian fitting of univariate Gaussian
11.3 Bayesian fitting of multivariate Gaussian *
11.4 Bayesian fitting of discriminant analysis models
11.5 Bayesian fitting of a linear regression model
11.6 Bayesian fitting of the exponential family *

12 Decision theory
12.1 Introduction
12.2 Bayesian decision theory
12.3 Frequentist decision theory *
12.4 Empirical risk minimization *
12.5 The false positive vs false negative tradeoff

13 Bayesian model selection and checking
13.1 Introduction
13.2 Bayes factor
13.3 Testing equality of two binomial rate parameters
13.4 Testing equality of two Gaussian means (Bayesian T-tests) *
13.5 Testing for differentially expressed genes *
13.6 Testing for Markovianity
13.7 BIC approximation to log marginal likelihood
13.8 Jeffreys-Lindley paradox *
13.9 Model checking
13.10 Posterior-predictive p-values

14 Empirical and Hierarchical Bayes
14.1 Empirical Bayes
14.2 Empirical Bayes for linear regression
14.3 Hierarchical Bayes *
14.4 Multiple binomial variables
14.5 Multiple Gaussian variables
14.6 Bayesian fitting of Markov language models
14.7 Fitting multiple related linear regressions

================
III Computation
================

15 Unconstrained optimization
15.1 Introduction
15.2 Kinds of optimization problems
15.3 Local vs global optima
15.4 Computational complexity of finding an optimum *
15.5 Overview of algorithms
15.6 Examples of optimization problems
15.7 First-order methods
15.8 Second-order methods
15.9 Comparison of algorithms
15.10 Matlab software
15.11 Gradient-free unconstrained optimization

16 Expectation maximization (EM) algorithm
16.1 Introduction
16.2 Basic idea
16.3 Example: computing MLE for MVN from partially observed data
16.4 Example: MLE for the multivariate Student distribution *
16.5 EM theory *
16.6 Bound optimization / MM algorithms
16.7 EM variants *
16.8 Convergence rate of EM *

17 Deterministic algorithms for approximate inference (Unfinished)
17.1 Introduction
17.2 Laplace approximation
17.3 Variational Bayes *
17.4 Bits-back coding
17.5 Variational free energy
17.6 The mean field method
17.7 VB for univariate Gaussian
17.8 Forward or reverse KL?
17.9 Software for variational inference

18 Stochastic algorithms for approximate inference (Unfinished)
18.1 Introduction
18.2 Generating random samples from a univariate distribution
18.3 Importance sampling (Unfinished)
18.4 Markov Chain Monte Carlo (MCMC)
18.5 Gibbs sampling
18.6 Metropolis Hastings algorithm
18.7 Accuracy of MCMC
18.8 Convergence
18.9 More advanced MCMC algorithms
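
The Metropolis-Hastings algorithm of Chapter 18 fits in a dozen lines. Below is a random-walk MH sketch of my own (the target density, proposal width, and chain length are arbitrary) that samples from a density known only up to a constant.

```python
import numpy as np

def random_walk_mh(log_target, x0=0.0, n_samples=20_000, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal.
    log_target only needs to be known up to an additive constant."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        x_prop = x + step * rng.normal()                 # propose a move
        log_accept = log_target(x_prop) - log_target(x)  # acceptance ratio in log space
        if np.log(rng.uniform()) < log_accept:
            x = x_prop                                   # accept; otherwise keep current x
        samples[i] = x
    return samples

# Target: a standard normal, specified only up to a normalizing constant.
samples = random_walk_mh(lambda x: -0.5 * x**2)
print(samples[5000:].mean(), samples[5000:].std())       # roughly 0 and 1
```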

19 Constrained optimization
19.1 Introduction
19.2 Lagrange multipliers and the KKT conditions
19.3 Examples
19.4 Primal-dual problems *
19.5 Overview of algorithms
19.6 Conjugate duality

20 Discrete optimization
20.1 Introduction
20.2 Dynamic programming
20.3 Branch and bound
20.4 Hill climbing/coordinate ascent
20.5 Stochastic local search
20.6 Tabu search
20.7 Evolutionary algorithms
20.8 Matlab software

============================
IV More supervised learning
============================

21 Linear regression
21.1 Introduction
21.2 Maximum likelihood and least squares revisited
21.3 Robust regression *
21.4 Censored regression *
21.5 Ridge regression revisited
21.6 Bayesian linear regression (unknown σ²)
21.7 Empirical Bayes for linear regression
21.8 Variational Bayes for linear regression *
21.9 Kernel smoothers *
21.10 Semi-parametric regression *

22 Generalized linear models
22.1 Introduction
22.2 Multinomial logistic regression
22.3 Bayesian logistic regression
22.4 Empirical Bayes for logistic regression
22.5 Variational Bayes for logistic regression
22.6 Generalized linear models
22.7 Probit regression *

23 Neural networks
23.1 Introduction
23.2 Kinds of neural networks
23.3 Parameter estimation, or, how to train a neural net
23.4 Implementation *
23.5 Bayesian neural networks *

24 Variable selection
24.1 Introduction
24.2 Detecting marginally relevant variables
24.3 Bayesian variable selection
24.4 Subset selection
24.5 l1 regularization
24.6 Variants on l1 regularization *
24.7 Non-convex regularization
24.8 Automatic relevancy determination (ARD)

25 Sparse kernel machines
25.1 Introduction
25.2 Kernels
25.3 The kernel trick
25.4 Support vector machines
25.5 Experimental comparison of kernel methods
25.6 When should you use kernel methods?

26 Gaussian processes (Unfinished)
26.1 Introduction
26.2 GPs for regression
26.3 GPs for classification (Unfinished)

27 Trees, forests and boosting (Unfinished)
27.1 Introduction

=============================
V More unsupervised learning
=============================

28 Mixture models
28.1 Introduction
28.2 ML estimation
28.3 MAP estimation
28.4 Selecting the number of mixture components
28.5 VB for mixtures of Gaussians
28.6 Non-parametric density estimation
28.7 Using mixture models for classification

29 Latent factor models (Unfinished)
29.1 Introduction
29.2 Factor analysis
29.3 Mixture of factor analyzers
29.4 Principal components analysis (PCA)
29.5 EPCA
29.6 Variational EM for discrete PPCA
29.7 Variational Bayes for PPCA

30 Hidden Markov models
30.1 What are HMMs?
30.2 The main inferential tasks, and some applications
30.3 The forwards-backwards algorithm
30.4 The Viterbi algorithm
30.5 Forwards filtering, backwards sampling
30.6 Parameter estimation
30.7 HMM variants *
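
The forwards (filtering) half of Chapter 30's forwards-backwards algorithm is only a few lines. Here is an illustrative sketch of my own, run on a made-up two-state, two-symbol HMM.

```python
import numpy as np

def hmm_forward(pi, A, B, obs):
    """Forwards (filtering) pass for a discrete HMM.
    pi: initial state distribution, A[i, j] = P(z_t=j | z_{t-1}=i),
    B[i, k] = P(x_t=k | z_t=i), obs: sequence of observation indices.
    Returns filtered posteriors alpha[t, i] = P(z_t=i | x_{1:t})."""
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    a = pi * B[:, obs[0]]
    alpha[0] = a / a.sum()                       # normalize to avoid underflow
    for t in range(1, T):
        a = (alpha[t - 1] @ A) * B[:, obs[t]]    # predict, then update with the evidence
        alpha[t] = a / a.sum()
    return alpha

# Made-up 2-state, 2-symbol HMM.
pi = np.array([0.5, 0.5])
A  = np.array([[0.9, 0.1], [0.2, 0.8]])
B  = np.array([[0.8, 0.2], [0.3, 0.7]])
print(hmm_forward(pi, A, B, obs=[0, 0, 1, 1, 1]))
```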

31 State space models
31.1 Linear dynamical systems
31.2 Kalman filtering
31.3 Kalman smoothing
31.4 Parameter estimation *
31.5 Inference in nonlinear, non-Gaussian systems *
31.6 Particle filtering

=========================
VI More graphical models
=========================

32 Undirected graphical models (Unfinished)
32.1 Introduction
32.2 Representing a joint distribution with a DGM
32.3 Inference
32.4 Learning (parameter estimation)
32.5 MRFs
32.6 CI properties of UGMs
32.7 Representing the joint distribution of a UGM
32.8 Conditional random fields
32.9 Inference
32.10 Learning (parameter estimation)

33 Inference in graphical models (Unfinished)
33.1 Introduction
33.2 Message passing (belief propagation)
33.3 Message passing (belief propagation) on trees
33.4 Variable elimination
33.5 Junction tree algorithm
33.6 Computational hardness of discrete inference
33.7 Inference in Gaussian models
33.8 Approximate inference
33.9 Mean field revisited
33.10 Loopy belief propagation (Unfinished)
33.11 Expectation propagation (Unfinished)

34 Network structure learning (Unfinished)
34.1 Structure learning for trees
34.2 Structure learning for DAGMs
34.3 Structure learning for Gaussian UGMs
34.4 Structure learning for discrete UGMs
34.5 Dependency networks

===========================
VII Supplementary material
===========================

35 Notation
35.1 General notation
35.2 Graph theory notation
35.3 Probability distributions
35.4 List of commonly used abbreviations
35.5 Big-O notation

36 Non-Bayesian inference (frequentist statistics)
36.1 Introduction
36.2 Sampling distribution of estimators
36.3 Desirable properties of estimators
36.4 Confidence intervals
36.5 Hypothesis testing
36.6 Pathologies of frequentist statistics
36.7 Why isn't everyone a Bayesian?

37 Linear algebra
37.1 Introduction
37.2 Basic notation
37.3 Matrix Multiplication
37.4 Operations of matrices
37.5 Properties of matrices
37.6 Special kinds of matrices
37.7 Cholesky decomposition
37.8 QR decomposition
37.9 Eigenvalue decomposition (EVD)
37.10 Singular value decomposition (SVD)

38 Calculus
38.1 Single variable calculus
38.2 Matrix calculus