- A new textbook by Kevin Murphy (in preparation, 
  to be published by MIT Press, c. December 2011)
- Introduction to Machine Learning
- Publications by Kevin Murphy
- UBC Machine Learning Reading Group (MLRG)
- Kevin Murphy @ UBC (great resource for students)
- Kevin Murphy / Teaching
- Graduate, CS 540 (Machine learning) Fall 2008
- Graduate, Stat 406 (Algorithms for Classification and Prediction)
- Causal models as conditional density models
- Product Partition Models for Modelling 
  Changing Dependency Structure in Time Series

1 Introduction
  1.1 Introduction
  1.2 Classification
    1.2.1 Example
    1.2.2 Probabilistic output
    1.2.3 Decision trees
    1.2.4 K-nearest neighbor classifier
    1.2.5 Logistic regression
    1.2.6 Linear separability
    1.2.7 Transforming the input variables
    1.2.8 Kernel functions
    1.2.9 Multinomial logistic regression
    1.2.10 No free lunch theorem
    1.2.11 Real world applications
  1.3 Regression
    1.3.1 Linear regression
    1.3.2 Maximum likelihood and least squares
    1.3.3 Real world applications
  1.4 Unsupervised learning
    1.4.1 Discovering clusters
    1.4.2 Discovering latent factors
    1.4.3 Discovering graph structure
    1.4.4 Discovering relevant variables
    1.4.5 Data compression
    1.4.6 Data imputation
    1.4.7 Anomaly detection
  1.5 Other kinds of machine learning
    1.5.1 Relational learning
    1.5.2 Semi-supervised learning
    1.5.3 Reinforcement learning
  1.6 Overfitting
    1.6.1 Examples of overfitting
    1.6.2 The benefits of more data
    1.6.3 Regularization
  1.7 A probabilistic approach: what and why?
    1.7.1 Degrees of Bayesianism
    1.7.2 Representing uncertainty 
    1.7.3 Predicting the future
    1.7.4 The mall N, large D problem
    1.7.5 Model selection
    1.7.6 Discussion and further reading
  
==============
I Probability
==============
 
2 Probability I
  2.1 Introduction
  2.2 What is probability?
  2.3 Discrete random variables
  2.4 Continuous random variables
  2.5 Summaries of distributions
  2.6 Some useful continuous distributions
 
3 Probability II
  3.1 Introduction
  3.2 Joint probability distributions
  3.3 Functions of random variable
  3.4 The exponential family *
  3.5 Exchangeability *
 
4 The multivariate Gaussian and friends
  4.1 Introduction
  4.2 Definition and basic properties
  4.3 Mahalanobis distance
  4.4 Eigenanalysis of covariance matrices
  4.5 Probabilistic inference
  4.6 Information form *
  4.7 Linear Gaussian systems
  4.8 The multivariate Student T distribution *
  4.9 Wishart distribution
 
5 Generative models for classification
  5.1 Introduction
  5.2 Discriminant analysis
  5.3 Robust discriminant analysis
  5.4 Naive Bayes classifiers
  5.5 Log-sum-exp trick
  5.6 Generative vs discriminative classifiers
 
6 Markov models
  6.1 Introduction
  6.2 Transition matrix
  6.3 Language modeling
  6.4 Stationary distribution
  6.5 Google's PageRank algorithm
  6.6 Conditions under w/c a stationary distribution exists: finite-state case
  6.7 Conditions under w/c a stationary distribution exists: general case *
 
7 Directed graphical models (Unfinished)
  7.1 Introduction
  7.2 Example: medical diagnosis
  7.3 Plate notation
  7.4 Inference
  7.5 Using DGMs to make generative classifiers
  7.6 Conditional independence properties of DAGs
  7.7 Casual interpretation of DAGs *
 
8 Information theory
  8.1 Introduction
  8.2 Entropy
  8.3 Mutual information
  8.4 Relative entropy (KL divergence)
  8.5 Jensen's inequality and its consequences
  8.6 Maximum entropy principle
  8.7 Data compression, probability and entropy
  8.8 Minimum description length (MDL) principle
 
==============
II Statistics
==============
 
9 Maximum likelihood estimation
  9.1 Introduction
  9.2 MLE for linear regression
  9.3 MLE for univariate Gaussian distribution
  9.4 MLE for binomial and Bernoulli distributions
  9.5 MLE for multinomial and multinoulli distributions
  9.6 MLE for a naive Bayes classifier
  9.7 MLE for Markov models
  9.8 MLE for DGMs
  9.9 MLE for multivariate Gaussian
  9.10 MLE for discriminant analysis
  9.11 Sufficient statistics
  9.12 MLE for the exponential family *
  9.13 Some problems with maximum likelihood estimation
 
10 Bayesian inference for single parameter models
  10.1 Statistical inference
  10.2 Bayesian model of human concept learning 
     (inference of a discrete parameter)
  10.3 Tossing coins (inference for the binomial rate parameter)
  10.4 Summarizing posterior distributions
  10.5 Why not just use MAP estimations? *
  10.6 Priors
  10.7 The marginal likelihood (evidence)
 
11 Bayesian inference for multi parameter models
  11.1 Bayesian fitting of discrete variable models
  11.2 Bayesian fitting of univariate Gaussian
  11.3 Bayesian fitting of multivariate Gaussian *
  11.4 Bayesian fitting of discriminant analysis models
  11.5 Bayesian fitting of a linear regression model
  11.6 Bayesian fitting of the exponential family *
 
12 Decision theory
  12.1 Introduction
  12.2 Bayesian decision theory
  12.3 Frequentist decision theory *
  12.4 Empirical risk minimization *
  12.5 The false positive vs false negative tradeoff
 
13 Bayesian model selection and checking
  13.1 Introduction
  13.2 Bayes factor
  13.3 Testing equality of two Binomial rate parameters
  13.4 Testing equality of two Gaussian means (Bayesian T-tests) *
  13.5 Testing for differentially expressed genes *
  13.6 Testing for Markovianity
  13.7 BIC approximation to log marginal likelihood
  13.8 Jeffreys-Lindley paradox *
  13.9 Model checking
  13.10 Posterior-predictive p-values
 
14 Empirical and Hierarchical Bayes
  14.1 Empirical Bayes
  14.2 Empirical Bayes for linear regression
  14.3 Hierarchical Bayes *
  14.4 Multiple binomial variables
  14.5 Multiple Gaussian variables
  14.6 Bayesian fitting of Markov language models
  14.7 Fitting multiple related linear regressions
  
================
III Computation
================
 
15 Unconstrained optimization
  15.1 Introduction
  15.2 Kinds of optimization problems
  15.3 Local vs global optima
  15.4 Computational complexity of finding an optimum *
  15.5 Overview of algorithms
  15.6 Examples of optimization problems
  15.7 First-order methods
  15.8 Second-order methods
  15.9 Comparison of algorithms
  15.10 Mathlab software
  15.11 Gradient-free unconstrained optimization
 
16 Expectation maximization (EM) algorithm
  16.1 Introduction
  16.2 Basic idea
  16.3 Example: computing MLE for MVN from partially observed data
  16.4 Example: MLE for the multivariate Student distribution *
  16.5 EM theory *
  16.6 Bound optimization / MM algorithms
  16.7 EM variants *
  16.8 Convergence rate of EM *
 
17 Deterministic algorithms for approximate inference (Unfinished) 
  17.1 Introduction
  17.2 Laplace approximation
  17.3 Variational Bayes *
  17.4 Bits-back coding
  17.5 Variational free energy
  17.6 The mean field method
  17.7 VB for univariate Gaussian
  17.8 Forward or reverse KL?
  17.9 Software for variational inference
 
18 Stochastic algorithms for approximate inference (Unfinished)
  18.1 Introduction 
  18.2 Generating random samples from a univariate distribution
  18.3 Importance sampling (Unfinished)
  18.4 Markov Chain Monte Carlo (MCMC)
  18.5 Gibbs sampling
  18.6 Metropolis Hastings algorithm
  18.7 Accuracy of MCMC
  18.8 Convergence
  18.9 More advanced MCMC algorithms
 
19 Constrained optimization
  19.1 Introduction
  19.2 Lagrange multipliers and the KKT conditions
  19.3 Examples
  19.4 Primal-dual problems *
  19.5 Overview of algorithms
  19.6 Conjugate duality
 
20 Discrete optimization
  20.1 Introduction
  20.2 Dynamic programming
  20.3 Branch and bound
  20.4 Hill climbing/coordinate ascent
  20.5 Stochastic local search
  20.6 Tabu search
  20.7 Evolutionary algorithms
  20.8 Matlab software
 
============================
IV More supervised learning
============================
 
21 Linear regression
  21.1 Introduction
  21.2 Maximum likelihood and least squared revisited
  21.3 Robust regression *
  21.4 Censored regression *
  21.5 Ridge regression revisited
  21.6 Bayesian linear regression (unknown o^2)
  21.7 Empirical Bayes for linear regression
  21.8 Variational Bayes for linear regression *
  21.9 Kernel smoothers *
  21.10 Semi-parametric regression *
 
22 Generalized linear models
  22.1 Introduction
  22.2 Multinomial logistic regression
  22.3 Bayesian logistic regression
  22.4 Empirical Bayes for logistic regression
  22.5 Variational Bayes for logistic regression
  22.6 Generalized linear models
  22.7 Probit regression *
 
23 Neural networks
  23.1 Introduction
  23.2 Kinds of neural networks
  23.3 Parameter estimation, or, how to train a neural net
  23.4 Implementation *
  23.5 Bayesian neural networks *
 
24 Variable selection
  24.1 Introduction
  24.2 Detecting marginally relevant variables
  24.3 Bayesian variable selection
  24.4 Subset selection
  24.5 l1 regularization
  24.6 Variants on l1 regularization *
  24.7 Non-convex regularization
  24.8 Automatic relevancy determination (ARD) 
 
25 Sparse kernel machines
  25.1 Introduction
  25.2 Kernels
  25.3 The kernel trick
  25.4 Support vector machines
  25.5 Experimental comparison of kernel methods
  25.6 When should you use kernel methods?
 
26 Gaussian processes (Unfinished)
  26.1 Introduction
  26.2 GPs for regression
  26.3 GPs for classification (Unfinished)
 
27 Trees, forests and boosting (Unfinished)
  27.1 Introduction
 
=============================
V More unsupervised learning
=============================
 
28 Mixture models
  28.1 Introduction
  28.2 ML estimation
  28.3 MAP estimation
  28.4 Selecting the number of mixture components
  28.5 VB for mixtures of Gaussians
  28.6 Non-parametric density estimation
  28.7 Using mixture models for classification
 
29 Latent factor models (Unfinished)
  29.1 Introduction
  29.2 Factor analysis
  29.3 Mixture of factor analyzers
  29.4 Principal components analysis (PCA)
  29.5 EPCA
  29.6 Variational EM for discrete PPCA
  29.7 Variational Bayes for PPCA
 
30 Hidden Markov models
  30.1 What are HMMs?
  30.2 The main inferential tasks, and some applications
  30.3 The forwards-backwards algorithm
  30.4 The Viterbi algorithm
  30.5 Forwards filtering, backwards sampling
  30.6 Parameter estimation
  30.7 HMM variants *
 
31 State space models
  31.1 Linear dynamical systems
  31.2 Kalman filtering
  31.3 Kalman smoothing
  31.4 Paramter estimation *
  31.5 Inference in nonlinear, non-Gaussian systems *
  31.6 Particle filtering
 
========================= 
VI More graphical models
========================= 
 
32 Undirected graphical models (Unfinished)
  32.1 Introduction
  32.2 Representing a joint distribution with a DGM
  32.3 Inference
  32.4 Learning (parameter estimation)
  32.5 MRFs
  32.6 CI properties of UGMs
  32.7 Representing the joint distribution of a UGM
  32.8 Conditional random fields
  32.9 Inference
  32.10 Learning (parameter estimation)
 
33 Inference in graphical models (Unfinished)
  33.1 Introduction
  33.2 Message passing (belief propagation)
  33.3 Message passing (belief propagation) on trees
  33.4 Variable elimination
  33.5 Junction tree algorithm
  33.6 Computational hardness of discrete inference
  33.7 Inference in Gaussian models
  33.8 Approximate inference
  33.9 Mean field revisited
  33.10 Loopy belief propagation (Unfinished)
  33.11 Expectation propagation (Unfinished)
 
34 Network structure learning (Unfinished) 
  34.1 Structure learning for trees
  34.2 Structure learning for DAGMs
  34.3 Structure learning for Gaussian UGMs
  34.4 Structure learning for discrete UGMs
  34.5 Dependency networks
 
===========================
VII Supplementary material  
===========================
 
35 Notation
  35.1 General notation
  35.2 Graph theory notation
  35.3 Probability distributions
  35.4 List of commonly used abbreviations
  35.5 Big-O notation
 
36 Non-Bayesian inference (frequentist statistics)
  36.1 Introduction
  36.2 Sampling distribution of estimators
  36.3 Desirable properties of estimators
  36.4 Confidence intervals
  36.5 Hypothesis testing
  36.6 Pathologies of frequentist statistics
  36.7 Why isn't everyone a Bayesian?
 
37 Linear algebra
  37.1 Introduction
  37.2 Basic notation
  37.3 Matrix Multiplication
  37.4 Operations of matrices
  37.5 Properties of matrices
  37.6 Special kinds of matrices
  37.7 Cholesky decomposition
  37.8 QR decomposition
  37.9 Eigenvalue decomposition (EVD)
  37.10 Singular value decomposition (SVD)
 
38 Calculus
  38.1 Single variable calculus
  38.2 Matrix calculus