CAP 6610, Machine Learning, Fall 2020

Place: WEB
Time: MWF 4 (10:40-11:30 a.m.)

Instructor:
Prof. Arunava Banerjee
Office: CSE E336.
E-mail: arunava@cise.ufl.edu.
Phone: 505-1556.
Office hours: Tuesday 2:00 p.m.-4:00 p.m.

TA:
Anik Chattopadhyay
Office: CSE E335.
E-mail: xxx@cise.ufl.edu.
Office hours: Wednesday 2:00 p.m.-3:00 p.m.

TA:
Jingzhou Hu
Office: CSE E335.
E-mail: xxx@cise.ufl.edu.
Office hours: Thursday 2:00 p.m.-3:00 p.m.

Pre-requisites:

Textbook: Machine Learning: A Probabilistic Perspective, Murphy, ISBN-10: 0262018020.

Reference: Pattern Recognition and Machine Learning, Bishop, ISBN 0-387-31073-8.

Reference: Pattern Classification, 2nd Edition, Duda, Hart and Stork, John Wiley, ISBN 0-471-05669-3.

Tentative list of Topics to be covered

This list of topics is tentative at this juncture; the set of topics we end up covering may change due to class interest and/or time constraints.

Please return to this page at least once a week to check for updates in the table below.

Evaluation:

The final grade will be assigned on a curve.

Course Policies:

Academic Dishonesty: See http://www.dso.ufl.edu/judicial/honestybrochure.htm for Academic Honesty Guidelines. All academic dishonesty cases will be handled through the University of Florida Honor Court procedures as documented by the office of Student Services, P202 Peabody Hall. You may contact them at 392-1261 for a "Student Judicial Process: Guide for Students" pamphlet.

Students with Disabilities: Students requesting classroom accommodation must first register with the Dean of Students Office. The Dean of Students Office will provide documentation to the student who must then provide this documentation to the Instructor when requesting accommodation.

Announcements

Homeworks

List of Topics covered (recorded classroom + Zoom lectures)
Lectures | Topics | Additional Reading
1, 2, 3
  • Putative framework via example: NEST thermostat
  • Supervised, Unsupervised Learning.
  • Independent variable, covariates, feature vector vs Class label, dependent variable
  • Continuous versus nominal features
4, 5, 6
  • Putative framework continued
  • Concept class/ Hypothesis space: What do we fit
  • Testing on unseen data
  • Generalization, over-fitting to training data
  • Noisy features
  • Core areas: Probability theory, Optimization, Complexity
  • Started Mathematical Probability theory: Sample space of outcomes, Sigma algebra of events, Countably additive probability function
7, 8, 9
  • Mathematical Probability theory continued:
  • supremum, limit supremum, infimum and limit infimum of a countable sequence of events.
  • Random variables, Probability distribution function
  • Rick Durrett's book Probability: Theory and examples, can be found here.
10, 11, 12
  • Indicator RV, Simple RV
  • Expected value
  • Conditional probability, Independence of RVs
  • Probability density function
  • A more intuitive intro to probability theory, with common thms here.
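As a quick illustration of the expectation of a simple RV and of an indicator RV (a minimal sketch, not course material; the die example is my own):

```python
from fractions import Fraction

# Sample space: one roll of a fair die; uniform probability on outcomes.
omega = [1, 2, 3, 4, 5, 6]
p = {w: Fraction(1, 6) for w in omega}

def expectation(X):
    """E[X] = sum over outcomes of X(w) * P({w}) for a simple RV."""
    return sum(X(w) * p[w] for w in omega)

# Indicator RV of the event A = {roll is even}: E[1_A] = P(A).
A = {2, 4, 6}
ind_A = lambda w: 1 if w in A else 0
assert expectation(ind_A) == Fraction(1, 2)

# Expected value of the identity RV X(w) = w.
print(expectation(lambda w: w))  # 7/2
```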
13, 14, 15
  • Standard density functions: univariate/multi-variate Gaussian, univariate/multi-variate uniform.
  • The Risk Functional Approach
  • Risk functional, Loss function, Expected Loss.
  • Demonstration of Risk Functionals for Classification, 0/1 loss.
  • Regression = conditional expectation of the dependent variable.
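A tiny numerical check of the fact above that, under squared-error loss, the risk E[(Y - c)^2] is minimized at c = E[Y] (an illustrative sketch with a made-up discrete Y, not course material):

```python
import numpy as np

# A discrete RV Y with values and probabilities.
y_vals = np.array([0.0, 1.0, 4.0])
p = np.array([0.5, 0.3, 0.2])

def risk(c):
    """Expected squared-error loss E[(Y - c)^2] for a constant predictor c."""
    return float(np.sum(p * (y_vals - c) ** 2))

e_y = float(np.sum(p * y_vals))  # E[Y] = 1.1

# Minimize the risk over a fine grid of candidate constants.
grid = np.linspace(-2, 6, 8001)
best = grid[np.argmin([risk(c) for c in grid])]
assert abs(best - e_y) < 1e-2  # the minimizer is (numerically) E[Y]
```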
16, 17, 18
  • The Risk Functional Approach continued.
  • Demonstration of Risk Functionals for Regression; mean squared error.
  • Demonstration of Risk Functionals for Density Estimation.
  • Jensen's inequality, Kullback-Leibler divergence.
  • Expected risk versus Empirical risk
  • Empirical Risk Minimization principle
  • Generalization, over-fitting to training data
  • Proof of convergence of the perceptron learning algorithm can be found here.
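A small numerical illustration of the Kullback-Leibler divergence and its nonnegativity, which follows from Jensen's inequality (a sketch with arbitrary example distributions, not course material):

```python
import math

# Two discrete distributions on the same 3-point support.
p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

def kl(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i); >= 0 by Jensen's inequality."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

assert kl(p, q) >= 0.0        # Gibbs' inequality
assert abs(kl(p, p)) < 1e-12  # KL(p || p) = 0
```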
19, 20, 21
  • Intro to Computational Neuroscience and the real biological neuron.
  • McCulloch-Pitts neuron
  • Rosenblatt's Perceptron
  • Mistake bound theorem for the perceptron.
  • Energy function for perceptron learning and Stochastic Gradient Descent
  • All you need to know about optimization: video. Well, not quite, but Ben Recht's exposition is brilliant.
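The perceptron learning rule covered in these lectures can be sketched as follows (a minimal illustration on toy data of my own, not course material):

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Rosenblatt's perceptron on linearly separable data.
    X: (n, d) features, y: labels in {-1, +1}. Bias folded in by appending a 1."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:  # misclassified (or on the boundary)
                w += yi * xi        # the perceptron update rule
                mistakes += 1
        if mistakes == 0:           # converged: a full pass with no mistakes
            break
    return w

# Toy linearly separable data: label is the sign of x1 - x2.
X = np.array([[2.0, 0.0], [1.0, -1.0], [0.0, 2.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
assert all(yi * (w @ np.append(xi, 1.0)) > 0 for xi, yi in zip(X, y))
```

By the mistake bound theorem, the number of updates is at most (R/gamma)^2, so convergence here is fast.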
22, 23, 24
  • Multi-layer perceptrons and Error back-propagation
  • On-line learning (stochastic gradient descent), epochs, etc.
  • Deep learning: Convolutional networks, Recurrent neural networks
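Error back-propagation for a one-hidden-layer MLP can be sketched in a few lines (an illustrative toy on XOR with hyperparameters of my own choosing, not course material):

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: not linearly separable, so a single perceptron fails but an MLP succeeds.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)   # hidden layer, 8 tanh units
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)   # sigmoid output unit
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for _ in range(10000):
    h = np.tanh(X @ W1 + b1)            # forward pass
    out = sig(h @ W2 + b2)
    d_out = out - y                     # output error (sigmoid + cross-entropy)
    d_h = (d_out @ W2.T) * (1 - h**2)   # back-propagate through tanh
    W2 -= lr * h.T @ d_out / 4; b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_h / 4;  b1 -= lr * d_h.mean(axis=0)

assert np.all((out > 0.5).astype(float).ravel() == y.ravel())
```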
25, 26, 27
  • Convex functions, Convex sets
  • Thm: for a convex function, every local minimum is a global minimum
  • Convex optimization: Inequality and Equality constraints
  • Lagrange Multipliers
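A standard worked example of the Lagrange multiplier technique for an equality constraint (a textbook-style sketch, not taken from the lectures):

```latex
\min_{x,y}\; x^2 + y^2 \quad \text{s.t.}\quad x + y = 1,
\qquad
\mathcal{L}(x,y,\lambda) = x^2 + y^2 - \lambda\,(x + y - 1).
```

Setting the partial derivatives to zero gives 2x - lambda = 0 and 2y - lambda = 0, so x = y = lambda/2; the constraint then forces x = y = 1/2 and lambda = 1, with optimal value 1/2.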
28, 29, 30
  • Constrained optimization; objective, equality and inequality constraints
  • Lagrange multiplier technique for equality constraints.
  • Convex optimization problems, the Lagrangian, the Lagrange dual problem.
  • Primal form of the maximal margin classifier
  • Support Vector Machines: Margin maximization, the constrained optimization problem
  • Primal formulation of the SVM
  • Slack-variable version of the SVM for linearly non-separable data, hinge loss.
  • Here is a link to the book Convex Optimization by Boyd and Vandenberghe.
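The hinge-loss (slack-variable) view of the SVM can be illustrated by subgradient descent on the regularized hinge loss (a minimal sketch on toy data of my own, not the course's formulation or code):

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, epochs=200, lr=0.1):
    """Linear SVM by subgradient descent on the objective
    (lam/2)||w||^2 + (1/n) sum_i max(0, 1 - y_i (w.x_i + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (w @ xi + b)
            # Subgradient of the hinge term is -y_i x_i when the margin < 1.
            grad_w = lam * w - (yi * xi if margin < 1 else 0.0)
            grad_b = -yi if margin < 1 else 0.0
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy linearly separable data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = svm_sgd(X, y)
assert all(np.sign(w @ xi + b) == yi for xi, yi in zip(X, y))
```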
31, 32, 33
  • Convex optimization problems, the Lagrangian, the Lagrange dual problem.
  • Dual of a convex program
  • Slater's condition and the duality gap.
  • Weak and Strong duality
  • Dual formulation of the SVM
    34, 35, 36
    • Kernel Trick
    • Mercer's condition
    • markov, Chebychev, Chernoff, Hoeffding's inequality
    • Proof Sketch for VC bound on generalization error.
    • Shattering, VC-dimension, margin etc. Here is the paper that proves the VC-dimension for given margin/diameter.
    • VC bound on generalization error (statement w/o proof)
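The kernel trick can be verified numerically for the degree-2 polynomial kernel, whose explicit feature map is small enough to write out (a minimal sketch, not course material):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D:
    k(x, z) = (x.z)^2 = <phi(x), phi(z)>."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

def k(x, z):
    """Kernel evaluation: no explicit feature map needed."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
# The kernel computes the inner product in feature space implicitly.
assert abs(k(x, z) - phi(x) @ phi(z)) < 1e-9
```

For high-degree kernels the explicit map grows combinatorially, which is exactly why the implicit kernel evaluation matters.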
37, 38, 39
  • Unsupervised learning; Roadmap for the rest of the semester
  • Maximum likelihood and Bayesian parameter estimation
  • Maximum likelihood principle (ML), Maximum a posteriori (MAP)
  • Gaussian distribution, 1-D case, Multi-D case, ML estimates for mean and variance
  • Bias of an estimator
  • Conjugate priors; Bernoulli/Binomial and its conjugate (Beta)
  • Conjugate prior for the Multinomial is the Dirichlet
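The bias of the ML variance estimator for a 1-D Gaussian can be seen by simulation: E[sigma2_hat] = ((n-1)/n) sigma^2, not sigma^2 (a sketch with arbitrary simulation parameters, not course material):

```python
import numpy as np

rng = np.random.default_rng(0)

def ml_gaussian(x):
    """ML estimates for a 1-D Gaussian: sample mean and the
    variance estimate that divides by n (not n-1)."""
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()
    return mu, sigma2

# Average the ML variance estimate over many small samples from N(0, 1).
n, trials = 5, 200_000
samples = rng.normal(0.0, 1.0, size=(trials, n))
sigma2_hats = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)
print(sigma2_hats.mean())  # close to (n-1)/n = 0.8, not 1.0
```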
40, 41, 42
  • Principal component analysis.
  • K-Means Clustering.
  • Mixture of Gaussians and Expectation Maximization.
  • Here are D'Souza's notes.
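K-means (Lloyd's algorithm) alternates an assignment step and a centroid-update step; a minimal sketch on synthetic blobs (my own toy data, not course material):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

# Two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(5, 0.2, (20, 2))])
centers, labels = kmeans(X, k=2)
assert len(set(labels[:20])) == 1 and len(set(labels[20:])) == 1
assert labels[0] != labels[20]
```

EM for a mixture of Gaussians has the same alternating structure, with soft responsibilities in place of hard assignments.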