Course Logistics and Introduction
CS771: Introduction to Machine Learning
Piyush Rai

Course Logistics
▪ Course Name: Introduction to Machine Learning – CS771
▪ An introductory course – meant to be your first introduction to the subject
▪ Usually 3 lectures every week, in the form of videos (hosted on mooKIT)
  ▪ Think of these as Mon/Wed/Fri lectures in the usual classroom setting
▪ mooKIT URL: https://hello.iitk.ac.in/cs771a/ (CC id and password to be used for login)
▪ An additional discussion session every Monday, 6pm-7pm (via YouTube Live)
▪ All material will be posted on the mooKIT page for the course
▪ Q/A and announcements on Piazza. Please sign up

Course Team
▪ Soumya Banerjee (soumyab@cse.iitk.ac.in)
▪ Shivam Bansal (sbansal@cse.iitk.ac.in)
▪ Dhanajit Brahma (dhanajit@cse.iitk.ac.in)
▪ Amit Chandak (amitch@cse.iitk.ac.in)
▪ Neeraj Matiyali (neermat@cse.iitk.ac.in)
▪ Pratik Mazumder (pratikm@cse.iitk.ac.in)
▪ Avik Pal (avikpal@cse.iitk.ac.in)
▪ Niravkumar Panchal (nirav@cse.iitk.ac.in)
▪ Hemant Sadana (soumyab@cse.iitk.ac.in)
▪ Rahul Sharma (rsharma@cse.iitk.ac.in)
▪ Piyush Rai (piyush@cse.iitk.ac.in)

Workload and Grading Policy
▪ 4 homework assignments (theory + programming), worth 50%
  ▪ Theory part: derivations/analysis
  ▪ Programming part: implement/use ML algos and analyze the results. Must be done in Python (learn it if not already familiar)
  ▪ Must be typeset in LaTeX (learn it if not already familiar)
  ▪ To be submitted via Gradescope (login details will be provided)
▪ Quizzes and exams (mid-sem and end-sem), worth 50%
  ▪ Will be held online – details later
▪ Exact break-up of the individual components will be announced in a few days
Python: https://www.geeksforgeeks.org/python-programming-language/
LaTeX: www.sharelatex.com/blog/latex-guides/beginners-tutorial.html, www.overleaf.com/learn/latex/Tutorials

Textbook and References
▪ Many excellent texts exist, but none is "required".
Some of them include:
▪ Different books may vary in terms of
  ▪ The set of topics covered
  ▪ Flavor (e.g., classical statistics, deep learning, probabilistic/Bayesian, theory)
  ▪ Terminology and notation (beware of this especially)
▪ We will provide you the reading material from the relevant sources

Course Goals
Credit: Rishika Agarwal (EE, graduated 2017)

Course Real Goals
▪ An introduction to the foundations of machine learning models and algos
▪ Focus on developing the ability to
  ▪ Understand the underlying principles behind ML models and algos
  ▪ Understand how to implement and evaluate them
  ▪ Understand/develop intuition for choosing the right ML model/algo for your problem
▪ (Hopefully) inspire you to work on and learn more about ML
▪ Not an introduction to popular software frameworks and libraries, such as scikit-learn, PyTorch, TensorFlow, etc.
  ▪ You can explore those once you have some understanding of various ML techniques

Introduction to Machine Learning

Machine Learning (ML)
▪ Designing algorithms that ingest data and learn a model of the data
▪ The learned model can be used to
  ▪ Detect patterns/structures/themes/trends etc. in the data
  ▪ Make predictions about future data and make decisions
▪ Modern ML algorithms are heavily "data-driven"
  ▪ No need to pre-define and hard-code all the rules (infeasible/impossible anyway)
  ▪ The rules are not "static"; they can adapt as the ML algo ingests more and more data

ML: From What It Does to How It Does It
▪ ML enables intelligent systems to be data-driven rather than rule-driven
▪ How: by supplying training data and building statistical models of the data
▪ Pictorial illustrations of ML models for binary classification: a linear classifier, and a probabilistic classifier outputting P("cat"|image) and P("dog"|image) (the statistical models)

Overfitting = Bad ML
▪ Doing perfectly on the training data is not good enough
▪ A good ML model must generalize well to unseen (test) data
▪ Simpler models should be preferred over more complex ones!

ML Applications Abound
Picture courtesy: gizmodo.com, rcdronearena.com, www.wiseyak.com, www.charlesdong.com

Key Enablers for Modern ML
▪ Availability of large amounts of data to train ML models
▪ Increased computing power (e.g., GPUs)

ML: Some Success Stories
Picture courtesy: https://news.microsoft.com/
▪ Automatic program correction
  Example from "Compilation error repair: for the student programs, from the student programs", Ahmed et al (2018)
▪ ML based colorimetry for water quality assessment
  ▪ Take an uncontaminated water sample
  ▪ Spike it with known concentrations of various compounds (e.g., lead, iron, fluoride, etc.)
  ▪ Dip a test strip (one square to measure each compound) in the contaminated water for some time
  ▪ Take a picture of the strip using a phone camera to capture how the colors have changed
  ▪ Train an ML model to predict the concentration levels of the various compounds based on the color levels in the images
  (Work being done at IITK in collaboration with two startups – Earthface Analytics Pvt Ltd and Kritsnam Technologies Pvt Ltd)

Good ML Systems Should Be Fair and Unbiased
▪ Good ML should not just be about getting high accuracies
▪ We should also ensure that ML models are fair and unbiased
  ▪ An image captioning system should not always assume a specific gender in examples like the above
  ▪ We don't want a self-driving car that is more likely to hit black people than white people
  ▪ We don't want a predictive policing system that labels people "criminals" or "not criminals" using facial features
▪ A lot of recent focus on fairness and transparency of ML systems
Picture courtesy: Bhargava and Forsyth (2019), https://www.thestranger.com/, Xiaolin Wu and Xi Zhang, "Automated Inference on Criminality Using Face Images"

Looking Back Before We Start: History of ML
(Timeline figure, ending with human-like text generators such as GPT-3)

Next Class
▪ Various flavors of ML problems
▪ Data and features
▪ Basic mathematical operations on data and features

Warming-up to Machine Learning, Data and Features
CS771: Introduction to Machine Learning
Piyush Rai

Plan for Today
▪ Types of ML problems
▪ Typical workflow of ML problems
▪ Various perspectives on ML problems
▪ Data and features
▪ Some basic operations on data and features

Keep in Mind: ML Is Like an Exam
▪ It's the performance on the D-day that matters
▪ In an exam, our success is measured by how well we do on the questions in the test (not on the questions we practiced on)
▪ Likewise, in ML, the success of the learned model is measured by how well it predicts/fits the future test data (not the training data)
▪ In machine learning, generalization performance on the test data matters (plus, of course, issues such as fairness)

A Loose Taxonomy of ML
▪ Supervised Learning: learning using labeled data
  ▪ "Labeled" means that, during training, the corresponding output is available for each input (i.e., the learner is explicitly told that a cat image is of a cat)
  ▪ Some examples of supervised learning problems: classification, regression, ranking
▪ Unsupervised Learning: learning using unlabeled data
  ▪ Some examples of unsupervised learning problems: clustering, dimensionality reduction, (unsupervised) probability density estimation
▪ Reinforcement Learning (RL)
  ▪ RL doesn't use "labeled" or "unlabeled" data in the traditional sense! In RL, an agent learns via its interactions with an environment
▪ Many other specialized flavors of ML also exist, some of which include
  ▪ Semi-supervised Learning
  ▪ Active Learning
  ▪ Transfer Learning
  ▪ Multitask Learning
  ▪ Imitation Learning (somewhat related to RL)
  ▪ Zero-Shot Learning
  ▪ Few-Shot Learning
  ▪ Continual Learning

A Typical Supervised Learning Workflow
Note: This example is for the problem of binary classification, a supervised learning problem
▪ Training phase: labeled training data (images labeled "cat" or "dog") go through "feature" extraction, and the ML algorithm then outputs a "model"
  ▪ Feature extraction converts raw inputs into a numeric representation that the ML algo can understand and work with. More on feature extraction later.
▪ Test phase: a test image goes through the same "feature" extraction, and the cat-vs-dog prediction model outputs the predicted label (cat/dog)
▪ Q: Is feature extraction done "manually" as a pre-processing step before the ML algo starts working? Can't we "automate" this part? Can't we "learn" good features directly from the raw inputs?
▪ A: Indeed, Deep Learning algos do precisely that (feature + model learning). More on Deep Learning later.
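The supervised workflow above (extract features, train a model, predict on test inputs) can be sketched in a few lines of plain Python. This is a toy illustration, not material from the course: the "images", the hand-crafted features (ear length, snout length), and the simple nearest-class-mean classifier are all made-up stand-ins.

```python
# Toy sketch of the supervised workflow: feature extraction -> train -> predict.
# The feature names and data below are hypothetical, for illustration only.

def extract_features(image):
    # Stand-in for real feature extraction: each "image" is already a dict
    # of hand-crafted measurements; we turn it into a numeric vector.
    return [image["ear_length"], image["snout_length"]]

def train(features, labels):
    # A simple class-mean ("prototype") classifier: average the feature
    # vectors of each class to get one centroid per class.
    grouped = {}
    for x, y in zip(features, labels):
        grouped.setdefault(y, []).append(x)
    return {y: [sum(col) / len(col) for col in zip(*xs)]
            for y, xs in grouped.items()}

def predict(model, x):
    # Assign the test input to the class with the nearest centroid.
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(model, key=lambda y: dist2(model[y], x))

# Toy labeled training data (hypothetical measurements)
train_images = [{"ear_length": 6.0, "snout_length": 2.0},   # cat-like
                {"ear_length": 5.5, "snout_length": 2.5},   # cat-like
                {"ear_length": 3.0, "snout_length": 7.0},   # dog-like
                {"ear_length": 3.5, "snout_length": 6.5}]   # dog-like
train_labels = ["cat", "cat", "dog", "dog"]

model = train([extract_features(im) for im in train_images], train_labels)
test_image = {"ear_length": 5.8, "snout_length": 2.2}
print(predict(model, extract_features(test_image)))  # -> cat
```

Note how the test image goes through exactly the same feature extraction as the training data, which is the key point of the workflow diagram.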
Picture credits: https://www.pinclipart.com/, http://www.pngtree.com

A Typical Unsupervised Learning Workflow
Note: This example is for the problem of data clustering, an unsupervised learning problem
▪ Unlabeled data go through "feature" extraction, and the ML algorithm then outputs a clustering
▪ Q: Does unsupervised learning also have a test phase? That is, can we also predict the cluster of a new test input?
▪ A: Yes. In this example, given a new "test" cat/dog image, we can assign it to the cluster with the closer centroid

A Typical Reinforcement Learning Workflow
▪ We wish to teach an agent an optimal policy for some task
▪ The agent does the following repeatedly:
  ▪ Senses/observes the environment (its state at time t)
  ▪ Takes an action based on its current policy
  ▪ Receives a reward for that action
  ▪ Updates its policy
▪ The agent's goal is to maximize its overall reward
▪ There IS supervision, not explicit (as in supervised learning) but rather implicit (feedback based)

ML: Some Perspectives

Geometric Perspective
▪ Recall that feature extraction converts inputs into a numeric representation
▪ Basic fact: inputs in ML problems can often be represented as points or vectors in some vector space
▪ Doing ML on such data can thus be seen from a geometric view
▪ Regression: a supervised learning problem. The goal is to model the relationship between the input (x) and a real-valued output (y), e.g., x: sleep hours, y: grumpiness (scale of 0-100). This is akin to a line or curve fitting problem
▪ Classification: a supervised learning problem. The goal is to predict which of two or more classes an input belongs to.
This is akin to learning a linear/nonlinear separator for the inputs
Pics from: https://learningstatisticswithr.com/book/regression.html, https://maxstat.de/
▪ Clustering: an unsupervised learning problem. The goal is to group the inputs into a few clusters based on their similarities with each other
  ▪ Q: Clustering looks like classification to me. Is there any difference?
  ▪ A: Yes. In clustering, we don't know the labels. The goal is to separate the inputs without any labeled "supervision"
▪ Dimensionality Reduction: an unsupervised learning problem. The goal is to compress the size of each input without losing much of the information present in the data

Perspective as Function Approximation
▪ Supervised learning ("predict output given input") can usually be thought of as learning a function f that maps each input to the corresponding output
▪ Unsupervised learning ("model/compress inputs") can also usually be thought of as learning a function f that maps each input to a compact representation (harder, since we don't know the labels in this case)
▪ Reinforcement learning can also be seen as doing function approximation

Perspective as Probability Estimation
▪ Supervised learning ("predict output given input") can be thought of as estimating the conditional probability of each possible output given an input, e.g., p(label="cat" | image)
▪ Unsupervised learning ("model/compress inputs") can be thought of as estimating the probability density of the inputs (harder, since we don't know the labels in this case)
  ▪ Don't worry if this doesn't make much sense as of now ☺ The basic idea is to learn the underlying data distribution using the unlabeled inputs; there are many ways to do this, as we will see later
▪ Reinforcement learning can also be seen as estimating probability densities

Data and Features
▪ Features represent semantics of the inputs.
Being able to extract good features is key to the success of ML algos.
▪ ML algos require a numeric feature representation of the inputs
▪ Features can be obtained using one of two approaches
  ▪ Approach 1: extracting/constructing features manually from the raw inputs
  ▪ Approach 2: learning the features from the raw inputs
▪ Approach 1 is what we will focus on primarily for now
▪ Approach 2 is what is followed in Deep Learning algorithms (will see later)
▪ Approach 1 is not as powerful as Approach 2, but it is still used widely

Example: Feature Extraction for Text Data
▪ Consider some text data consisting of the following sentences:
  ▪ John likes to watch movies
  ▪ Mary likes movies too
  ▪ John also likes football
▪ We want to construct a feature representation for these sentences
▪ Here is a "bag-of-words" (BoW) feature representation of these sentences: each sentence is represented as a binary vector (each feature is a binary value, denoting the presence or absence of a word). BoW is also called the "unigram" representation.
▪ Note: BoW is just one of many ways of doing feature extraction for text data. It is not the most optimal one and has various flaws (can you think of some?), but it often works reasonably well

Example: Feature Extraction for Image Data
▪ A very simple feature extraction approach for image data is flattening: e.g., a 7x7 image (49 pixels) becomes a vector of pixel intensities
▪ A histogram of visual patterns is another popular feature extraction method for images
▪ Flattening and histogram based methods destroy the spatial information in the image, but often still work reasonably well
▪ Many other manual feature extraction techniques have been developed in the computer vision and image processing communities (SIFT, HoG, and others)
Pic credit: cat.uab.cat/Research/object-recognition

Feature Selection
▪ Not all the extracted features may be relevant for learning the model (some may even confuse the learner)
▪ Feature selection (a step after feature extraction) can be used to identify the features that matter, and discard the others, for more effective learning
  ▪ Example: predicting body-mass index (BMI) from {age, gender, height, weight, eye color}. Calculating BMI from this data doesn't require ML, but this simple example illustrates the idea of feature selection ☺
▪ Many techniques exist – some based on intuition, some based on algorithmic principles (we will visit feature selection later)
▪ Feature selection is more common in supervised learning but can also be done for unsupervised learning
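The bag-of-words representation described earlier is simple enough to sketch directly: build a vocabulary from the corpus, then map each sentence to a binary presence/absence vector over that vocabulary. This is a minimal illustration using the three example sentences from the slides; the function names are ours, not from any library.

```python
# Minimal bag-of-words (BoW) sketch for the three example sentences.
sentences = ["John likes to watch movies",
             "Mary likes movies too",
             "John also likes football"]

# Vocabulary: all distinct (lowercased) words, in a fixed sorted order
vocab = sorted({w.lower() for s in sentences for w in s.split()})

def bow_vector(sentence, vocab):
    # Binary vector: 1 if the vocabulary word appears in the sentence, else 0
    words = {w.lower() for w in sentence.split()}
    return [1 if w in words else 0 for w in vocab]

print(vocab)
# -> ['also', 'football', 'john', 'likes', 'mary', 'movies', 'to', 'too', 'watch']
print(bow_vector(sentences[0], vocab))
# -> [0, 0, 1, 1, 0, 1, 1, 0, 1]
```

One flaw the slides hint at is visible here: word order is lost ("John likes Mary" and "Mary likes John" would get identical vectors), and the binary encoding also discards word counts.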