ML Theory: Ace Your Midterm (Weeks 1-8)
A comprehensive guide to the key theoretical concepts of Machine Learning, essential for acing your midterm exam, covering weeks 1 through 8. This resource is structured to align with the course syllabus, offering clear explanations, formulas, and examples to solidify your understanding.
Week 1. Probability Theory Foundations
Delve into the fundamental concepts of probability theory, the bedrock of machine learning. Understanding these concepts is crucial for grasping how algorithms make predictions and decisions. Probability theory provides the framework for quantifying uncertainty and making informed inferences from data.
Concepts:
- Sample space, event space, random events:
- Sample Space: Think of it as the universe of all possible outcomes of an experiment. For example, when tossing a coin, the sample space is {Heads, Tails}.
- Event Space: This is a collection of subsets from the sample space, representing specific outcomes or events you're interested in. For instance, getting an even number when rolling a die is an event, with the event space being {2, 4, 6}.
- Random Events: These are events where the outcome is uncertain, and we can assign probabilities to them.
- Probability function: The probability function, denoted as P(A), assigns a number between 0 and 1 to each event A, representing the likelihood of that event occurring. A probability of 0 means the event is impossible, while a probability of 1 means the event is certain.
- Conditional probability: Conditional probability, denoted as P(A|B), is the probability of event A occurring given that event B has already occurred. It's like narrowing your focus to a specific subset of the sample space.
- Independence and conditional independence:
- Independence: Two events are independent if the occurrence of one doesn't affect the probability of the other. Mathematically, this means P(A and B) = P(A)P(B).
- Conditional Independence: Events can be conditionally independent given another event C. This means that A and B are independent only when we know that C has occurred.
Key Formulas:
- Elementary probability: P(A) = (#favorable cases)/(#all cases). This is your basic definition, perfect for simple scenarios.
- Multiplication rule: P(A and B) = P(A)P(B|A). Use this when you want to find the probability of two events happening together.
- Chain rule for multiple events: Extends the multiplication rule to more than two events, allowing you to break down complex probabilities into smaller, manageable pieces.
- Total probability: P(B) = ∑ P(B|Ai)P(Ai). This is useful when you want to find the probability of an event B by considering all possible ways it can occur, given a set of mutually exclusive and exhaustive events Ai.
- Bayes theorem: P(A|B) = P(B|A)P(A)/P(B). A cornerstone of machine learning, Bayes' Theorem allows you to update your beliefs about an event A given evidence B.
Example:
If tossing a fair die, what's the probability of getting an even number? Sample space: 1,2,3,4,5,6}, favorable cases, P(even) = 3/6 = 0.5. This example illustrates the basic application of probability to a simple scenario.
Week 2. Random Variables and Distributions
Now, let's explore random variables and distributions, which are essential for modeling and analyzing data in machine learning. Understanding these concepts will enable you to represent data in a probabilistic manner and make predictions based on observed patterns. Random variables and their associated distributions are the building blocks for many machine learning algorithms.
Concepts:
- Random variables: Assigns numbers to outcomes (X: # of heads in 3 coin tosses). A random variable is simply a variable whose value is a numerical outcome of a random phenomenon. It bridges the gap between real-world events and mathematical analysis.
- Discrete random variables & PMF:
Discrete random variables can only take on a finite number of values or a countably infinite number of values. Each value has an associated probability, described by the Probability Mass Function (PMF). Examples include:
- Bernoulli distributions: Models the probability of success or failure in a single trial (e.g., flipping a coin).
- Binomial distributions: Models the number of successes in a fixed number of independent trials (e.g., the number of heads in 10 coin flips).
- Categorical distributions: Models the probability of different categories or classes (e.g., the color of a randomly selected ball from a bag).
- Continuous random variables & PDF: Continuous random variables can take on any value within a given range. Their probability is described by the Probability Density Function (PDF). Examples include:
- Uniform distributions: All values within a given range are equally likely.
- Gaussian distributions (Normal distributions): A bell-shaped curve that is ubiquitous in statistics and machine learning. It's often used to model real-valued data.
- Expectation, variance, covariance: These are key measures that describe the characteristics of random variables.
- E[X] = mean of X: The expectation or mean represents the average value of the random variable.
- Var[X] = E[X²]-(E[X])²: The variance measures the spread or dispersion of the random variable around its mean.
- Cov(X,Y) = E[XY]-E[X]E[Y]: The covariance measures how two random variables change together. A positive covariance indicates that they tend to increase or decrease together, while a negative covariance indicates that they tend to move in opposite directions.
Key Results:
- For PMF: ∑x p(x) = 1. For PDF: ∫ p(x) dx = 1. These equations state that the total probability over all possible values of a random variable must equal 1.
- Linearity: E[X+Y] = E[X]+E[Y]; E[aX]=aE[X]. Expectation is a linear operator, which simplifies calculations.
- Independence: If X, Y are independent, Cov(X,Y)=0, Var[X+Y]=Var[X]+Var[Y]. If two random variables are independent, their covariance is zero, and the variance of their sum is the sum of their variances.
Example:
Tossing 3 coins, X = # heads (Binomial), calculate E[X], Var[X]. This example demonstrates how to calculate the expected value and variance of a binomial random variable.
Week 3. Information Theory
Let's move onto information theory, which provides a framework for quantifying and measuring information. In machine learning, information theory is used for feature selection, model evaluation, and understanding the complexity of data. It provides essential tools for building efficient and effective models.
Concepts:
- Entropy H(X): Entropy, denoted as H(X), measures the uncertainty or randomness associated with a random variable. It tells you how much information is needed, on average, to describe the outcome of the random variable. H(X) = -∑ p(x) log₂ p(x)
- Conditional entropy H(X|Y): Conditional entropy measures the uncertainty of a random variable X given that you know the value of another random variable Y. It quantifies the remaining uncertainty about X after observing Y.
- Mutual information IG(X;Y): Mutual information measures the amount of information that two random variables share. It tells you how much knowing one variable reduces the uncertainty about the other.
- Joint entropy H(X,Y): Joint entropy measures the uncertainty of two random variables considered together. It's the amount of information needed to describe both variables simultaneously.
Key Results:
- 0 ≤ H(X) ≤ log₂ n: Entropy is always non-negative and is bounded by the logarithm of the number of possible outcomes.
- IG(X;Y) = H(X)-H(X|Y) ≥ 0: Mutual information is always non-negative, indicating that knowing one variable can never increase the uncertainty about the other.
- H(X,Y) = H(X)+H(Y|X): The joint entropy of two variables is equal to the entropy of one variable plus the conditional entropy of the other variable given the first.
Example:
For a biased coin (P(head)=0.9), H(X)= -(0.9)log₂(0.9)-(0.1)log₂(0.1) ≈ 0.47. This example demonstrates how to calculate the entropy of a biased coin, illustrating that the more skewed the probabilities, the lower the entropy.
Week 4. Decision Trees & ID3 Algorithm
Decision trees are a powerful and intuitive machine learning algorithm used for both classification and regression tasks. The ID3 algorithm is a classic method for constructing decision trees. Understanding how decision trees work is essential for building models that can make accurate predictions based on data.
Concepts:
- Decision Trees: Decision trees are hierarchical structures that split data based on attribute values. Each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label (for classification) or a predicted value (for regression).
- ID3 Algorithm: ID3 (Iterative Dichotomiser 3) is a greedy algorithm that builds a decision tree by recursively selecting the best attribute to split the data at each node. The best attribute is chosen based on information gain, which measures how much the entropy of the data is reduced by splitting on that attribute.
- Pruning & Overfitting:
- Pruning: A technique used to reduce the size of a decision tree by removing branches or nodes that do not improve the model's accuracy on unseen data. This helps to prevent overfitting.
- Overfitting: Occurs when a model learns the training data too well, including noise and irrelevant details. This results in poor performance on new, unseen data.
- Handling continuous/multi-valued attributes: Decision trees can handle both continuous and categorical attributes. Continuous attributes are typically handled by creating binary splits based on threshold values. Multi-valued attributes can be split into multiple branches, one for each value.
Example:
Play Tennis dataset (Tom Mitchell): Use weather attributes to decide if tennis will be played, split by humidity/sunny/windy, etc. This is a classic example that demonstrates how decision trees can be used to make predictions based on weather conditions.
Common Exercise Types:
- Calculating IG of each split.
- Implementing ID3 for small datasets.
Week 5-6. Bayesian Classification: Naive Bayes & Joint Bayes
Dive into Bayesian classification, a probabilistic approach to classification that uses Bayes' theorem to predict the class of a given data point. Understanding Naive Bayes and Joint Bayes classifiers is essential for building models that can handle uncertainty and make accurate predictions based on prior knowledge.
Concepts:
- Bayes rule and conditional independence: Bayes' rule provides a way to update our beliefs about an event based on new evidence. It relates the conditional probability of an event A given event B to the conditional probability of B given A. Conditional independence is a key assumption in Naive Bayes, where features are assumed to be independent of each other given the class label.
- Naive Bayes: Naive Bayes is a simple and efficient classifier that assumes that the features are conditionally independent given the class label. Despite its simplicity, Naive Bayes can perform surprisingly well in many real-world applications.
- Joint Bayes: Joint Bayes is a more complex classifier that does not assume feature independence. It estimates the joint probability distribution of all features and the class label. However, Joint Bayes can be computationally expensive and requires a large amount of data to train.
- MAP vs ML hypotheses:
- MAP (Maximum A Posteriori) hypotheses: The hypothesis that maximizes the posterior probability, which is the probability of the hypothesis given the data. MAP incorporates prior knowledge about the hypothesis.
- ML (Maximum Likelihood) hypotheses: The hypothesis that maximizes the likelihood function, which is the probability of the data given the hypothesis. ML does not incorporate prior knowledge.
- Error analysis, sample complexity:
- Error analysis: The process of identifying the types of errors that a classifier makes and understanding the reasons for those errors. This can help to improve the classifier's performance.
- Sample complexity: The amount of data required to train a classifier to achieve a certain level of accuracy. Sample complexity depends on the complexity of the classifier and the difficulty of the classification task.
Example:
Spam filtering with NB: Calculate P(spam|words) assuming all words independently indicate spam. This example demonstrates how Naive Bayes can be used to classify emails as spam or not spam based on the words they contain.
Week 7. Parameter Estimation (MLE & MAP)
Let's explore parameter estimation, which is the process of estimating the parameters of a statistical model based on observed data. Understanding Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) estimation is crucial for building models that can accurately capture the underlying patterns in data.
Concepts:
- Likelihood function: The likelihood function is the probability of observing the data given a particular set of parameters. It measures how well a particular set of parameters explains the observed data.
- MLE: MLE finds the parameter values that maximize the likelihood function. It's the most common method for estimating parameters in statistical models. MLE is a frequentist approach that does not incorporate prior knowledge about the parameters.
- MAP: MAP finds the parameter values that maximize the posterior probability, which is the product of the likelihood function and a prior distribution over the parameters. MAP is a Bayesian approach that incorporates prior knowledge about the parameters.
Example:
Coin toss, estimating P(head): MLE is #heads/total tosses. This example demonstrates how to use MLE to estimate the probability of getting heads in a coin toss based on the observed data.
Exercises, Examples, Practice
Where possible, refer to cited exercises for practice, proving properties, and applying formulas. If you need worked-out examples, please specify problems to solve!
Reference Reading
- [Machine Learning Book by Tom Mitchell]
- [CS229 Probability Theory Reviews]
- [Foundations of Statistical NLP, Manning & Schutze]
If you require a more detailed or specific explanation for any concept or want worked-out examples, please reply here with the topic.
Assignee: RazvanIonita03 Labels: documentation
For further reading on Machine Learning Theory, visit this trusted website.