1. Probability language and sample spaces
Probability is the mathematics of uncertainty. A probability model has three parts:
A sample space $\Omega$ of possible outcomes
A collection of events built from subsets of $\Omega$
A probability measure $P$ assigning likelihoods to events
Outcomes, events, and sample spaces
An outcome is one possible result of an experiment.
A sample space $\Omega$ is the set of all outcomes.
An event is a subset of the sample space.
Examples:
Tossing a coin once: $\Omega = \{\text{H}, \text{T}\}$
Rolling a die: $\Omega = \{1,2,3,4,5,6\}$
Drawing a card from a standard deck: $\Omega$ has 52 outcomes
Event operations
If $A$ and $B$ are events:
$A^c$ means "not $A$"
$A \cup B$ means $A$ or $B$ or both
$A \cap B$ means $A$ and $B$
$A \setminus B$ means $A$ occurs and $B$ does not
Disjoint events satisfy:
$$A \cap B = \emptyset$$
Equally likely outcomes
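If all outcomes in a finite sample space are equally likely, then
$$P(A) = \frac{|A|}{|\Omega|} = \frac{\text{number of outcomes in } A}{\text{number of outcomes in } \Omega}$$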
This is the classical model used in many introductory problems.
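A minimal sketch of the classical model, assuming two fair six-sided dice as the experiment (the dice example and the helper name `classical_probability` are illustrative, not from the text above). It counts favorable outcomes and divides by the size of the sample space:

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs from two fair six-sided dice (36 outcomes).
omega = list(product(range(1, 7), repeat=2))

def classical_probability(event, sample_space):
    """Return |A| / |Omega|, valid only when all outcomes are equally likely."""
    favorable = [w for w in sample_space if event(w)]
    return Fraction(len(favorable), len(sample_space))

# Event: the two dice sum to 7.
p_sum_is_7 = classical_probability(lambda w: w[0] + w[1] == 7, omega)
print(p_sum_is_7)  # 1/6
```

Using `Fraction` keeps the answer exact, which makes it easy to compare against a hand count.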
Probability as a number
For any event $A$:
$$0 \le P(A) \le 1$$
Interpretation:
$P(A)=0$ means impossible in the model
$P(A)=1$ means certain in the model
Values near 0 or 1 mean very unlikely or very likely
2. Counting methods
Counting is often the first step in a probability problem. If you can count outcomes correctly, the probability usually follows.
Product rule
If a process has $m$ choices for one step and $n$ choices for a second step, then the total number of outcomes is
$$m \cdot n$$
More generally, multiply the number of choices at each step.
Factorial
The factorial of a positive integer $n$ is
$$n! = n(n-1)(n-2)\cdots 2 \cdot 1$$
with
$$0! = 1$$
Permutations
The number of ways to arrange $r$ objects chosen from $n$ distinct objects is
$${}_nP_r = \frac{n!}{(n-r)!} = n(n-1)\cdots(n-r+1)$$
Use permutations when order matters.
Example:
Assigning gold, silver, and bronze to 10 runners
Combinations
The number of ways to choose $r$ objects from $n$ distinct objects is
$$\binom{n}{r} = \frac{n!}{r!\,(n-r)!}$$
Use combinations when order does not matter.
Example:
Choosing 3 students from a class of 12
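The two examples above can be checked with Python's standard library (`math.perm` and `math.comb`, available from Python 3.8). This is only a quick sketch; the variable names are illustrative:

```python
import math

# Order matters: gold, silver, and bronze among 10 runners.
podiums = math.perm(10, 3)      # 10 * 9 * 8 = 720

# Order does not matter: 3 students chosen from a class of 12.
committees = math.comb(12, 3)   # 220

# Each unordered choice of r objects corresponds to r! orderings.
orderings_per_choice = math.perm(12, 3) // math.comb(12, 3)   # 3! = 6

print(podiums, committees, orderings_per_choice)
```

The last line shows the link between the two formulas: dividing the permutation count by $r!$ gives the combination count.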
With and without replacement
With replacement: an object can be used again, so counts often stay the same from step to step
Without replacement: counts change after each selection
Common counting patterns
Arrangements with repeated objects
If $n$ objects contain repeats of sizes $n_1,n_2,\dots,n_k$ with $n_1+n_2+\cdots+n_k=n$, the number of distinct arrangements is
$$\frac{n!}{n_1!\,n_2!\cdots n_k!}$$
Stars and bars
The number of nonnegative integer solutions to
$$x_1 + x_2 + \cdots + x_k = n$$
is
$$\binom{n+k-1}{k-1}$$
This is useful for allocation and composition problems.
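A short sketch that checks both counting patterns. The function names and the test cases (the letters of MISSISSIPPI, and $n=5$, $k=3$) are illustrative assumptions; the brute-force count is there only to confirm the closed form on a small case.

```python
import math
from itertools import product

def arrangements_with_repeats(counts):
    """n! / (n_1! n_2! ... n_k!) where n is the sum of the group sizes."""
    total = math.factorial(sum(counts))
    for c in counts:
        total //= math.factorial(c)
    return total

def stars_and_bars(n, k):
    """Number of nonnegative integer solutions of x_1 + ... + x_k = n."""
    return math.comb(n + k - 1, k - 1)

def brute_force_solutions(n, k):
    """Count the same solutions by direct enumeration (small n and k only)."""
    return sum(1 for xs in product(range(n + 1), repeat=k) if sum(xs) == n)

# MISSISSIPPI: 1 M, 4 I, 4 S, 2 P.
print(arrangements_with_repeats([1, 4, 4, 2]))            # 34650
print(stars_and_bars(5, 3), brute_force_solutions(5, 3))  # 21 21
```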
3. Axioms and basic rules
Probability obeys a few core rules. Most other formulas come from these.
Kolmogorov axioms
For any event $A$:
$P(A) \ge 0$
$P(\Omega) = 1$
If $A_1, A_2, \dots$ are pairwise disjoint, then
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$
Complement rule
$$P(A^c) = 1 - P(A)$$
This is often the fastest way to compute a probability, especially for "at least one" events.
Addition rule
For any two events $A$ and $B$:
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
If $A$ and $B$ are disjoint:
$$P(A \cup B) = P(A) + P(B)$$
Inclusion-exclusion
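For three events:
$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$$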
This extends to more events with alternating signs.
Union bound
For events $A_1,\dots,A_n$:
$$P\left(\bigcup_{i=1}^{n} A_i\right) \le \sum_{i=1}^{n} P(A_i)$$
The union bound is useful for quick upper bounds.
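The rules in this section can be sanity-checked on a small equally likely space. The sketch below assumes a single fair die and three hypothetical events $A$, $B$, $C$ chosen only for illustration:

```python
from fractions import Fraction

omega = set(range(1, 7))          # one fair die

def P(event):
    """Classical probability on the die: |event| / |omega|."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}                     # even
B = {4, 5, 6}                     # at least 4
C = {1, 2}                        # at most 2

complement_ok = P(omega - A) == 1 - P(A)
addition_ok = P(A | B) == P(A) + P(B) - P(A & B)
inclusion_exclusion_ok = P(A | B | C) == (
    P(A) + P(B) + P(C)
    - P(A & B) - P(A & C) - P(B & C)
    + P(A & B & C)
)
union_bound_ok = P(A | B | C) <= P(A) + P(B) + P(C)

print(complement_ok, addition_ok, inclusion_exclusion_ok, union_bound_ok)
```

Here `|` and `&` are Python's set union and intersection, which mirror $\cup$ and $\cap$.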
4. Conditional probability and Bayes' theorem
Conditional probability updates probability after new information is observed.
Conditional probability
If $P(B) > 0$, then
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
This is the probability of $A$ given that $B$ has occurred.
Rearranging gives the multiplication rule:
$$P(A \cap B) = P(A \mid B)\,P(B)$$
and also, when $P(A) > 0$,
$$P(A \cap B) = P(B \mid A)\,P(A)$$
Independence
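Events $A$ and $B$ are independent if
$$P(A \cap B) = P(A)\,P(B)$$
Equivalent forms (when the conditioning event has positive probability):
$$P(A \mid B) = P(A), \qquad P(B \mid A) = P(B)$$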
Independence is a structural statement, not just a numerical coincidence in one example.
Law of total probability
If $B_1,\dots,B_n$ are disjoint, exhaustive events with $P(B_i)>0$, then
$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)$$
This is often used with partitions such as machine types, disease status, or urn colors.
Bayes' theorem
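For $P(B)>0$:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
If $\{A_i\}$ is a partition, then
$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_j P(B \mid A_j)\,P(A_j)}$$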
This is the standard framework for diagnosis and classification problems.
Diagnostic interpretation
In a test problem:
Prior probability: $P(A)$
Likelihood: $P(B \mid A)$
Evidence: $P(B)$
Posterior probability: $P(A \mid B)$
Bayes' theorem converts a test result into an updated belief about the underlying cause.
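A minimal sketch of the diagnostic calculation. The numbers (1% prevalence, 95% sensitivity, 5% false positive rate) and the function name `posterior` are hypothetical, chosen only to show how the prior, likelihood, and evidence combine:

```python
from fractions import Fraction

def posterior(prior, likelihood, false_positive_rate):
    """P(A | B) for a positive test B, using the partition {A, not A}.

    prior               = P(A)
    likelihood          = P(B | A)
    false_positive_rate = P(B | not A)
    """
    # Evidence P(B) from the law of total probability.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical values, for illustration only.
p = posterior(Fraction(1, 100), Fraction(95, 100), Fraction(5, 100))
print(p, float(p))  # 19/118, about 0.161
```

Even with an accurate test, the posterior stays well below the sensitivity when the prior is small, which is why $P(A \mid B)$ and $P(B \mid A)$ must not be swapped.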
5. Random variables
A random variable maps outcomes to numbers.
Discrete random variables
A discrete random variable takes countably many values.
Its probability mass function (pmf) is
$$p_X(x) = P(X = x)$$
Properties:
$p_X(x) \ge 0$
$\sum_x p_X(x) = 1$
The cumulative distribution function (cdf) is
$$F_X(x) = P(X \le x) = \sum_{t \le x} p_X(t)$$
Continuous random variables
A continuous random variable has probability density function (pdf) $f_X(x)$.
Properties:
$f_X(x) \ge 0$
$\int_{-\infty}^{\infty} f_X(x)\,dx = 1$
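Probabilities come from areas:
$$P(a \le X \le b) = \int_a^b f_X(x)\,dx$$
For continuous variables, every single point has probability zero:
$$P(X = x) = 0$$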
CDF for continuous and discrete variables
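For any random variable:
$$F_X(x) = P(X \le x)$$
For continuous random variables:
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt, \qquad f_X(x) = \frac{d}{dx} F_X(x)$$
where differentiation is valid when the CDF is smooth enough.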
Transformations
If $Y = g(X)$, then find the distribution of $Y$ by:
Relating $Y$ to $X$
Solving for the relevant interval in terms of $X$
Integrating or summing over that interval
For strictly monotone, differentiable transformations of continuous variables, the pdf can often be obtained via the change-of-variables formula:
$$f_Y(y) = f_X\big(g^{-1}(y)\big)\,\left|\frac{d}{dy}\,g^{-1}(y)\right|$$
6. Expectation, variance, and moments
Expectation is the long-run average value of a random variable.
Expected value
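For a discrete random variable $X$:
$$E[X] = \sum_x x\,p_X(x)$$
For a continuous random variable:
$$E[X] = \int_{-\infty}^{\infty} x\,f_X(x)\,dx$$
More generally, for any function $g$:
$$E[g(X)] = \sum_x g(x)\,p_X(x)$$
or
$$E[g(X)] = \int_{-\infty}^{\infty} g(x)\,f_X(x)\,dx$$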
This is the law of the unconscious statistician.
Linearity of expectation
Expectation is linear even when variables are dependent:
$$E[aX + bY] = a\,E[X] + b\,E[Y]$$
For sums:
$$E\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i]$$
This is one of the most useful tools in probability.
Variance
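Variance measures spread around the mean:
$$\mathrm{Var}(X) = E\big[(X-\mu)^2\big]$$
where $\mu = E[X]$.
Equivalent form:
$$\mathrm{Var}(X) = E[X^2] - (E[X])^2$$
Standard deviation:
$$\sigma_X = \sqrt{\mathrm{Var}(X)}$$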
Covariance
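For random variables $X$ and $Y$:
$$\mathrm{Cov}(X,Y) = E\big[(X - E[X])(Y - E[Y])\big]$$
Equivalent form:
$$\mathrm{Cov}(X,Y) = E[XY] - E[X]\,E[Y]$$
If $X$ and $Y$ are independent, then
$$\mathrm{Cov}(X,Y) = 0$$
The converse is false in general: zero covariance does not imply independence.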
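A standard counterexample, sketched with exact arithmetic: take $X$ uniform on $\{-1,0,1\}$ and $Y = X^2$ (this particular pair is an illustrative choice, not from the text above). The covariance is zero, yet $Y$ is completely determined by $X$.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1}; Y = X**2 is a function of X.
pmf_x = {x: Fraction(1, 3) for x in (-1, 0, 1)}

def expect(g):
    """E[g(X)] computed from the pmf of X."""
    return sum(g(x) * p for x, p in pmf_x.items())

e_x = expect(lambda x: x)             # E[X]  = 0
e_y = expect(lambda x: x**2)          # E[Y]  = 2/3
e_xy = expect(lambda x: x * x**2)     # E[XY] = E[X^3] = 0
cov = e_xy - e_x * e_y

# Independence would require P(X=0, Y=0) = P(X=0) * P(Y=0).
p_joint = pmf_x[0]                    # X = 0 forces Y = 0
p_product = pmf_x[0] * pmf_x[0]       # P(Y=0) = P(X=0) = 1/3

print(cov, p_joint, p_product)        # 0 1/3 1/9
```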
Correlation
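The correlation coefficient is
$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X\,\sigma_Y}$$
where $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$.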
It lies between $-1$ and $1$.
Moments
The $k$th raw moment is $E[X^k]$.
The $k$th central moment is $E[(X-\mu)^k]$.
These moments are useful for characterizing distributions.
7. Common discrete distributions
Discrete distributions model counts, categories, and repeated trials.
Bernoulli distribution
Used for one success/failure trial.
If $X \sim \mathrm{Bernoulli}(p)$, then
$$P(X=1) = p, \qquad P(X=0) = 1-p$$
Mean and variance:
$$E[X] = p, \qquad \mathrm{Var}(X) = p(1-p)$$
Binomial distribution
Counts successes in $n$ independent Bernoulli trials.
If $X \sim \mathrm{Binomial}(n,p)$, then
$$P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$$
for $k=0,1,\dots,n$.
Mean and variance:
$$E[X] = np, \qquad \mathrm{Var}(X) = np(1-p)$$
Use when:
Fixed number of trials
Two outcomes per trial
Constant success probability
Independence across trials
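A small sketch of the binomial pmf with exact fractions, checking that it sums to 1 and reproduces the mean $np$ and variance $np(1-p)$. The parameters $n=10$, $p=3/10$ are arbitrary illustrative values.

```python
import math
from fractions import Fraction

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, Fraction(3, 10)
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                            # should be 1
mean = sum(k * q for k, q in enumerate(pmf))                # should be n*p
variance = sum(k**2 * q for k, q in enumerate(pmf)) - mean**2

print(total, mean, variance)  # 1 3 21/10
```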
Geometric distribution
Models the number of trials until the first success.
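If $X \sim \mathrm{Geometric}(p)$ on $\{1,2,\dots\}$, then
$$P(X=k) = (1-p)^{k-1}\,p$$
Mean and variance:
$$E[X] = \frac{1}{p}, \qquad \mathrm{Var}(X) = \frac{1-p}{p^2}$$
Memoryless property:
$$P(X > m+n \mid X > m) = P(X > n)$$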
Negative binomial distribution
Counts trials until the $r$th success.
If $X$ is the trial count to obtain $r$ successes:
$$P(X=k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}$$
for $k=r,r+1,\dots$
Poisson distribution
Models the number of events in a fixed interval when events occur independently at average rate $\lambda$.
If $X \sim \mathrm{Poisson}(\lambda)$, then
$$P(X=k) = \frac{e^{-\lambda}\lambda^k}{k!}$$
for $k=0,1,2,\dots$
Mean and variance:
$$E[X] = \lambda, \qquad \mathrm{Var}(X) = \lambda$$
Poisson is a common model for counts of rare events.
Hypergeometric distribution
Used for sampling without replacement from a finite population.
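If a population has $N$ items, $K$ successes, and we draw $n$ items without replacement, then the number of successes $X$ in the sample satisfies
$$P(X=k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$
Mean:
$$E[X] = \frac{nK}{N}$$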
Use hypergeometric instead of binomial when draws are dependent because the population is finite.
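To see how much the choice of model matters for a small population, the sketch below compares the hypergeometric pmf with a binomial pmf that uses $p = K/N$. The values $N=20$, $K=5$, $n=4$ are hypothetical.

```python
import math
from fractions import Fraction

def hypergeom_pmf(k, N, K, n):
    """P(X = k): k successes in n draws without replacement."""
    return Fraction(math.comb(K, k) * math.comb(N - K, n - k), math.comb(N, n))

def binom_pmf(k, n, p):
    """P(X = k): k successes in n independent trials (with replacement)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

N, K, n = 20, 5, 4
hyper = [hypergeom_pmf(k, N, K, n) for k in range(n + 1)]
binom = [binom_pmf(k, n, Fraction(K, N)) for k in range(n + 1)]

hyper_mean = sum(k * q for k, q in enumerate(hyper))   # equals n*K/N

for k in range(n + 1):
    print(k, float(hyper[k]), float(binom[k]))
```

Both models have the same mean here, but the individual probabilities differ because each draw without replacement changes the composition of what remains.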
8. Common continuous distributions
Continuous distributions model measurements and waiting times.
Uniform distribution
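If $X \sim \mathrm{Uniform}(a,b)$, then
$$f_X(x) = \frac{1}{b-a} \quad \text{for } a \le x \le b, \qquad f_X(x) = 0 \text{ otherwise}$$
Mean and variance:
$$E[X] = \frac{a+b}{2}, \qquad \mathrm{Var}(X) = \frac{(b-a)^2}{12}$$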
Exponential distribution
Models waiting time to the first Poisson event.
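If $X \sim \mathrm{Exponential}(\lambda)$, then
$$f_X(x) = \lambda e^{-\lambda x} \quad \text{for } x \ge 0$$
CDF:
$$F_X(x) = 1 - e^{-\lambda x} \quad \text{for } x \ge 0$$
Mean and variance:
$$E[X] = \frac{1}{\lambda}, \qquad \mathrm{Var}(X) = \frac{1}{\lambda^2}$$
Memoryless property:
$$P(X > s+t \mid X > s) = P(X > t)$$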
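The memoryless property $P(X > s+t \mid X > s) = P(X > t)$ follows from the survival function $P(X > t) = e^{-\lambda t}$. A quick numerical sketch, with arbitrary illustrative values for the rate and the times:

```python
import math

def survival(t, rate):
    """P(X > t) = exp(-rate * t) for X ~ Exponential(rate), t >= 0."""
    return math.exp(-rate * t)

def conditional_survival(s, t, rate):
    """P(X > s + t | X > s), from the definition of conditional probability."""
    return survival(s + t, rate) / survival(s, rate)

rate, s, t = 0.5, 3.0, 2.0
lhs = conditional_survival(s, t, rate)   # waited s already, wait t more
rhs = survival(t, rate)                  # fresh wait of length t

print(math.isclose(lhs, rhs))  # True
```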
Normal distribution
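If $X \sim \mathcal{N}(\mu,\sigma^2)$, then
$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
Standardization:
$$Z = \frac{X-\mu}{\sigma} \sim \mathcal{N}(0,1)$$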
The normal distribution is central to approximation and inference.
Gamma distribution
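For integer shape $\alpha$, the gamma distribution models the waiting time until the $\alpha$th Poisson event.
If $X \sim \mathrm{Gamma}(\alpha,\lambda)$ with shape $\alpha$ and rate $\lambda$, then
$$f_X(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\,x^{\alpha-1} e^{-\lambda x} \quad \text{for } x > 0$$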
Special cases:
Exponential distribution when $\alpha=1$
Chi-square distribution with $\nu$ degrees of freedom when $\alpha=\nu/2$ and $\lambda=1/2$
Beta distribution
The beta distribution is defined on $[0,1]$ and is often used for probabilities and proportions.
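If $X \sim \mathrm{Beta}(\alpha,\beta)$, then
$$f_X(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1} \quad \text{for } 0 \le x \le 1$$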
It is especially useful in Bayesian modeling.
9. Joint distributions and dependence
Many problems involve multiple random variables.
Joint distribution
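For discrete variables $X$ and $Y$, the joint pmf is
$$p_{X,Y}(x,y) = P(X=x,\ Y=y)$$
For continuous variables, the joint pdf $f_{X,Y}(x,y)$ satisfies
$$P\big((X,Y) \in A\big) = \iint_A f_{X,Y}(x,y)\,dx\,dy$$
so probabilities are obtained by double integration over the region $A$.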
Marginal distributions
From the joint distribution, get the marginal of $X$ by summing or integrating out $Y$:
Discrete:
$$p_X(x) = \sum_y p_{X,Y}(x,y)$$
Continuous:
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy$$
Conditional distributions
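Discrete:
$$p_{X\mid Y}(x \mid y) = \frac{p_{X,Y}(x,y)}{p_Y(y)}$$
Continuous:
$$f_{X\mid Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}$$
when the denominator is positive.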
Independence of random variables
$X$ and $Y$ are independent if their joint distribution factors:
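Discrete:
$$p_{X,Y}(x,y) = p_X(x)\,p_Y(y) \quad \text{for all } x, y$$
Continuous:
$$f_{X,Y}(x,y) = f_X(x)\,f_Y(y) \quad \text{for all } x, y$$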
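The discrete definitions can be traced on a tiny joint pmf. The sketch below assumes two fair coin flips, with $X$ the first flip (1 for heads) and $Y$ the total number of heads; the example and helper names are illustrative. It sums out one variable to get each marginal, forms a conditional pmf by dividing by $p_Y(y)$, and tests whether the joint pmf factors.

```python
from fractions import Fraction
from itertools import product

# Build the joint pmf of (X, Y) from the four equally likely flip pairs.
joint = {}
for first, second in product([0, 1], repeat=2):
    key = (first, first + second)
    joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 4)

def marginal(joint_pmf, axis):
    """Sum the joint pmf over the other variable."""
    out = {}
    for pair, prob in joint_pmf.items():
        out[pair[axis]] = out.get(pair[axis], Fraction(0)) + prob
    return out

p_x = marginal(joint, 0)
p_y = marginal(joint, 1)

def conditional_x_given_y(y):
    """p_{X|Y}(x | y) = p_{X,Y}(x, y) / p_Y(y)."""
    return {x: joint.get((x, y), Fraction(0)) / p_y[y] for x in p_x}

# Independence requires the joint pmf to factor at every (x, y).
independent = all(
    joint.get((x, y), Fraction(0)) == p_x[x] * p_y[y]
    for x in p_x for y in p_y
)

print(p_x, p_y, conditional_x_given_y(2), independent)
```

Here the factorization fails, as expected: knowing that both flips were heads ($Y=2$) pins down $X=1$.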
Conditional expectation
Conditional expectation is the mean of a random variable after conditioning on information.
For a discrete conditioning event:
$$E[X \mid Y=y] = \sum_x x\,P(X=x \mid Y=y)$$
This quantity is itself a function of $y$.
Conditional expectation is useful for:
Regression
Prediction
Decomposition of variance
Covariance matrix
For a vector of random variables $(X_1,\dots,X_n)$, the covariance matrix records pairwise covariances:
$$\Sigma_{ij} = \mathrm{Cov}(X_i, X_j)$$
The diagonal entries are variances.
10. Law of large numbers and central limit theorem
These limit theorems explain why averages become stable and why the normal distribution appears so often.
Law of large numbers
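If $X_1, X_2, \dots$ are i.i.d. with mean $\mu$, then the sample average
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$
converges to $\mu$ as $n$ grows (in probability under the weak law, almost surely under the strong law).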
Interpretation:
More data usually means a more stable average
The sample mean should approach the true mean
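A simulation sketch of the law of large numbers, assuming fair die rolls with true mean 3.5. The seed and sample sizes are arbitrary choices made so the run is reproducible.

```python
import random

def running_means(n_rolls, seed=0):
    """Running sample averages of simulated fair six-sided die rolls."""
    rng = random.Random(seed)      # local generator, reproducible
    total = 0
    means = []
    for i in range(1, n_rolls + 1):
        total += rng.randint(1, 6)
        means.append(total / i)
    return means

means = running_means(100_000)
for n in (10, 1_000, 100_000):
    # The running average should drift toward 3.5 as n grows.
    print(n, round(means[n - 1], 4))
```

Individual runs fluctuate, so the early averages may be far from 3.5; only the long-run behavior is guaranteed.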
Central limit theorem
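If $X_1,\dots,X_n$ are i.i.d. with mean $\mu$ and finite variance $\sigma^2$, then for large $n$ the sample average $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ is approximately normal:
$$\bar{X}_n \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)$$
Equivalently,
$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \approx \mathcal{N}(0,1)$$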
This approximation is the foundation of much of statistical inference.
Standard error
The standard deviation of the sample mean is
$$\mathrm{SE}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}}$$
This shrinks as $n$ increases.
Normal approximation to binomial
When $n$ is large and $p$ is not too close to 0 or 1, a binomial random variable $X \sim \mathrm{Binomial}(n,p)$ can often be approximated by a normal distribution:
$$X \approx \mathcal{N}\big(np,\ np(1-p)\big)$$
Use a continuity correction when appropriate.
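A sketch comparing the exact binomial cdf with the normal approximation, with and without a continuity correction. The case $n=100$, $p=0.5$, $k=55$ is an illustrative choice; the standard normal cdf is built from `math.erf`.

```python
import math

def normal_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

n, p, k = 100, 0.5, 55
mu = n * p
sigma = math.sqrt(n * p * (1 - p))

exact = binomial_cdf(k, n, p)
plain = normal_cdf((k - mu) / sigma)             # no correction
corrected = normal_cdf((k + 0.5 - mu) / sigma)   # P(X <= k) uses k + 0.5

print(round(exact, 4), round(plain, 4), round(corrected, 4))
```

The correction replaces the integer cutoff $k$ with $k + 0.5$ because the discrete value $k$ occupies the interval $[k-0.5,\ k+0.5]$ on the continuous scale.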
11. Problem-solving workflow
Probability problems are usually easier if you identify the model before doing algebra.
Step 1: Define the experiment
Ask:
What is random?
What are the possible outcomes?
Is order important?
Are draws independent?
Step 2: Identify the event
Write the event clearly in words first, then in symbols.
Examples:
At least one success
Exactly two defects
The first success occurs on trial 4
Given that the test was positive
Step 3: Choose the right tool
Pick the simplest model:
Counting for finite equally likely spaces
Conditional probability for updated information
Binomial for fixed independent trials
Hypergeometric for sampling without replacement
Poisson for rare-event counts
Exponential for waiting times
Normal approximation for large-sample sums and averages
Step 4: Compute carefully
Useful tactics:
Use the complement when "at least one" is involved
Use partitions when information is grouped into cases
Use symmetry when outcomes are equivalent
Use linearity of expectation for sums of indicators
Step 5: Check the result
Verify:
Probability is between 0 and 1
Units make sense if the variable is continuous
Independence assumptions are justified
The answer matches the event description
Indicator variables
If $I_A$ is the indicator of event $A$, then
$$I_A = 1 \text{ if } A \text{ occurs}, \qquad I_A = 0 \text{ otherwise}$$
and
$$E[I_A] = P(A)$$
Indicator variables are a powerful way to count expected numbers of events.
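A classic illustration, not taken from the text above: the expected number of fixed points of a uniformly random permutation of $n$ items. Writing the count as a sum of indicators $I_1+\dots+I_n$ with $E[I_i] = 1/n$, linearity gives an expected value of 1 for every $n$, even though the indicators are dependent. The sketch below confirms this by exhaustive enumeration for small $n$.

```python
import math
from fractions import Fraction
from itertools import permutations

def expected_fixed_points(n):
    """Average number of positions i with perm[i] == i over all n! permutations."""
    total = 0
    for perm in permutations(range(n)):
        total += sum(1 for i, v in enumerate(perm) if i == v)
    return Fraction(total, math.factorial(n))

for n in range(1, 7):
    print(n, expected_fixed_points(n))  # always 1
```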
12. Formula sheet
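Core rules
$$0 \le P(A) \le 1, \qquad P(\Omega) = 1$$
$$P(A^c) = 1 - P(A)$$
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
Bayes and total probability
$$P(A) = \sum_i P(A \mid B_i)\,P(B_i)$$
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
Counting
$$n! = n(n-1)\cdots 2 \cdot 1, \qquad {}_nP_r = \frac{n!}{(n-r)!}, \qquad \binom{n}{r} = \frac{n!}{r!\,(n-r)!}$$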
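Random variables
Discrete:
$$p_X(x) = P(X=x), \qquad \sum_x p_X(x) = 1$$
Continuous:
$$P(a \le X \le b) = \int_a^b f_X(x)\,dx, \qquad \int_{-\infty}^{\infty} f_X(x)\,dx = 1$$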
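Expectation and variance
Discrete:
$$E[X] = \sum_x x\,p_X(x)$$
Continuous:
$$E[X] = \int_{-\infty}^{\infty} x\,f_X(x)\,dx$$
Variance:
$$\mathrm{Var}(X) = E[X^2] - (E[X])^2$$
Covariance:
$$\mathrm{Cov}(X,Y) = E[XY] - E[X]\,E[Y]$$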
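Standard distributions
Bernoulli:
$$P(X=1) = p, \qquad P(X=0) = 1-p$$
Binomial:
$$P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k=0,1,\dots,n$$
Geometric:
$$P(X=k) = (1-p)^{k-1}\,p, \quad k=1,2,\dots$$
Poisson:
$$P(X=k) = \frac{e^{-\lambda}\lambda^k}{k!}, \quad k=0,1,2,\dots$$
Exponential:
$$f_X(x) = \lambda e^{-\lambda x}, \quad x \ge 0$$
Normal:
$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$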
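Key moments
Bernoulli:
$$E[X] = p, \qquad \mathrm{Var}(X) = p(1-p)$$
Binomial:
$$E[X] = np, \qquad \mathrm{Var}(X) = np(1-p)$$
Poisson:
$$E[X] = \lambda, \qquad \mathrm{Var}(X) = \lambda$$
Exponential:
$$E[X] = \frac{1}{\lambda}, \qquad \mathrm{Var}(X) = \frac{1}{\lambda^2}$$
Uniform on $[a,b]$:
$$E[X] = \frac{a+b}{2}, \qquad \mathrm{Var}(X) = \frac{(b-a)^2}{12}$$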
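Limit theorems
Sample mean:
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$
Standard error:
$$\mathrm{SE}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}}$$
CLT approximation:
$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \approx \mathcal{N}(0,1)$$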
13. Common mistakes to avoid
Confusing permutations with combinations.
Forgetting whether order matters.
Using the binomial model when sampling is without replacement from a small finite population.
Applying conditional probability as if events were independent.
Treating continuous variables like discrete ones, especially by assigning positive probability to a single point.
Forgetting to normalize a pdf or pmf.
Misusing Bayes' theorem by swapping $P(A\mid B)$ and $P(B\mid A)$.
Ignoring the complement rule for "at least one" problems.
Assuming zero covariance implies independence.
Forgetting continuity corrections in normal approximations to discrete counts.
Quick reference for modeling
Use binomial for fixed independent trials with success probability $p$.
Use hypergeometric for sampling without replacement.
Use Poisson for counts of rare events in time or space.
Use exponential for waiting times between Poisson events.
Use normal for sums, averages, and large-sample approximations.
Use Bayes' theorem when information is observed and beliefs must be updated.