1. What statistics studies
Statistics is the science of turning data into conclusions under uncertainty.
It has two main goals:
Descriptive statistics summarize what the data show.
Inferential statistics generalize from a sample to a population.
The key distinction is this:
A population is the full set of units of interest.
A sample is the subset actually observed.
Because samples are incomplete, every statistical conclusion carries uncertainty. Good statistical work makes that uncertainty explicit.
Core ideas
Variation is real and expected.
Models are approximations, not truth.
Randomness can be quantified and used.
The same dataset can support different conclusions if the question changes.
2. Data, variables, and measurement scales
Types of variables
| Type | Meaning | Examples |
|---|---|---|
| Categorical | Values are labels or groups | color, major, brand |
| Ordinal | Categories have a natural order | small/medium/large, rankings |
| Quantitative | Values are numerical measurements | height, income, age |
Quantitative variables may be:
Discrete: countable values, such as number of calls
Continuous: any value in an interval, such as time or weight
Measurement scales
| Scale | Has order? | Equal differences meaningful? | True zero? | Example |
|---|---|---|---|---|
| Nominal | No | No | No | blood type |
| Ordinal | Yes | No | No | class rank |
| Interval | Yes | Yes | No | Celsius temperature |
| Ratio | Yes | Yes | Yes | mass, length, income |
The scale determines what computations make sense. For example, a mean is meaningful for interval and ratio data but not for nominal categories, and ratios such as "twice as much" require a true zero.
Common data issues
Missing values
Outliers
Measurement error
Selection bias
Confounding
These are not cosmetic issues. They can change the answer.
3. Describing a distribution
A distribution describes which values a variable takes and how often it takes them.
Center
Two standard measures of center are:
Mean
$$ \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i $$
Median
The middle value after sorting
Robust to outliers
Use the mean for roughly symmetric data without strong outliers. Use the median for skewed data or data with outliers.
Spread
Common measures of spread:
Range: $\max(x) - \min(x)$
Variance:
$$ s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 $$
Standard deviation:
$$ s = \sqrt{s^2} $$
Interquartile range:
$$ \mathrm{IQR} = Q_3 - Q_1 $$
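Variance is the average squared deviation from the mean. The standard deviation is its square root, so it is in the original units and measures typical distance from the mean. IQR measures the width of the middle 50% and is less sensitive to outliers.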
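As a rough illustration of how these measures react to an outlier, the sketch below uses Python's standard `statistics` module on a made-up sample. The data values and the choice of the "inclusive" quartile method are assumptions for illustration; quartile conventions differ between textbooks and software.

```python
import statistics

# Hypothetical sample: five typical values plus one large outlier.
data = [12, 14, 15, 15, 17, 60]

mean = statistics.mean(data)
median = statistics.median(data)
variance = statistics.variance(data)  # sample variance, divides by n - 1
std_dev = statistics.stdev(data)      # square root of the sample variance

# Quartiles by the "inclusive" method (one of several conventions).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(f"mean={mean:.2f} median={median} sd={std_dev:.2f} IQR={iqr:.2f}")
```

In this sample the single large value pulls the mean and standard deviation upward, while the median and IQR stay close to the bulk of the data.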
Shape
Important shape features:
Symmetric or skewed
Unimodal, bimodal, or multimodal
Heavy-tailed or light-tailed
Presence of outliers
Z-scores
A z-score tells how many standard deviations a value lies from the mean:
$$ z = \frac{x - \mu}{\sigma} $$
For sample data:
$$ z = \frac{x - \bar{x}}{s} $$
Z-scores are useful for comparing values across different scales.
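A minimal sketch of that comparison, assuming two invented sets of exam scores on different scales. The helper name `z_scores` is hypothetical; it subtracts the sample mean and divides by the sample standard deviation.

```python
import statistics

def z_scores(values):
    """Return sample z-scores: (x - mean) / sample standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - m) / s for x in values]

# Hypothetical scores on two exams with different scales.
exam_a = [55, 60, 65, 70, 75]  # marked out of 100
exam_b = [12, 14, 15, 16, 18]  # marked out of 20

z_a = z_scores(exam_a)
z_b = z_scores(exam_b)

# The top scores (75 and 18) are not comparable directly, but their z-scores are.
print(f"top score on A: z={z_a[-1]:.2f}; top score on B: z={z_b[-1]:.2f}")
```

After standardizing, each set of scores has mean 0 and standard deviation 1, so the two top scores can be compared relative to their own groups.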
Five-number summary
Minimum
First quartile $Q_1$
Median
Third quartile $Q_3$
Maximum
This summary underlies the boxplot.
4. Probability essentials
Probability models uncertainty mathematically.
Sample spaces and events
A sample space is the set of all possible outcomes.
An event is a subset of the sample space.
For event $A$:
$$ 0 \le P(A) \le 1 $$
and
$$ P(\Omega) = 1 $$
where $\Omega$ is the sample space.
Rules
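Complement
$$ P(A^c) = 1 - P(A) $$
Addition rule
For any events $A$ and $B$:
$$ P(A \cup B) = P(A) + P(B) - P(A \cap B) $$
If $A$ and $B$ are mutually exclusive, then $P(A \cap B)=0$, so $P(A \cup B) = P(A) + P(B)$.
Conditional probability
$$ P(A \mid B) = \frac{P(A \cap B)}{P(B)} $$
provided $P(B) > 0$.
Independence
Events $A$ and $B$ are independent if:
$$ P(A \cap B) = P(A)\,P(B) $$
equivalently, when $P(B) > 0$,
$$ P(A \mid B) = P(A) $$
Bayes' rule
$$ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} $$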
Bayes' rule reverses conditioning. It is especially important in diagnostics and classification.
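A small diagnostic example may make the reversal concrete. The sketch below computes the probability of disease given a positive test from the prevalence, sensitivity, and specificity, using the law of total probability for the denominator. The function name `posterior` and all three input numbers are hypothetical.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' rule."""
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1 - specificity  # false positive rate
    # Law of total probability: overall chance of a positive test.
    p_pos = p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)
    return p_pos_given_disease * prior / p_pos

# Assumed inputs: 1% prevalence, 95% sensitivity, 90% specificity.
p = posterior(prior=0.01, sensitivity=0.95, specificity=0.90)
print(f"P(disease | positive) = {p:.4f}")
```

With these assumed inputs, fewer than one in ten positive results corresponds to a true case, because false positives from the large healthy group outnumber true positives from the small affected group.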
5. Common distributions
Bernoulli
A Bernoulli random variable takes values 1 and 0 with:
$$ P(X=1) = p, \qquad P(X=0) = 1-p $$
Mean:
$$ E[X] = p $$
Variance:
$$ \mathrm{Var}(X) = p(1-p) $$
Binomial
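Counts successes in $n$ independent Bernoulli trials with success probability $p$:
$$ X \sim \mathrm{Binomial}(n, p) $$
Probability mass function:
$$ P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \dots, n $$
Mean and variance:
$$ E[X] = np, \qquad \mathrm{Var}(X) = np(1-p) $$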
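As a quick numerical check of the binomial model, the sketch below evaluates the probability mass function with `math.comb` and recovers the mean $np$ and variance $np(1-p)$ by direct summation. The values $n = 10$ and $p = 0.3$ and the helper name `binomial_pmf` are illustrative assumptions.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3  # hypothetical parameters
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed from the definition of expectation.
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(f"sum={sum(pmf):.6f} mean={mean:.4f} variance={variance:.4f}")
```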
Normal
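The normal distribution is symmetric and bell-shaped:
$$ X \sim N(\mu, \sigma^2), \qquad f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$
Standardization:
$$ Z = \frac{X-\mu}{\sigma} \sim N(0, 1) $$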
The normal distribution is central because many sums and averages are approximately normal.
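Standardization means any normal probability can be read off the standard normal curve. The sketch below checks this with `statistics.NormalDist` from the Python standard library; the mean 100, standard deviation 15, and cutoff 130 are made-up values.

```python
from statistics import NormalDist

mu, sigma = 100, 15  # hypothetical population mean and standard deviation
x = 130

z = (x - mu) / sigma                      # standardized value
p_direct = NormalDist(mu, sigma).cdf(x)   # P(X <= x) on the original scale
p_standard = NormalDist().cdf(z)          # P(Z <= z) on the standard scale

print(f"z={z:.2f} P(X<=x)={p_direct:.4f} P(Z<=z)={p_standard:.4f}")
```

Both probabilities agree, which is why one standard normal table (or function) serves every normal distribution.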
Poisson
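Models counts over time, space, or area when events occur independently at a constant average rate $\lambda$:
$$ P(X=k) = \frac{e^{-\lambda}\lambda^k}{k!}, \quad k = 0, 1, 2, \dots $$
Mean and variance:
$$ E[X] = \lambda, \qquad \mathrm{Var}(X) = \lambda $$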
Exponential
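Models waiting time between Poisson events with rate $\lambda$:
$$ f(x) = \lambda e^{-\lambda x}, \quad x \ge 0 $$
Memoryless property:
$$ P(X > s + t \mid X > s) = P(X > t) $$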
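The memoryless property says that having already waited $s$ units does not change the chance of waiting at least $t$ more. A minimal numerical check, assuming the exponential survival function $P(X > t) = e^{-\lambda t}$ and arbitrary illustrative values for the rate and the two times:

```python
import math

def survival(t, rate):
    """P(X > t) for an exponential waiting time with the given rate."""
    return math.exp(-rate * t)

rate = 0.5        # hypothetical events per unit time
s, t = 2.0, 3.0   # time already waited, additional time

# P(X > s + t | X > s) = P(X > s + t) / P(X > s)
conditional = survival(s + t, rate) / survival(s, rate)
unconditional = survival(t, rate)

print(f"conditional={conditional:.6f} unconditional={unconditional:.6f}")
```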
6. Sampling and the central limit theorem
Sampling distributions
A statistic is a random variable because it depends on the sample. Its distribution is called a sampling distribution.
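The standard error is the standard deviation of a sampling distribution; it measures typical sampling variation.
For the sample mean:
$$ \mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}} $$
If $\sigma$ is unknown, use:
$$ \widehat{\mathrm{SE}}(\bar{x}) = \frac{s}{\sqrt{n}} $$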
Central limit theorem
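If observations are independent and identically distributed with mean $\mu$ and finite variance $\sigma^2$, then for sufficiently large $n$ the distribution of the sample mean is approximately normal:
$$ \bar{X} \approx N\!\left(\mu, \frac{\sigma^2}{n}\right) $$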
This is one of the most important results in statistics.
Practical implications
Large samples reduce random error.
Skewed populations can still produce approximately normal sample means.
The CLT does not fix bias, bad measurement, or dependence.
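A small simulation can illustrate the second point. The sketch below draws repeated samples from a strongly right-skewed exponential population (mean 1, standard deviation 1) and compares the spread of the sample means with the theoretical standard error $\sigma/\sqrt{n}$. The sample size, number of repetitions, and seed are arbitrary choices made for reproducibility.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is reproducible

def sample_mean(n):
    """Mean of n draws from an exponential population with rate 1."""
    return statistics.fmean([rng.expovariate(1.0) for _ in range(n)])

n, reps = 50, 2000
means = [sample_mean(n) for _ in range(reps)]

observed_se = statistics.stdev(means)  # spread of the simulated sample means
theoretical_se = 1.0 / n ** 0.5        # sigma / sqrt(n) with sigma = 1

print(f"mean of means={statistics.fmean(means):.3f}")
print(f"observed SE={observed_se:.3f} theoretical SE={theoretical_se:.3f}")
```

A histogram of `means` would look roughly bell-shaped even though the individual draws are skewed. The simulation uses independent draws from the intended population, so it says nothing about biased or dependent data.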
Independence and sampling quality
The CLT and many inferential methods rely on:
Random sampling or random assignment
Independence or weak dependence
Reasonable sample size
If sampling is biased, inference can be precise and wrong at the same time.
7. Estimation and confidence intervals
Point estimates
A point estimate is a single best guess for a population parameter.
Examples:
$\bar{x}$ estimates $\mu$
$\hat{p}$ estimates $p$
$s^2$ estimates $\sigma^2$
Confidence intervals
A confidence interval gives a range of plausible values for a parameter.
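General form:
$$ \text{estimate} \pm (\text{critical value}) \times (\text{standard error}) $$
For a mean with known $\sigma$:
$$ \bar{x} \pm z^* \frac{\sigma}{\sqrt{n}} $$
For a mean with unknown $\sigma$:
$$ \bar{x} \pm t^*_{n-1} \frac{s}{\sqrt{n}} $$
For a proportion:
$$ \hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} $$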
Interpreting confidence intervals
The correct interpretation is about the long-run method, not a single interval:
A 95% confidence method produces intervals that capture the true parameter about 95% of the time in repeated sampling.
It does not mean there is a 95% probability that the parameter lies in a particular computed interval.
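The long-run interpretation can be checked by simulation. The sketch below repeatedly samples from a normal population with known $\sigma$, builds a 95% z-interval for the mean each time, and counts how often the interval contains the true mean. The population parameters, sample size, repetition count, and seed are all assumed values.

```python
import random
from statistics import NormalDist, fmean

rng = random.Random(1)  # fixed seed so the run is reproducible
mu, sigma, n, reps = 10.0, 2.0, 25, 2000  # hypothetical setup

z_star = NormalDist().inv_cdf(0.975)  # critical value for 95% confidence
margin = z_star * sigma / n ** 0.5    # known-sigma margin of error

covered = 0
for _ in range(reps):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    x_bar = fmean(sample)
    # Does this particular interval contain the true mean?
    if x_bar - margin <= mu <= x_bar + margin:
        covered += 1

coverage = covered / reps
print(f"fraction of intervals containing mu: {coverage:.3f}")
```

Each individual interval either contains $\mu$ or it does not; the 95% describes the fraction of intervals that do across many repetitions.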
Margin of error
Margin of error usually decreases when:
Sample size increases
Variability decreases
Confidence level decreases
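These three effects can be seen in the known-$\sigma$ margin of error $z^* \sigma / \sqrt{n}$. The helper below is a hypothetical sketch for that case only; the input values are arbitrary.

```python
from statistics import NormalDist

def margin_of_error(sigma, n, confidence=0.95):
    """Margin of error z* * sigma / sqrt(n) for a mean with known sigma."""
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z_star * sigma / n ** 0.5

base = margin_of_error(sigma=10, n=100)
larger_sample = margin_of_error(sigma=10, n=400)            # 4x the sample size
less_variable = margin_of_error(sigma=5, n=100)             # half the variability
lower_confidence = margin_of_error(sigma=10, n=100, confidence=0.90)

print(f"{base:.3f} {larger_sample:.3f} {less_variable:.3f} {lower_confidence:.3f}")
```

Because $n$ enters through a square root, quadrupling the sample size only halves the margin of error.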
8. Hypothesis testing
Hypothesis testing is a framework for deciding whether observed data are consistent with a null model.
Setup
Null hypothesis: $H_0$
Alternative hypothesis: $H_1$ or $H_a$
Test statistic: a number summarizing evidence against $H_0$
P-value: probability, under $H_0$, of data at least as extreme as what was observed
Decision logic
Small p-value: evidence against $H_0$
Large p-value: data are reasonably consistent with $H_0$
Common significance levels:
$$ \alpha = 0.10,\ 0.05,\ 0.01 $$
Reject $H_0$ when $p \le \alpha$.
Error types
| Decision | Reality: $H_0$ true | Reality: $H_0$ false |
|---|---|---|
| Reject $H_0$ | Type I error | Correct rejection |
| Fail to reject $H_0$ | Correct decision | Type II error |
Type I error probability is $\alpha$.
Type II error probability is $\beta$.
Power is $1-\beta$.
Common test statistics
One-sample mean
$$ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} $$
One-sample proportion
$$ z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} $$
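A minimal sketch of a two-sided one-sample proportion test, assuming the large-sample normal approximation and using the standard error computed under $H_0$. The function name and the counts (60 successes in 100 trials, null value 0.5) are hypothetical.

```python
from statistics import NormalDist

def one_sample_proportion_z(successes, n, p0):
    """Return (z, two-sided p-value) for H0: p = p0."""
    p_hat = successes / n
    se0 = (p0 * (1 - p0) / n) ** 0.5  # standard error under H0
    z = (p_hat - p0) / se0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # both tails
    return z, p_value

z, p_value = one_sample_proportion_z(successes=60, n=100, p0=0.5)
print(f"z={z:.2f} p-value={p_value:.4f}")
```

At $\alpha = 0.05$ this p-value would lead to rejecting $H_0$, but the decision is close, which is one reason to report the estimate and an interval alongside it.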
Good testing practice
State hypotheses before seeing the result.
Decide in advance what effect size would be meaningful; do not rely on significance alone.
Report confidence intervals alongside p-values.
Distinguish statistical significance from practical importance.
9. Correlation and regression
Correlation
Correlation measures the strength and direction of linear association between two quantitative variables.
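Pearson correlation coefficient:
$$ r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \, \sum_{i=1}^n (y_i - \bar{y})^2}} $$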
Interpretation:
$r \approx 1$: strong positive linear relationship
$r \approx -1$: strong negative linear relationship
$r \approx 0$: little linear relationship
Correlation does not imply causation.
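Because $r$ measures only linear association, a value near zero does not rule out a strong relationship. The sketch below computes Pearson's $r$ from its definition for a perfectly linear pattern and for a perfectly quadratic one; the helper `pearson_r` and the data are illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]  # exact straight line
quadratic = [x ** 2 for x in xs]  # exact curve, symmetric about 0

r_linear = pearson_r(xs, linear)
r_quadratic = pearson_r(xs, quadratic)
print(f"r (linear)={r_linear:.3f} r (quadratic)={r_quadratic:.3f}")
```

The quadratic data are completely determined by $x$, yet their correlation is zero, so a scatterplot should accompany any reported $r$.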
Simple linear regression
Model:
$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i $$
Estimated line:
$$ \hat{y} = b_0 + b_1 x $$
Interpretation:
$b_0$ is the predicted response when $x=0$
$b_1$ is the expected change in $y$ for a one-unit increase in $x$
Residuals
Residual:
$$ e_i = y_i - \hat{y}_i $$
Residuals reveal:
Nonlinearity
Unequal variance
Outliers
Model misspecification
Coefficient of determination
$R^2$ is the proportion of variability in the response explained by the model:
$$ R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2} $$
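The pieces of simple linear regression fit together in a few lines. The sketch below assumes ordinary least squares, where the slope is $\sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and the intercept makes the line pass through $(\bar{x}, \bar{y})$. It then computes residuals and $R^2$ as one minus the ratio of residual to total sum of squares. The data and the helper `fit_line` are made up.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data that are close to, but not exactly on, a line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
fitted = [b0 + b1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]  # observed minus predicted

mean_y = sum(ys) / len(ys)
ss_res = sum(e ** 2 for e in residuals)          # unexplained variation
ss_tot = sum((y - mean_y) ** 2 for y in ys)      # total variation
r_squared = 1 - ss_res / ss_tot

print(f"b0={b0:.3f} b1={b1:.3f} R^2={r_squared:.4f}")
```

Least-squares residuals from a model with an intercept sum to zero, so patterns in a residual plot, not the residual total, are what reveal problems.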
Regression pitfalls
Extrapolation beyond the observed range
Omitted variables
Influential outliers
Assuming causation from association
10. Nonparametric and categorical methods
Categorical data
When variables are categorical, analysis often uses counts and proportions.
Chi-square goodness-of-fit
Tests whether observed counts $O_i$ differ from expected counts $E_i$:
$$ \chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i} $$
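As an illustration, the sketch below computes the goodness-of-fit statistic, the sum over categories of (observed − expected)² / expected, for invented counts from 120 rolls of a die under the null hypothesis that all six faces are equally likely. The critical value 11.07 is the tabulated 95th percentile of the chi-square distribution with 5 degrees of freedom; the Python standard library has no chi-square distribution, so no p-value is computed here.

```python
# Hypothetical counts for faces 1..6 from 120 rolls.
observed = [18, 22, 16, 25, 19, 20]
total = sum(observed)
expected = [total / len(observed)] * len(observed)  # 20 per face under H0

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
degrees_of_freedom = len(observed) - 1

CRITICAL_VALUE_95_DF5 = 11.07  # from a standard chi-square table
reject_h0 = chi_square > CRITICAL_VALUE_95_DF5

print(f"chi-square={chi_square:.2f} df={degrees_of_freedom} reject={reject_h0}")
```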
Chi-square test of independence
Used for contingency tables to test whether two categorical variables are associated.
Nonparametric ideas
Nonparametric methods make fewer assumptions about the population distribution.
Common examples:
Sign test
Wilcoxon signed-rank test
Mann-Whitney U test
Kruskal-Wallis test
These are useful when data are skewed, ordinal, or have strong outliers.
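The sign test is simple enough to sketch directly. Under the null hypothesis that positive and negative differences are equally likely, the number of positive differences follows a Binomial($n$, 0.5) distribution, so an exact p-value needs only binomial coefficients. The function below is a hypothetical two-sided version that drops zero differences (one common convention); the paired differences are invented.

```python
from math import comb

def sign_test_p_value(differences):
    """Exact two-sided sign test p-value; zero differences are dropped."""
    nonzero = [d for d in differences if d != 0]
    n = len(nonzero)
    positives = sum(d > 0 for d in nonzero)
    tail = min(positives, n - positives)
    # P(X <= tail) for X ~ Binomial(n, 0.5), then doubled for two sides.
    one_sided = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * one_sided)

# Hypothetical paired differences (after minus before).
diffs = [1.2, 0.8, -0.3, 2.1, 0.5, 1.7, 0.9, 0.4, 1.1, 0.6, 0.0]
p_value = sign_test_p_value(diffs)
print(f"two-sided sign test p-value = {p_value:.4f}")
```

Only the signs of the differences are used, which is why the test tolerates outliers and skewness but has less power than methods that use magnitudes.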
When they help
Small samples with unknown shape
Ordinal outcomes
Non-normal data where rank-based comparisons are acceptable
11. Practical workflow and pitfalls
A reliable analysis workflow
Define the question precisely.
Identify the population, parameter, and sample.
Inspect data quality and missingness.
Visualize the distribution.
Choose methods matched to the variable type and design.
Check assumptions.
Compute estimates, intervals, or test results.
Interpret in context.
Report limitations.
Common pitfalls
Confusing sample statistics with population parameters
Ignoring selection bias
Treating p-values as effect sizes
Confusing correlation with causation
Overfitting a model to noisy data
Reporting significance without uncertainty
Model-checking checklist
Is the sample representative?
Are observations independent?
Is the variable type matched to the method?
Are outliers influential?
Are assumptions approximately satisfied?
Is the interpretation scientifically meaningful?
12. Formula sheet
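Descriptive statistics
$$ \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2, \qquad \mathrm{IQR} = Q_3 - Q_1 $$
Probability
$$ P(A \cup B) = P(A) + P(B) - P(A \cap B), \qquad P(A \mid B) = \frac{P(A \cap B)}{P(B)} $$
Distributions
$$ \text{Binomial: } E[X] = np,\ \mathrm{Var}(X) = np(1-p) \qquad \text{Poisson: } E[X] = \mathrm{Var}(X) = \lambda $$
Inference
$$ \mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}}, \qquad \bar{x} \pm t^*_{n-1}\frac{s}{\sqrt{n}}, \qquad \hat{p} \pm z^*\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} $$
Regression
$$ \hat{y} = b_0 + b_1 x, \qquad e_i = y_i - \hat{y}_i $$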
Problem-solving notes
Start by naming the parameter.
Check whether the method is for a mean, proportion, count, or association.
If the data are skewed, consider a transformation or a robust method.
If the conclusion depends on a model assumption, state that assumption explicitly.