| Type | Meaning | Examples |
|---|---|---|
| Discrete | Counted — non-negative integers | Number of passengers, calls per day |
| Continuous | Measured — can take decimals | Height, weight, time |
| Nominal categorical | Categories with no natural order | Political party, eye colour |
| Ordinal categorical | Categories with a natural order | Dissatisfied / indifferent / satisfied |
Key distinction: measurable variables have a recognised measurement method; categorical variables classify observations into groups.
| Relationship | Skew type | Tail direction |
|---|---|---|
| Mean > median | Positive / right skew | Long tail to the right |
| Mean < median | Negative / left skew | Long tail to the left |
| Mean = median | Symmetric | Balanced |
Why n−1? Dividing by n−1 instead of n makes s² an unbiased estimator of σ². It corrects for the fact that sample deviations are measured around the sample mean, not the true population mean.
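A quick numerical check of the two divisors — Python's `statistics` module uses n−1 for `variance` and n for `pvariance`; the data are invented for illustration:

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
mean = sum(data) / n

# Divide by n: biased (matches statistics.pvariance)
var_n = sum((x - mean) ** 2 for x in data) / n
# Divide by n - 1: unbiased (matches statistics.variance)
var_n1 = sum((x - mean) ** 2 for x in data) / (n - 1)

print(var_n, var_n1)  # 4.0 vs ~4.571 — the n divisor is systematically smaller
```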
Outlier rule (boxplot): a value is an outlier if it lies more than 1.5×IQR below Q₁ or above Q₃.
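The 1.5×IQR rule can be sketched as below. Note that quartile conventions differ between textbooks and software, so the exact fences depend on the method chosen (here `statistics.quantiles` with `method="inclusive"`); the data are invented:

```python
import statistics

data = [300, 350, 354, 360, 364, 368, 370, 372, 375, 380, 500]

# "inclusive" interpolates quartiles within the observed data points
q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # [300, 500]
```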
When class widths are unequal, use frequency density on the y-axis: frequency density = frequency ÷ class width.
A wider class that looks tall on a raw-frequency y-axis doesn't contain more data — it's just wider. Frequency density corrects for this so area (not height) represents frequency.
| Class | Width | Frequency | Freq. density |
|---|---|---|---|
| [300, 360) | 60 | 6 | 6/60 = 0.100 |
| [360, 380) | 20 | 14 | 14/20 = 0.700 |
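The density column is just frequency divided by class width; a one-liner confirms the table:

```python
# (interval, frequency) pairs from the table above
classes = [((300, 360), 6), ((360, 380), 14)]

densities = [freq / (hi - lo) for (lo, hi), freq in classes]
print(densities)  # [0.1, 0.7]
```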
| Feature | What it shows |
|---|---|
| Middle line | Median |
| Bottom of box | Q₁ (lower quartile) |
| Top of box | Q₃ (upper quartile) |
| Box length | IQR |
| Whiskers | Furthest non-outlier values |
| Separate points | Outliers (>1.5×IQR from box) |
Boxplots allow quick visual reading of median, quartiles, IQR, range, outliers, and skewness. If the median is closer to Q₁, the distribution is right-skewed; closer to Q₃ implies left-skewed.
Stem-and-leaf diagrams preserve the raw data while showing the distribution. The stem is usually tens/hundreds; the leaf is units.
Data: 350, 354, 364, 368 → Stem 35 | 0 4 and Stem 36 | 4 8
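A minimal sketch of building a stem-and-leaf display (stem = tens, leaf = units), extending the example data with a few invented values:

```python
from collections import defaultdict

data = [350, 354, 364, 368, 371, 375, 375, 382]

stems = defaultdict(list)
for x in sorted(data):
    stems[x // 10].append(x % 10)  # stem = value without its units digit

for stem in sorted(stems):
    print(f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}")  # e.g. "35 | 0 4"
```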
| Diagram | Best for |
|---|---|
| Dot plot | Small datasets — shows clustering and gaps clearly |
| Histogram | Frequency distribution of discrete or continuous data |
| Stem-and-leaf | Preserving raw data while showing distribution |
| Boxplot | Comparing distributions; spotting outliers and skewness |
A good diagram is clear, honest, highlights patterns, and allows the reader to extract information quickly. A bad diagram confuses or misleads.
A dataset can be unimodal (one mode), bimodal (two modes), or multimodal (several modes).
An experiment is a process with uncertain outcomes. The sample space S is the set of all possible outcomes. An event is any subset of S.
| Experiment | Sample space S |
|---|---|
| Toss a coin | {H, T} |
| Roll a die | {1, 2, 3, 4, 5, 6} |
Probability rules: \(0 \le P(A) \le 1\). If \(P(A) = 0\) the event is impossible; if \(P(A) = 1\) it is certain.
Classical (equally likely outcomes) probability: \(P(A) = \dfrac{n}{N}\), where n = favourable outcomes, N = total equally likely outcomes.
| Symbol | Meaning |
|---|---|
| \(A \cup B\) | A or B (union) |
| \(A \cap B\) | A and B (intersection) |
| \(A^c\) | Not A (complement) |
| \(A \mid B\) | A given B (conditional) |
Mutual exclusivity and independence are NOT the same. Mutually exclusive events with positive probability are actually dependent — knowing one occurred tells you the other didn't. If A and B are mutually exclusive and P(A) > 0 and P(B) > 0, then \(P(A \cap B) = 0 \ne P(A)P(B)\), so they CANNOT be independent.
Arranging 4 people in a queue from 6: \({}^6P_4 = 6\times5\times4\times3 = 360\)
Choosing 4 people from 6: \({}^6C_4 = \frac{6!}{4!\,2!} = 15\)
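Both counts are available directly in Python's standard library:

```python
import math

arrangements = math.perm(6, 4)  # ordered selections: 6*5*4*3
selections = math.comb(6, 4)    # unordered: 6!/(4! 2!)
print(arrangements, selections)  # 360 15
```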
Suppose a disease has prevalence 1% (P(D) = 0.01). A test is 99% sensitive (P(+|D) = 0.99) and 95% specific (P(−|D^c) = 0.95, so P(+|D^c) = 0.05).
Despite the highly accurate test, a positive result only gives a ~17% probability of disease because the disease is rare (low base rate). False positives dominate when the disease prevalence is low. This is the classic base-rate fallacy.
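The base-rate calculation can be checked numerically from the prevalence, sensitivity and specificity quoted above:

```python
p_d = 0.01              # prevalence, P(D)
p_pos_given_d = 0.99    # sensitivity, P(+|D)
p_pos_given_not_d = 0.05  # 1 - specificity, P(+|D^c)

# Total probability of a positive test, then Bayes' theorem
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 4))  # 0.1667 — only ~17% despite the "accurate" test
```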
For any partition B₁, B₂, …, Bₙ of S (mutually exclusive and collectively exhaustive): \(P(A) = \sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)\).
This is most naturally read from a tree diagram: multiply probabilities along branches, then add across branches that lead to A.
| Distribution | Mean | Variance | P(X = x) |
|---|---|---|---|
| Discrete Uniform(k) | \(\dfrac{k+1}{2}\) | \(\dfrac{k^2-1}{12}\) | \(\dfrac{1}{k}\) |
| Bernoulli(π) | \(\pi\) | \(\pi(1-\pi)\) | \(\pi^x(1-\pi)^{1-x}\) |
| Binomial(n, π) | \(n\pi\) | \(n\pi(1-\pi)\) | \(\binom{n}{x}\pi^x(1-\pi)^{n-x}\) |
| Poisson(λ) | \(\lambda\) | \(\lambda\) | \(\dfrac{e^{-\lambda}\lambda^x}{x!}\) |
Poisson key feature: mean = variance = λ. Checking whether the sample mean and sample variance are approximately equal is the standard diagnostic for a Poisson model.
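A numerical sanity check that the Poisson pmf has mean = variance = λ, using λ = 3.2 from the breakdowns example below and truncating the infinite support far into the tail:

```python
import math

lam = 3.2
# Poisson pmf; terms beyond x = 60 are negligible for lambda = 3.2
pmf = [math.exp(-lam) * lam ** x / math.factorial(x) for x in range(60)]

mean = sum(x * p for x, p in enumerate(pmf))
var = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))
print(round(mean, 4), round(var, 4))  # 3.2 3.2
```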
X ~ Bin(n, π) applies when ALL four conditions hold:
| Condition | Meaning |
|---|---|
| Two outcomes | Success or failure only |
| Fixed probability | Same π every trial |
| Fixed number of trials | n is known before the experiment |
| Independent trials | Each trial does not affect the next |
A binomial random variable has n+1 possible values (0 through n), not n.
Models the number of random events in a continuous medium (time, area, distance, volume).
| Context | λ |
|---|---|
| Average 3.2 breakdowns per week; question about 1 week | λ = 3.2 |
| Same; question about 2 weeks | λ = 6.4 |
| Same; question about half a week | λ = 1.6 |
| Wording | Mathematical form |
|---|---|
| Exactly x | \(P(X = x)\) |
| At most x | \(P(X \le x) = F(x)\) |
| Fewer than x | \(P(X < x) = P(X \le x-1) = F(x-1)\) |
| More than x | \(P(X > x) = 1 - F(x)\) |
| At least x | \(P(X \ge x) = 1 - F(x-1)\) |
Most common error: "at least x" is \(1 - F(x-1)\), NOT \(1 - F(x)\). The "at least" boundary is included.
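The boundary conventions can be verified against a hand-rolled binomial CDF (the parameters n = 10, π = 0.3, x = 4 are illustrative):

```python
import math

def binom_cdf(x, n, pi):
    """F(x) = P(X <= x) for X ~ Bin(n, pi)."""
    return sum(math.comb(n, k) * pi ** k * (1 - pi) ** (n - k) for k in range(x + 1))

n, pi, x = 10, 0.3, 4
at_most = binom_cdf(x, n, pi)            # P(X <= 4) = F(4)
fewer_than = binom_cdf(x - 1, n, pi)     # P(X < 4)  = F(3)
at_least = 1 - binom_cdf(x - 1, n, pi)   # P(X >= 4) = 1 - F(3), boundary INCLUDED
more_than = 1 - binom_cdf(x, n, pi)      # P(X > 4)  = 1 - F(4)
print(round(at_least, 4), round(more_than, 4))  # at_least > more_than
```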
If X ~ Bin(n, π), approximate with X ~ Pois(λ) where λ = nπ when n is large and π is small (rule of thumb: n > 30 and nπ < 10).
Example: n = 100, π = 0.02 → λ = nπ = 2, so use Pois(2).
Poisson is simpler here because you don't need to compute \(\binom{100}{x}\) for each x.
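A side-by-side comparison of the exact binomial probabilities and the Poisson approximation for n = 100, π = 0.02:

```python
import math

n, pi = 100, 0.02
lam = n * pi  # 2.0

exact = [math.comb(n, x) * pi ** x * (1 - pi) ** (n - x) for x in range(5)]
approx = [math.exp(-lam) * lam ** x / math.factorial(x) for x in range(5)]

for x in range(5):
    # e.g. x = 0: exact ~0.1326 vs approx ~0.1353 — close agreement
    print(x, round(exact[x], 4), round(approx[x], 4))
```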
The expected value is the probability-weighted average — NOT a simple average of all possible values, since unlikely values get less weight.
| Scenario | Distribution |
|---|---|
| Fixed n trials, two outcomes, fixed π, independence | Binomial(n, π) |
| Random events in continuous medium (time/area), independence | Poisson(λ) |
| Single trial with success/failure | Bernoulli(π) |
| k equally likely outcomes | Discrete Uniform(k) |
| Binomial with n > 30, nπ < 10, small π | Approximate with Poisson(nπ) |
Is there a fixed upper bound on the number of events? If yes → Binomial. If events could theoretically continue indefinitely → Poisson. Example: number of calls in one hour has no fixed upper bound → Poisson.
| Distribution | Mean | Variance |
|---|---|---|
| Uniform U(a, b) | \(\dfrac{a+b}{2}\) | \(\dfrac{(b-a)^2}{12}\) |
| Exponential Exp(λ) | \(\dfrac{1}{\lambda}\) | \(\dfrac{1}{\lambda^2}\) |
| Normal N(μ, σ²) | \(\mu\) | \(\sigma^2\) |
For a continuous random variable: \(P(X = x) = 0\) for every single value x.
This is not because the value is impossible — it's because there are infinitely many possible decimal values, so any single point has zero probability mass. Therefore \(P(a \le X \le b) = P(a < X < b) = \int_a^b f(x)\,dx\).
Endpoints make no difference. Probability comes from area under the PDF, not height.
The median m satisfies F(m) = 0.5. The mode is where f(x) is maximised.
| Given | To find | Operation |
|---|---|---|
| PDF f(x) | CDF F(x) | Integrate |
| CDF F(x) | PDF f(x) | Differentiate |
The most important distribution in statistics. Bell-shaped, symmetric.
Used for waiting times and times between events in a Poisson process.
If arrivals follow a Poisson process with rate λ events per unit time, then the time between consecutive events follows Exp(λ). The two distributions are naturally paired.
The exponential distribution has no memory: P(X > s+t | X > s) = P(X > t). If a machine has lasted s hours, the probability it lasts another t hours is the same as it was at the start. Past survival gives no information about future survival.
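Memorylessness can be checked directly from the exponential survival function \(P(X > x) = e^{-\lambda x}\); the values of λ, s and t below are arbitrary:

```python
import math

lam = 0.5  # arbitrary rate

def survival(x):
    """P(X > x) for X ~ Exp(lam)."""
    return math.exp(-lam * x)

s, t = 3.0, 2.0
conditional = survival(s + t) / survival(s)  # P(X > s+t | X > s)
print(round(conditional, 6), round(survival(t), 6))  # equal: memorylessness
```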
One of the most powerful results in statistics: for i.i.d. observations with mean μ and variance σ², the sample mean satisfies \(\bar{X} \approx N\!\left(\mu, \dfrac{\sigma^2}{n}\right)\) for large n.
Rule of thumb: n ≥ 30 is usually sufficient. This holds regardless of the shape of the original population distribution.
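A small simulation sketch of the CLT: sample means of a heavily right-skewed Exponential(1) population (μ = 1, σ = 1) already behave like N(μ, σ²/n) at n = 30. The seed and repetition count are arbitrary choices:

```python
import random
import statistics

random.seed(42)
n, reps = 30, 20000

# Each entry is the mean of n draws from a skewed Exponential(1) population
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

print(round(statistics.fmean(means), 3))  # close to mu = 1
print(round(statistics.stdev(means), 3))  # close to sigma/sqrt(n) ~ 0.183
```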
| Feature | Discrete | Continuous |
|---|---|---|
| P(X = x) | Can be positive | Always 0 |
| Probability tool | pmf p(x) | pdf f(x) — density, not probability |
| Interval prob. | Sum p(x) | Integrate f(x) — area under curve |
| CDF shape | Step function (jumps) | Smooth increasing function |
| Endpoint inclusion | Matters (< vs ≤) | Doesn't matter (zero prob at points) |
| Population quantity | Symbol | Sample counterpart | Symbol |
|---|---|---|---|
| Mean | μ | Sample mean | \(\bar{x}\) |
| Variance | σ² | Sample variance | s² |
| Standard deviation | σ | Sample s.d. | s |
| Proportion | π | Sample proportion | p |
The sampling distribution describes how the estimator varies across all possible samples.
The result \(\bar{X} \sim N(\mu, \sigma^2/n)\) is exact if the population is normal; approximate for large n by the CLT.
Standard error = standard deviation of the estimator. Decreases as n increases — larger samples give more precise estimates.
Critical exam point: standardise with \(Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}\) — divide by σ/√n, NOT by σ. The standard error accounts for averaging over n observations.
Let R = number of successes in n trials, so R ~ Bin(n, π). The sample proportion is P = R/n.
Use π (the true population proportion) in the theoretical variance, not p (the sample proportion). When π is unknown in practice, p is substituted, but always use the correct notation in theoretical derivations.
If data come from N(μ, σ²): \(\dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\). Equivalently, since \(S_{xx} = \sum_i (x_i - \bar{x})^2 = (n-1)S^2\): \(\dfrac{S_{xx}}{\sigma^2} \sim \chi^2_{n-1}\).
The chi-squared distribution is always non-negative and right-skewed. Different values of degrees of freedom k give different shapes — as k increases, the distribution approaches normality.
n = 15, σ² = 2, find P(S² > 4): since \(\dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\), \(P(S^2 > 4) = P\!\left(\chi^2_{14} > \dfrac{14 \times 4}{2}\right) = P(\chi^2_{14} > 28)\), which tables place between 0.01 and 0.025.
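This tail probability can also be computed without tables. For even degrees of freedom k = 2m, the chi-squared survival function has a closed form via the Gamma(m, 2) (Erlang) distribution, which a short script can evaluate:

```python
import math

def chi2_sf_even(x, k):
    """P(X > x) for X ~ chi-squared with EVEN k degrees of freedom,
    using the Erlang/Gamma(k/2, 2) closed form."""
    m = k // 2
    return math.exp(-x / 2) * sum((x / 2) ** i / math.factorial(i) for i in range(m))

n, sigma2 = 15, 2
threshold = (n - 1) * 4 / sigma2  # (n-1)*4/sigma^2 = 28.0
p = chi2_sf_even(threshold, n - 1)
print(round(p, 4))  # 0.0142 — indeed between 0.01 and 0.025
```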
| Estimator | Standard error | Decreases with |
|---|---|---|
| \(\bar{X}\) | \(\sigma/\sqrt{n}\) | Larger n |
| P | \(\sqrt{\pi(1-\pi)/n}\) | Larger n |
Doubling n reduces the standard error by a factor of √2, not 2. To halve the standard error, you need to quadruple n.
These three results underpin all of Chapters 6–8 (estimation and hypothesis testing). Master them before moving on.
| Bias value | Type | Meaning |
|---|---|---|
| > 0 | Positive bias | Estimator systematically overestimates θ |
| = 0 | Unbiased | Correct on average over repeated samples |
| < 0 | Negative bias | Estimator systematically underestimates θ |
Bias is the systematic component of error; variance is the random component. MSE captures both: \(\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2\).
| Parameter | Unbiased estimator | Result |
|---|---|---|
| μ (mean) | \(\bar{X}\) | \(E(\bar{X}) = \mu\) |
| π (proportion) | P = R/n | E(P) = π |
| σ² (variance) | S² | E(S²) = σ² |
This is why we divide by n−1 in the sample variance formula — it makes S² unbiased for σ². Dividing by n would give a biased (downward) estimator.
The sampling error from one specific sample. Unknown in practice (since θ is unknown), but its distribution is the sampling distribution studied in Chapter 5.
For estimating μ with three estimators from a sample of size n:
| Estimator | Bias | Variance | MSE |
|---|---|---|---|
| T₁ = X̄ | 0 | σ²/n | σ²/n |
| T₂ = (X₁+Xₙ)/2 | 0 | σ²/2 | σ²/2 |
| T₃ = X̄ + 3 | 3 | σ²/n | σ²/n + 9 |
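The MSE column in the table follows from MSE = Var + Bias²; plugging in concrete (illustrative) values σ² = 4, n = 10:

```python
sigma2, n = 4.0, 10  # illustrative values

# estimator name -> (bias, variance), as in the table above
estimators = {
    "T1 = Xbar":      (0.0, sigma2 / n),
    "T2 = (X1+Xn)/2": (0.0, sigma2 / 2),
    "T3 = Xbar + 3":  (3.0, sigma2 / n),
}

# MSE = variance + bias^2
mse = {name: var + bias ** 2 for name, (bias, var) in estimators.items()}
print(mse)  # T1: 0.4, T2: 2.0, T3: 9.4 — T1 wins
```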
Among all unbiased estimators of θ, the MVUE has the smallest variance.
So among unbiased estimators, smallest variance → smallest MSE → MVUE.
If θ̂₁ and θ̂₂ are both unbiased estimators of θ, and Var(θ̂₁) < Var(θ̂₂), then θ̂₁ is more efficient than θ̂₂.
A relative efficiency (defined as \(\text{Var}(\hat{\theta}_2)/\text{Var}(\hat{\theta}_1)\)) greater than 1 means θ̂₁ is more efficient.
If θ̂ is unbiased for θ, it does NOT follow that θ̂² is unbiased for θ² — a subtle but frequently tested point.
Since \(E(\hat{\theta}^2) = \text{Var}(\hat{\theta}) + [E(\hat{\theta})]^2 = \text{Var}(\hat{\theta}) + \theta^2\) and Var(θ̂) > 0, we get E(θ̂²) > θ², so θ̂² overestimates θ².
S² is unbiased for σ². But S = √S² is NOT unbiased for σ. The transformation breaks unbiasedness. Similarly, X̄ is unbiased for μ, but 1/X̄ is generally not unbiased for 1/μ.
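A simulation sketch of the bias of S for σ (seed, sample size and repetition count are arbitrary): over many small normal samples, the average of S falls systematically below σ even though the average of S² matches σ².

```python
import random
import statistics

random.seed(1)
sigma, n, reps = 1.0, 5, 20000

# Sample standard deviation S of many small normal samples
s_values = [statistics.stdev(random.gauss(0, sigma) for _ in range(n))
            for _ in range(reps)]

print(round(statistics.fmean(s_values), 3))  # ~0.94, clearly below sigma = 1
```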
Think of a target: variance is how scattered the shots are; bias is how far from the bullseye the centre of the shots is. MSE captures both simultaneously.
| Confidence level | z multiplier |
|---|---|
| 90% | 1.645 |
| 95% | 1.96 |
| 99% | 2.576 |
Higher confidence → wider interval. Lower confidence → narrower but less reliable interval.
A 95% CI does NOT mean there is 95% probability that this specific interval contains θ. Once calculated, θ is either in it or not. Correct: if we repeated the sampling many times and built a CI each time, about 95% of those intervals would contain θ.
If no pilot estimate for p: use p = 0.5 (maximises p(1−p) = 0.25, giving the most conservative/largest sample size).
Always round UP — even 96.04 becomes 97. Rounding down would make the margin of error too large.
Example: 95% CI, σ = 0.05, error ≤ 0.01: \(n \ge \left(\dfrac{1.96 \times 0.05}{0.01}\right)^2 = 96.04 \Rightarrow n = 97\).
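The round-up step in code, using the numbers from the example above:

```python
import math

z = 1.96      # 95% multiplier
sigma = 0.05
error = 0.01  # required bound on the margin of error

n_exact = (z * sigma / error) ** 2  # 96.04
n = math.ceil(n_exact)              # always round UP, never down
print(round(n_exact, 2), n)         # 96.04 97
```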
When σ is unknown, we estimate it with s and use the interval \(\bar{x} \pm t_{n-1}\,\dfrac{s}{\sqrt{n}}\), with the t multiplier taken at the appropriate α/2 point.
The t distribution is bell-shaped and symmetric like the normal, but has fatter tails. As ν → ∞, t_ν → N(0,1).
| Situation | Degrees of freedom ν |
|---|---|
| Single mean | n − 1 |
| Paired samples | n − 1 |
| Two means, equal variance | n₁ + n₂ − 2 |
| Regression slope/intercept | n − 2 |
Each estimated parameter costs one degree of freedom.
Use paired when the same subject is measured twice (e.g. before/after treatment). Use independent when different subjects are in each group. Pairing reduces variability by removing between-subject differences.
| Situation | Distribution | Why |
|---|---|---|
| σ known | z (standard normal) | Exact result from sampling theory |
| σ unknown, estimate with s | t_{n-1} | Extra uncertainty from estimating σ |
| Large n, σ unknown | z (approximately) | t_ν → N(0,1) as ν → ∞ |
| Two means, equal variances | t_{n1+n2-2} | Pooled variance estimated |
Hypothesis testing chooses between two competing statements about a population parameter.
| H₁ wording | Test type | Critical region |
|---|---|---|
| "Different from" (≠) | Two-tailed | Both tails |
| "Greater than" (>) | Upper-tailed | Right tail only |
| "Less than" (<) | Lower-tailed | Left tail only |
If direction is unspecified, default to two-tailed.
| True state | Decision | Outcome |
|---|---|---|
| H₀ true | Don't reject H₀ | ✓ Correct |
| H₀ true | Reject H₀ | ✗ Type I error (α) |
| H₁ true | Don't reject H₀ | ✗ Type II error (β) |
| H₁ true | Reject H₀ | ✓ Correct (Power = 1−β) |
Decreasing α (making it harder to reject H₀) increases β. There is a trade-off between the two error types.
| α | Critical values |
|---|---|
| 10% | ±1.645 |
| 5% | ±1.96 |
| 1% | ±2.576 |
| α | Critical value |
|---|---|
| 10% | 1.282 |
| 5% | 1.645 |
| 1% | 2.326 |
Lower-tailed critical values are the mirror image of the upper-tailed ones: multiply by −1 (e.g., 5% lower-tail → −1.645).
| Test type | P-value formula |
|---|---|
| Two-tailed | \(2P(Z \ge |z_{obs}|)\) |
| Upper-tailed | \(P(Z \ge z_{obs})\) |
| Lower-tailed | \(P(Z \le z_{obs})\) |
Reject H₀ if p-value < α. Do not reject if p-value ≥ α.
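P-values for z tests can be computed from the standard normal CDF, which the standard library gives via the error function (z_obs = 2.1 is an illustrative observed statistic):

```python
import math

def phi(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z_obs = 2.1  # illustrative observed test statistic

p_two = 2 * (1 - phi(abs(z_obs)))  # two-tailed
p_upper = 1 - phi(z_obs)           # upper-tailed
p_lower = phi(z_obs)               # lower-tailed
print(round(p_two, 4))             # 0.0357 — reject at 5%, do not reject at 1%
```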
For testing H₀: π = π₀, the test statistic uses π₀ in the denominator, not p: \(Z = \dfrac{p - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}\).
Why? Because under H₀, we assume π = π₀ is true, so we use the null value π₀ in the standard error. This differs from the confidence interval, which uses p in the ESE.
Under H₀: π₁ = π₂, both samples estimate the same common proportion π. Estimate it with the pooled proportion \(\hat{p} = \dfrac{r_1 + r_2}{n_1 + n_2}\) (total successes over total trials).
Use this pooled estimate in the test statistic standard error. Do NOT use p₁ and p₂ separately in the denominator when testing H₀: π₁ = π₂.
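A sketch of the pooled two-proportion z test with invented counts:

```python
import math

# Hypothetical counts: r successes out of n trials in each sample
r1, n1 = 45, 100
r2, n2 = 30, 100
p1, p2 = r1 / n1, r2 / n2

# Pooled estimate of the common proportion under H0: pi1 = pi2
p_pool = (r1 + r2) / (n1 + n2)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))  # pooled SE, not p1/p2 separately
z = (p1 - p2) / se
print(round(z, 3))  # 2.191 > 1.96, so reject H0 at the 5% level (two-tailed)
```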
| | Cat. 1 | Cat. 2 | Cat. 3 | Row total |
|---|---|---|---|---|
| Group A | O₁₁ | O₁₂ | O₁₃ | R₁ |
| Group B | O₂₁ | O₂₂ | O₂₃ | R₂ |
| Col. total | C₁ | C₂ | C₃ | N (grand) |
All expected frequencies E_ij must be ≥ 5 for the chi-squared approximation to be reliable. If some cells have E < 5, merge adjacent categories or use Fisher's exact test.
Under H₀ (independence), observed and expected frequencies should be similar. The test statistic measures the total discrepancy: \(\chi^2 = \sum_i \sum_j \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}\), where \(E_{ij} = \dfrac{R_i C_j}{N}\).
Chi-squared values are always ≥ 0 because they square the differences. So the test is always upper-tailed — large values reject H₀.
After rejecting H₀, look at the per-cell contributions \((O_{ij} - E_{ij})^2 / E_{ij}\) to χ².
Cells with large contributions explain the nature of the association. Example: if area A has much higher burglary than expected, burglary is the main problem in area A.
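A worked sketch on an invented 2×3 table: expected counts from the margins, per-cell contributions, and the total χ² statistic:

```python
# Hypothetical 2x3 contingency table of observed counts
observed = [[30, 20, 10],
            [20, 30, 40]]

row_totals = [sum(row) for row in observed]        # R_i
col_totals = [sum(col) for col in zip(*observed)]  # C_j
grand = sum(row_totals)                            # N

chi2 = 0.0
contributions = []
for i, row in enumerate(observed):
    contrib_row = []
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand  # E_ij = R_i * C_j / N
        contrib_row.append((o - e) ** 2 / e)       # per-cell contribution
    contributions.append(contrib_row)
    chi2 += sum(contrib_row)

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r-1)(c-1)
print(round(chi2, 3), df)  # 16.667 2 — large contributions flag the key cells
```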
Tests whether observed data follow a specified distribution.
H₀: die is fair → E_i = n/6 for each face. ν = 6 − 1 = 5.
For the association test, ν = (r−1)(c−1). For a 2×2 table: ν = 1. For a 3×3 table: ν = 4. Larger tables need larger χ² to reject H₀.
| Value of r | Interpretation |
|---|---|
| Near +1 | Strong positive linear relationship |
| Near −1 | Strong negative linear relationship |
| Near 0 | Weak / no linear relationship |
In simple linear regression only: \(R^2 = r^2\). This equivalence does NOT hold in multiple regression.
Interpretation: R² = 0.72 means 72% of the variation in y is explained by the linear relationship with x. The remaining 28% is unexplained (random noise).
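The quantities r, R² = r² and the least-squares slope all follow from the summary sums S_xx, S_yy, S_xy; a sketch with invented data:

```python
import math

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

r = sxy / math.sqrt(sxx * syy)  # correlation coefficient
r_squared = r ** 2              # equals R^2 in SIMPLE linear regression only
slope = sxy / sxx               # least-squares slope b1
print(round(r, 4), round(r_squared, 4), round(slope, 4))
```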
Uses n−2 degrees of freedom because two parameters (β₀ and β₁) are estimated.
Rejecting H₀ means evidence of a significant linear relationship between x and y.
If the CI contains 0, the slope may not differ significantly from zero.
Prediction becomes less reliable as x₀ moves further from x̄. Prediction variance increases with distance from x̄.
Avoid extrapolation: do not make predictions outside the observed range of x. The linear relationship may not continue, and the regression model was not built to describe those regions.
Standardised residuals above 2 are suspicious. Values above 3 strongly suggest an outlier. Residual plots help detect curvature (non-linearity), a funnel shape (non-constant variance), and isolated extreme points (outliers).
| Feature | Correlation | Regression |
|---|---|---|
| Purpose | Measure strength of linear relationship | Model and predict y from x |
| Symmetric? | Yes (r is same for x,y or y,x) | No (regress y on x ≠ regress x on y) |
| Units? | Dimensionless (−1 to 1) | β₁ in units of y per unit of x |
| Scale invariant? | Yes | No — rescaling x changes β₁ |
Correlation does not imply causation. Ice cream sales and sunscreen sales are positively correlated — both driven by warm weather. A significant slope in regression also does not prove causation.
| Level | z two-tail | z upper |
|---|---|---|
| 10% | ±1.645 | 1.282 |
| 5% | ±1.96 | 1.645 |
| 1% | ±2.576 | 2.326 |
"H₀: π=0.3, H₁: π≠0.3. Use π₀=0.3 in SE (not p). Z=(p−0.3)/√(0.3×0.7/n). α=5% two-tailed → reject if |Z|>1.96. Conclude: at 5% significance there is [sufficient/insufficient] evidence that the true proportion differs from 0.3."