LSE · ST107 · Probability & Statistics

ST107 Revision

10 chapters, Descriptive Statistics → Regression
01 · Data Visualisation & Descriptive Statistics

Types of variables; histograms; stem-and-leaf; dot plots; boxplots; measures of location; measures of dispersion; skewness; variance and standard deviation.
Types of variables
Type | Meaning | Examples
Discrete | Counted — non-negative integers | Number of passengers, calls per day
Continuous | Measured — can take decimals | Height, weight, time
Nominal categorical | Categories with no natural order | Political party, eye colour
Ordinal categorical | Categories with a natural order | Dissatisfied / indifferent / satisfied

Key distinction: measurable variables have a recognised measurement method; categorical variables classify observations into groups.

Measures of location
\[\bar{x} = \frac{\sum_{i=1}^n x_i}{n} \quad \text{(sample mean)}\]
\[\bar{x} = \frac{\sum f_k x_k}{\sum f_k} \quad \text{(frequency data)}\]

Median

\[\text{Odd } n: \quad x_{\left(\frac{n+1}{2}\right)} \qquad \text{Even } n: \quad \frac{x_{(n/2)} + x_{(n/2+1)}}{2}\]

Grouped-data median (interpolation)

\[\text{Median} = \text{lower bound} + \text{class width} \times \frac{\text{remaining observations needed}}{\text{class frequency}}\]

Skewness from location measures

Relationship | Skew type | Tail direction
Mean > median | Positive / right skew | Long tail to the right
Mean < median | Negative / left skew | Long tail to the left
Mean = median | Symmetric | Balanced
Measures of dispersion
\[\text{Range} = x_{(n)} - x_{(1)}\]
\[\text{IQR} = Q_3 - Q_1\]
\[S_{xx} = \sum(x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2\]
\[s^2 = \frac{S_{xx}}{n-1} \quad \text{(sample variance)}\]
\[s = \sqrt{s^2} \quad \text{(sample standard deviation)}\]

Why n−1? Dividing by n−1 instead of n makes s² an unbiased estimator of σ². It corrects for the fact that sample deviations are measured around the sample mean, not the true population mean.

Outlier rule (boxplot): a value is an outlier if it lies more than 1.5×IQR below Q₁ or above Q₃.
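
These location and dispersion measures are quick to compute; a minimal sketch with numpy on a hypothetical sample (note that ddof=1 gives the n−1 divisor, and that numpy's default percentile interpolation may differ slightly from table-based quartile conventions):

```python
import numpy as np

x = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 12.3])  # hypothetical data

print(x.mean(), np.median(x))        # measures of location
print(x.var(ddof=1), x.std(ddof=1))  # sample variance s^2 and s.d. s

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # boxplot fences
print(x[(x < lower) | (x > upper)])             # outliers (12.3 here)
```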

Histograms — area represents frequency

When class widths are unequal, use frequency density on the y-axis:

\[\text{Frequency density} = \frac{\text{Frequency}}{\text{Class width}}\]

Why this matters

A wider class that looks tall on a raw-frequency y-axis doesn't contain more data — it's just wider. Frequency density corrects for this so area (not height) represents frequency.

Example

Class | Width | Frequency | Freq. density
[300, 360) | 60 | 6 | 6/60 = 0.100
[360, 380) | 20 | 14 | 14/20 = 0.700

Histogram exam checklist

  • Informative title stating what the data are
  • x-axis label showing the variable
  • y-axis labelled "frequency density" (when widths unequal)
  • Sensible class boundaries (around 6–7 bins is a good guide)
  • Correct frequency densities calculated
  • Bars match intervals exactly — no gaps
Boxplots — five-number summary
Feature | What it shows
Middle line | Median
Bottom of box | Q₁ (lower quartile)
Top of box | Q₃ (upper quartile)
Box length | IQR
Whiskers | Furthest non-outlier values
Separate points | Outliers (>1.5×IQR from box)

Boxplots allow quick visual reading of median, quartiles, IQR, range, outliers, and skewness. If the median is closer to Q₁, the distribution is right-skewed; closer to Q₃ implies left-skewed.

Stem-and-leaf diagrams

Stem-and-leaf diagrams preserve the raw data while showing the distribution. The stem is usually tens/hundreds; the leaf is units.

Exam requirements

  • Informative title
  • Stem and leaf labels (include units)
  • Sensible stems — not too broad, not too narrow
  • Vertical alignment for readability
  • Leaves in ascending order
  • Every observation included exactly once

Example

Data: 350, 354, 364, 368 → Stem 35 | 0 4 and Stem 36 | 4 8

Choosing between diagrams
Diagram | Best for
Dot plot | Small datasets — shows clustering and gaps clearly
Histogram | Frequency distribution of discrete or continuous data
Stem-and-leaf | Preserving raw data while showing distribution
Boxplot | Comparing distributions; spotting outliers and skewness

A good diagram is clear, honest, highlights patterns, and allows the reader to extract information quickly. A bad diagram confuses or misleads.

Common exam traps — Ch.1
  • Histogram y-axis: if class widths are unequal, you MUST use frequency density. Raw frequency on the y-axis is wrong.
  • Mean vs median for skewness: mean > median → positive (right) skew. Mean < median → negative (left) skew.
  • Mode: modal value = most frequent single observation; modal class = class interval with highest frequency.
  • Range vs IQR: range is highly sensitive to outliers; IQR captures the middle 50% and is more robust.
  • Variance units: variance is in squared units (e.g., kg²). Standard deviation is in original units (kg). Always use s.d. for interpretation.
Mode clarification
Modal value
The single most frequently occurring raw observation. Can only be identified exactly from ungrouped data.
Modal class
The class interval with the highest frequency. Used for grouped data. Not the exact mode.

A dataset can be unimodal (one mode), bimodal (two modes), or multimodal (several modes).

02 · Probability Theory

Sample spaces; axioms; combinatorics; additive law; independence; conditional probability; Bayes' theorem; total probability.
Foundations: experiments, sample spaces, events

An experiment is a process with uncertain outcomes. The sample space S is the set of all possible outcomes. An event is any subset of S.

Experiment | Sample space S
Toss a coin | {H, T}
Roll a die | {1, 2, 3, 4, 5, 6}

Probability rules: \(0 \le P(A) \le 1\). If \(P(A) = 0\) the event is impossible; if \(P(A) = 1\) it is certain.

Equally likely outcomes

\[P(A) = \frac{n}{N}\]

where n = favourable outcomes, N = total equally likely outcomes.

Key set notation

Symbol | Meaning
\(A \cup B\) | A or B (union)
\(A \cap B\) | A and B (intersection)
\(A^c\) | Not A (complement)
\(A \mid B\) | A given B (conditional)
Core probability laws

Axioms

\[P(A) \ge 0 \quad P(S) = 1 \quad P\!\left(\bigcup_i A_i\right) = \sum_i P(A_i) \text{ (if mutually exclusive)}\]

Additive law

\[P(A \cup B) = P(A) + P(B) - P(A \cap B)\]

Complement

\[P(A^c) = 1 - P(A)\]

Independence

\[P(A \cap B) = P(A)\,P(B)\]

Conditional probability

\[P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0\]

Bayes' formula

\[P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}\]

Total probability

\[P(A) = P(A \mid B)\,P(B) + P(A \mid B^c)\,P(B^c)\]
Mutually exclusive vs independent — critical distinction
Mutually exclusive
Events cannot both occur: \(P(A \cap B) = 0\). If A happens, B cannot happen in the same trial. Example: rolling an even number AND an odd number on one die.
Independent
One event gives no information about the other: \(P(A|B) = P(A)\). Example: rolling two separate dice — first die doesn't affect second.

These are NOT the same. Mutually exclusive events with positive probability are actually dependent — knowing one occurred tells you the other didn't. If A and B are mutually exclusive and P(A) > 0 and P(B) > 0, then they CANNOT be independent.

Combinatorics — permutations and combinations
\[n! = n(n-1)(n-2)\cdots 1, \quad 0! = 1\]
\[{}^n P_r = \frac{n!}{(n-r)!} \quad \text{(permutations — ORDER matters)}\]
\[{}^n C_r = \binom{n}{r} = \frac{n!}{r!(n-r)!} \quad \text{(combinations — ORDER does NOT matter)}\]

Decision guide

Use permutations when:
"arrange", "queue", "rank", "order", "first/second/third place" — the position matters
Use combinations when:
"choose", "select", "committee", "group", "hand of cards" — only who is selected matters

Examples

Arranging 4 people in a queue from 6: \({}^6P_4 = 6\times5\times4\times3 = 360\)

Choosing 4 people from 6: \({}^6C_4 = \frac{6!}{4!\,2!} = 15\)
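
Python's math module has both counts built in (math.perm and math.comb, available from Python 3.8), which makes it easy to check these examples:

```python
from math import comb, perm, factorial

assert perm(6, 4) == 6 * 5 * 4 * 3 == 360   # queue of 4 from 6: order matters
assert comb(6, 4) == 15                     # choose 4 from 6: order irrelevant

# Relationship: nCr = nPr / r!, since every set of r items has r! orderings
assert comb(6, 4) == perm(6, 4) // factorial(4)
```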

Bayes' theorem — disease testing example

Suppose a disease has prevalence 1% (P(D) = 0.01). A test is 99% sensitive (P(+|D) = 0.99) and 95% specific (P(−|D^c) = 0.95, so P(+|D^c) = 0.05).

\[P(+) = P(+|D)P(D) + P(+|D^c)P(D^c)\]
\[= 0.99(0.01) + 0.05(0.99) = 0.0099 + 0.0495 = 0.0594\]
\[P(D|+) = \frac{P(+|D)\,P(D)}{P(+)} = \frac{0.0099}{0.0594} \approx 16.7\%\]

Despite the highly accurate test, a positive result only gives a ~17% probability of disease because the disease is rare (low base rate). False positives dominate when the disease prevalence is low. This is the classic base-rate fallacy.
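
The arithmetic is easy to verify; a short sketch reproducing the numbers above:

```python
p_d = 0.01                  # prevalence P(D)
p_pos_given_d = 0.99        # sensitivity P(+|D)
p_pos_given_not_d = 0.05    # 1 - specificity = P(+|D^c)

# Total probability: P(+) = P(+|D)P(D) + P(+|D^c)P(D^c)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes: P(D|+) = P(+|D)P(D) / P(+)
print(p_pos)                        # 0.0594
print(p_pos_given_d * p_d / p_pos)  # ~= 0.1667
```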

Conditional probability — exam methodology
  1. Draw a tree diagram or Venn diagram to visualise the probability space.
  2. Identify the conditioning event (the "given" event) — this becomes your new sample space.
  3. Apply: \(P(A|B) = P(A \cap B) / P(B)\). Denominator is always P(B), NOT P(A).
  4. For Bayes: use total probability to find P(B) first, then apply Bayes' formula.
  5. Check: is independence being assumed? If yes, \(P(A \cap B) = P(A)P(B)\) and \(P(A|B) = P(A)\).
Common exam traps — Ch.2
  • Conditional probability denominator: in P(A|B), divide by P(B), not P(A). The conditioning event is in the denominator.
  • Mutually exclusive ≠ independent: if two events with positive probability are mutually exclusive, they cannot be independent.
  • At least one trick: \(P(\text{at least one}) = 1 - P(\text{none})\). Often much easier to compute via the complement.
  • Permutation vs combination: order matters → permutation; order irrelevant → combination. Don't confuse these.
  • Equally likely assumption: P(A) = n/N only works when all outcomes are equally likely. Always check this assumption first.
Total probability and tree diagrams

For any partition B₁, B₂, …, Bₙ of S (mutually exclusive and collectively exhaustive):

\[P(A) = \sum_{i=1}^n P(A \mid B_i)\,P(B_i)\]

This is most naturally read from a tree diagram: multiply probabilities along branches, then add across branches that lead to A.

Partition conditions

Mutually exclusive
At most one B_i can occur — no overlap
Collectively exhaustive
At least one B_i must occur — they cover all possibilities
03 · Discrete Probability Distributions

Random variables; discrete uniform; Bernoulli; binomial; Poisson; CDF; expected value; variance; Poisson approximation to binomial.
Distributions — at a glance
Distribution | Mean | Variance | P(X = x)
Discrete Uniform(k) | \(\dfrac{k+1}{2}\) | \(\dfrac{k^2-1}{12}\) | \(\dfrac{1}{k}\)
Bernoulli(π) | \(\pi\) | \(\pi(1-\pi)\) | \(\pi^x(1-\pi)^{1-x}\)
Binomial(n, π) | \(n\pi\) | \(n\pi(1-\pi)\) | \(\binom{n}{x}\pi^x(1-\pi)^{n-x}\)
Poisson(λ) | \(\lambda\) | \(\lambda\) | \(\dfrac{e^{-\lambda}\lambda^x}{x!}\)

Poisson key feature: mean = variance = λ. This equality is the key identifying characteristic of a Poisson distribution.

Binomial distribution — conditions

X ~ Bin(n, π) applies when ALL four conditions hold:

Condition | Meaning
Two outcomes | Success or failure only
Fixed probability | Same π every trial
Fixed number of trials | n is known before the experiment
Independent trials | Each trial does not affect the next
\[P(X = x) = \binom{n}{x}\pi^x(1-\pi)^{n-x}, \quad x = 0, 1, \ldots, n\]

A binomial random variable has n+1 possible values (0 through n), not n.

Poisson distribution

Models the number of random events in a continuous medium (time, area, distance, volume).

\[P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}, \quad x = 0, 1, 2, \ldots\]

Poisson conditions

  • Events occur randomly in a continuous medium.
  • Each event is equally likely to occur anywhere.
  • Events are independent.
  • λ is the average rate per unit — must match the unit in the question.

Adjusting λ to match the unit

Context | λ
Average 3.2 breakdowns per week; question about 1 week | λ = 3.2
Same; question about 2 weeks | λ = 6.4
Same; question about half a week | λ = 1.6
CDF and probability wording
\[F(x) = P(X \le x) = \sum_{k=0}^{x} P(X=k)\]
\[P(X = x) = F(x) - F(x-1)\]

Translating exam wording

Wording | Mathematical form
Exactly x | \(P(X = x)\)
At most x | \(P(X \le x) = F(x)\)
Fewer than x | \(P(X < x) = P(X \le x-1) = F(x-1)\)
More than x | \(P(X > x) = 1 - F(x)\)
At least x | \(P(X \ge x) = 1 - F(x-1)\)

Most common error: "at least x" is \(1 - F(x-1)\), NOT \(1 - F(x)\). The "at least" boundary is included.
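
With a distribution object from scipy, each wording maps to one line; a sketch using a Poisson with an illustrative rate λ = 3.2 and x = 4 (both hypothetical):

```python
from scipy.stats import poisson

lam, x = 3.2, 4
F = poisson.cdf                   # F(x) = P(X <= x)

exactly    = poisson.pmf(x, lam)  # P(X = x)  = F(x) - F(x-1)
at_most    = F(x, lam)            # P(X <= x) = F(x)
fewer_than = F(x - 1, lam)        # P(X < x)  = F(x-1)
more_than  = 1 - F(x, lam)        # P(X > x)  = 1 - F(x)
at_least   = 1 - F(x - 1, lam)    # P(X >= x) = 1 - F(x-1), NOT 1 - F(x)
```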

Poisson approximation to binomial

If X ~ Bin(n, π), approximate with X ~ Pois(λ) where λ = nπ when:

  • n > 30 (many trials)
  • nπ < 10 (expected successes small)
  • π is extreme — very small, or very large (in which case count failures instead, so the rare outcome is the one modelled)
  • x is small relative to n (counting rare events)

Example

n = 100, π = 0.02:

\[\lambda = n\pi = 100 \times 0.02 = 2\]
\[\text{Bin}(100, 0.02) \approx \text{Pois}(2)\]

Poisson is simpler here because you don't need to compute \(\binom{100}{x}\) for each x.
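
The quality of the approximation is easy to inspect numerically; a sketch comparing the two pmfs with scipy for the example above:

```python
from scipy.stats import binom, poisson

n, pi = 100, 0.02
lam = n * pi  # = 2

for x in range(6):
    exact = binom.pmf(x, n, pi)
    approx = poisson.pmf(x, lam)
    print(f"P(X={x}): binomial {exact:.4f}, Poisson {approx:.4f}")
```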

Expected value and variance
\[E(X) = \mu = \sum_{i} x_i p_i\]

The expected value is the probability-weighted average — NOT a simple average of all possible values, since unlikely values get less weight.

\(\bar{x}\)
Sample mean from observed data — a statistic
\(\mu = E(X)\)
Population mean from theoretical distribution — a parameter

Variance shortcut

\[\text{Var}(X) = E(X^2) - [E(X)]^2\]
Choosing the right distribution
Scenario | Distribution
Fixed n trials, two outcomes, fixed π, independence | Binomial(n, π)
Random events in continuous medium (time/area), independence | Poisson(λ)
Single trial with success/failure | Bernoulli(π)
k equally likely outcomes | Discrete Uniform(k)
Binomial with n > 30, nπ < 10, small π | Approximate with Poisson(nπ)

Binomial vs Poisson — key question

Is there a fixed upper bound on the number of events? If yes → Binomial. If events could theoretically continue indefinitely → Poisson. Example: number of calls in one hour has no fixed upper bound → Poisson.

Common exam traps — Ch.3
  • "At least" uses x−1: P(X ≥ x) = 1 − P(X ≤ x−1) = 1 − F(x−1).
  • Binomial has n+1 values: X ~ Bin(n, π) can take values 0, 1, …, n. That's n+1 possible values.
  • Poisson units must match: always adjust λ to the time period/area in the question before computing.
  • CDF recovery: P(X = x) = F(x) − F(x−1). Do not forget to subtract F(x−1).
  • Poisson: mean = variance = λ. This is a key identifying feature when asked to justify model choice.
  • Check binomial conditions: if n is not fixed, or π changes, or observations are not independent, binomial does not apply.
04 · Continuous Probability Distributions

PDFs and CDFs; continuous uniform; exponential distribution; normal distribution; standard normal; standardisation; central limit theorem.
Core continuous distributions
Distribution | Mean | Variance
Uniform U(a, b) | \(\dfrac{a+b}{2}\) | \(\dfrac{(b-a)^2}{12}\)
Exponential Exp(λ) | \(\dfrac{1}{\lambda}\) | \(\dfrac{1}{\lambda^2}\)
Normal N(μ, σ²) | \(\mu\) | \(\sigma^2\)

PDFs

\[\text{Uniform: } f(x) = \frac{1}{b-a}, \quad a \le x \le b\]
\[\text{Exponential: } f(x) = \lambda e^{-\lambda x}, \quad x \ge 0\]
\[\text{Exponential CDF: } F(x) = 1 - e^{-\lambda x}\]
\[\text{Uniform CDF: } F(x) = \frac{x-a}{b-a}, \quad a \le x \le b\]
Continuous random variables — the key idea

For a continuous random variable:

\[P(X = x) = 0 \quad \text{for any single value } x\]

This is not because the value is impossible — it's because there are infinitely many possible decimal values, so any single point has zero probability mass. Therefore:

\[P(a < X < b) = P(a \le X < b) = P(a < X \le b) = P(a \le X \le b)\]

Endpoints make no difference. Probability comes from area under the PDF, not height.

Valid PDF conditions

\[f(x) \ge 0 \quad \text{and} \quad \int_S f(x)\,dx = 1\]

Computing probabilities

\[P(a < X < b) = \int_a^b f(x)\,dx = F(b) - F(a)\]
Mean and variance of continuous distributions
\[E(X) = \int_S x\,f(x)\,dx\]
\[\text{Var}(X) = E(X^2) - \mu^2, \quad E(X^2) = \int_S x^2 f(x)\,dx\]

The median m satisfies F(m) = 0.5. The mode is where f(x) is maximised.

PDF ↔ CDF relationship

\[F(x) = \int_{-\infty}^{x} f(t)\,dt \qquad f(x) = F'(x)\]
Given | To find | Operation
PDF f(x) | CDF F(x) | Integrate
CDF F(x) | PDF f(x) | Differentiate
Normal distribution

The most important distribution in statistics. Bell-shaped, symmetric.

\[X \sim N(\mu, \sigma^2)\]
  • Symmetric about μ, so mean = median = mode
  • Larger σ² → flatter, more spread out curve
  • No closed-form CDF — use tables for the standard normal Z ~ N(0,1)

Standardisation

\[Z = \frac{X - \mu}{\sigma} \sim N(0,1)\]
\[P(X < a) = P\!\left(Z < \frac{a - \mu}{\sigma}\right)\]

Standardisation exam method

  1. Write down the probability you want: e.g., P(X < 120).
  2. Standardise: Z = (120 − μ) / σ.
  3. Look up Φ(z) from the standard normal table.
  4. Use symmetry if z is negative: P(Z < −z) = 1 − P(Z < z) = 1 − Φ(z). (A numerical check of this method follows below.)
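
A quick check of the method as a sketch in Python using scipy; the values μ = 100, σ = 12 and the threshold 120 are hypothetical, chosen only to illustrate the lookup:

```python
from scipy.stats import norm

mu, sigma = 100, 12      # hypothetical parameters
z = (120 - mu) / sigma   # standardise: z = (x - mu) / sigma

# Phi(z) from the standard normal; same as norm.cdf(120, loc=mu, scale=sigma)
print(norm.cdf(z))       # P(X < 120) ~= 0.952

# Symmetry for negative z: P(Z < -z) = 1 - Phi(z)
print(norm.cdf(-z), 1 - norm.cdf(z))
```
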
Exponential distribution — key properties

Used for waiting times and times between events in a Poisson process.

\[P(X < x) = 1 - e^{-\lambda x} \qquad P(X > x) = e^{-\lambda x}\]

Poisson-Exponential link

If arrivals follow a Poisson process with rate λ events per unit time, then the time between consecutive events follows Exp(λ). The two distributions are naturally paired.

Memoryless property

The exponential distribution has no memory: P(X > s+t | X > s) = P(X > t). If a machine has lasted s hours, the probability it lasts another t hours is the same as it was at the start. Past survival gives no information about future survival.
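
The memoryless property can be checked by simulation; a minimal sketch with numpy, under assumed values λ = 0.5, s = 2, t = 3 (all hypothetical):

```python
import numpy as np

lam, s, t = 0.5, 2.0, 3.0
rng = np.random.default_rng(1)
x = rng.exponential(scale=1 / lam, size=1_000_000)  # Exp(lambda) draws

lhs = np.mean(x[x > s] > s + t)  # estimate of P(X > s+t | X > s)
rhs = np.exp(-lam * t)           # P(X > t) = e^{-lambda t} ~= 0.223
print(lhs, rhs)                  # agree up to simulation noise
```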

Central Limit Theorem

One of the most powerful results in statistics:

\[\bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{approximately, for large } n\]

Rule of thumb: n ≥ 30 is usually sufficient. This holds regardless of the shape of the original population distribution.

Population already normal
X̄ is exactly normal for any n, no matter how small
Population non-normal (e.g. exponential)
X̄ is approximately normal once n ≥ 30. More skewed populations need larger n
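
The CLT claim is easy to see by simulation. A sketch with numpy, drawing samples of size n = 30 from a right-skewed Exponential(1) population (mean 1, variance 1); the sizes and seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 30, 100_000

samples = rng.exponential(scale=1.0, size=(reps, n))
xbar = samples.mean(axis=1)  # one sample mean per replication

print(xbar.mean())           # ~= mu = 1
print(xbar.var())            # ~= sigma^2 / n = 1/30 ~= 0.033

# If X-bar is approximately normal, ~95% of means lie within 1.96 s.e. of mu
se = 1 / np.sqrt(n)
print(np.mean(np.abs(xbar - 1) <= 1.96 * se))
```
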
Continuous vs discrete — key differences
Feature | Discrete | Continuous
P(X = x) | Can be positive | Always 0
Probability tool | pmf p(x) | pdf f(x) — density, not probability
Interval prob. | Sum p(x) | Integrate f(x) — area under curve
CDF shape | Step function (jumps) | Smooth increasing function
Endpoint inclusion | Matters (< vs ≤) | Doesn't matter (zero prob at points)
Common exam traps — Ch.4
  • P(X = x) = 0 for continuous distributions — never write a positive value for exact-point probability.
  • PDF is not a probability. f(x) can exceed 1. Only area under f(x) gives probability.
  • Standardise correctly: for X, Z = (X−μ)/σ. For X̄, Z = (X̄−μ)/(σ/√n). Common error: forgetting the √n.
  • Normal tables: most tables give P(Z ≤ z). For P(Z ≥ z) use 1 − Φ(z). For negative z, use Φ(−z) = 1 − Φ(z).
  • CLT applies to X̄, not to individual X values. A single observation from a skewed distribution is not normal just because n = 30.
05 · Sampling Distributions of Statistics

Population vs sample; estimators; standard error; sampling distribution of X̄; sampling distribution of P; chi-squared distribution; sample variance.
Population vs sample — notation
Population quantity | Symbol | Sample counterpart | Symbol
Mean | μ | Sample mean | \(\bar{x}\)
Variance | σ² | Sample variance | s²
Standard deviation | σ | Sample s.d. | s
Proportion | π | Sample proportion | p

Estimator vs estimate

Estimator \(\hat{\theta}\)
A random variable (rule/formula). Varies from sample to sample. E.g., \(\bar{X} = \frac{1}{n}\sum X_i\)
Estimate \(\hat{\theta}\)
A specific number from one sample. E.g., if sample gives values 4,8,2,6 then \(\bar{x} = 5\)

The sampling distribution describes how the estimator varies across all possible samples.

Sampling distribution of X̄
\[E(\bar{X}) = \mu \qquad \text{Var}(\bar{X}) = \frac{\sigma^2}{n}\]
\[\bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right)\]

Exact if population is normal; approximate for large n by CLT.

Standard error

\[\text{S.E.}(\bar{X}) = \frac{\sigma}{\sqrt{n}}\]

Standard error = standard deviation of the estimator. Decreases as n increases — larger samples give more precise estimates.

Standardisation for X̄

\[Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)\]

Critical exam point: divide by σ/√n, NOT by σ. The standard error accounts for averaging over n observations.

Sampling distribution of sample proportion P

Let R = number of successes in n trials, so R ~ Bin(n, π). The sample proportion is P = R/n.

\[E(P) = \pi \qquad \text{Var}(P) = \frac{\pi(1-\pi)}{n}\]
\[P \sim N\!\left(\pi, \frac{\pi(1-\pi)}{n}\right) \quad \text{approximately (large } n\text{)}\]
\[\text{S.E.}(P) = \sqrt{\frac{\pi(1-\pi)}{n}}\]
\[Z = \frac{P - \pi}{\sqrt{\pi(1-\pi)/n}} \sim N(0,1)\]

Use π (the true population proportion) in the theoretical variance, not p (the sample proportion). When π is unknown in practice, p is substituted, but always use the correct notation in theoretical derivations.

Chi-squared distribution and sample variance

If data come from N(μ, σ²):

\[\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\]

Equivalently, since S_xx = (n−1)S²:

\[\frac{S_{xx}}{\sigma^2} \sim \chi^2_{n-1}\]

Chi-squared properties

\[E(\chi^2_k) = k \qquad \text{Var}(\chi^2_k) = 2k\]

The chi-squared distribution is always non-negative and right-skewed. Different values of degrees of freedom k give different shapes — as k increases, the distribution approaches normality.

Exam methodology for S² questions

  1. Identify n and σ² from the question.
  2. Transform: multiply both sides of the inequality by (n−1)/σ².
  3. The transformed quantity follows χ²_{n−1}.
  4. Look up the chi-squared table with n−1 degrees of freedom.

Example

n=15, σ²=2, find P(S² > 4):

\[P(S^2 > 4) = P\!\left(\chi^2_{14} > \frac{14 \times 4}{2}\right) = P(\chi^2_{14} > 28)\]
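
The same probability can be checked with scipy's chi-squared survival function (the upper-tail probability), using the values from the example above:

```python
from scipy.stats import chi2

n, sigma2 = 15, 2
threshold = (n - 1) * 4 / sigma2     # transform S^2 > 4 to the chi-squared scale
print(chi2.sf(threshold, df=n - 1))  # P(chi2_14 > 28) ~= 0.014
```
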
Standard error — key comparisons
Estimator | Standard error | Decreases with
\(\bar{X}\) | \(\sigma/\sqrt{n}\) | Larger n
P | \(\sqrt{\pi(1-\pi)/n}\) | Larger n

Doubling n reduces the standard error by a factor of √2, not 2. To halve the standard error, you need to quadruple n.

Var(X̄) vs S.E.(X̄)

Var(X̄) = σ²/n
The variance of the estimator. In squared units of the original data.
S.E.(X̄) = σ/√n
Standard deviation of the estimator. In original data units. Used for confidence intervals and standardisation.
Common exam traps — Ch.5
  • Divide by σ/√n, not σ: when standardising X̄, use Z = (X̄ − μ)/(σ/√n). The most common error in this chapter.
  • Var vs S.E.: Var(X̄) = σ²/n (squared units); S.E.(X̄) = σ/√n (original units). Don't mix them up.
  • Use π in theoretical variance of P: Var(P) = π(1−π)/n, not p(1−p)/n in theory.
  • Sample variance uses chi-squared: never use the normal distribution directly for S². Transform to (n−1)S²/σ² ~ χ²_{n-1} first.
  • Upper tail of chi-squared: chi-squared tables often give upper-tail probabilities. Check what your table gives.
Three sampling distributions summarised
\[\bar{X} \sim N\!\left(\mu,\, \frac{\sigma^2}{n}\right)\]
\[P \sim N\!\left(\pi,\, \frac{\pi(1-\pi)}{n}\right)\]
\[\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\]

These three results underpin all of Chapters 6–8 (estimation and hypothesis testing). Master them before moving on.

06 · Point Estimation

Bias; variance of an estimator; Mean Squared Error (MSE); comparing estimators; MVUE; unbiasedness of common estimators.
Bias, variance and MSE
\[\text{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta\]
\[\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2\]
\[\text{Unbiased: } E(\hat{\theta}) = \theta \implies \text{Bias} = 0 \implies \text{MSE} = \text{Var}(\hat{\theta})\]

Types of bias

Bias value | Type | Meaning
> 0 | Positive bias | Estimator systematically overestimates θ
= 0 | Unbiased | Correct on average over repeated samples
< 0 | Negative bias | Estimator systematically underestimates θ

Bias is the systematic component of error. Variance is the random component. MSE captures both.

Standard unbiased estimators
Parameter | Unbiased estimator | Result
μ (mean) | \(\bar{X}\) | \(E(\bar{X}) = \mu\)
π (proportion) | P = R/n | E(P) = π
σ² (variance) | S² | E(S²) = σ²

This is why we divide by n−1 in the sample variance formula — it makes S² unbiased for σ². Dividing by n would give a biased (downward) estimator.

Sampling error

\[\text{Sampling error} = \hat{\theta} - \theta\]

The sampling error from one specific sample. Unknown in practice (since θ is unknown), but its distribution is the sampling distribution studied in Chapter 5.

Comparing estimators — worked example

For estimating μ with three estimators from a sample of size n:

\[T_1 = \bar{X}, \quad T_2 = \frac{X_1 + X_n}{2}, \quad T_3 = \bar{X} + 3\]
Estimator | Bias | Variance | MSE
T₁ = X̄ | 0 | σ²/n | σ²/n
T₂ = (X₁+Xₙ)/2 | 0 | σ²/2 | σ²/2
T₃ = X̄ + 3 | 3 | σ²/n | σ²/n + 9
  • T₃ is always worse than T₁ — same variance but adds bias, so MSE is larger.
  • T₁ beats T₂ when n > 2 — T₁ uses all observations; T₂ only uses the first and last.
  • A biased estimator is not automatically worse — compare MSE, not just bias.
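
These bias, variance and MSE values can be verified by simulation; a sketch with numpy, assuming hypothetical values μ = 10, σ = 2 and n = 8 (so the theory predicts MSEs of σ²/n = 0.5, σ²/2 = 2.0 and σ²/n + 9 = 9.5):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 10.0, 2.0, 8, 200_000

x = rng.normal(mu, sigma, size=(reps, n))
t1 = x.mean(axis=1)               # T1 = X-bar
t2 = (x[:, 0] + x[:, -1]) / 2     # T2 = (X1 + Xn) / 2
t3 = t1 + 3                       # T3 = X-bar + 3

for name, t in [("T1", t1), ("T2", t2), ("T3", t3)]:
    bias = t.mean() - mu
    mse = np.mean((t - mu) ** 2)  # MSE = Var + Bias^2, estimated directly
    print(f"{name}: bias ~= {bias:.3f}, MSE ~= {mse:.3f}")
```
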
MVUE — minimum variance unbiased estimator

Among all unbiased estimators of θ, the MVUE has the smallest variance.

\[\text{For unbiased estimators: MSE} = \text{Var}(\hat{\theta})\]

So among unbiased estimators, smallest variance → smallest MSE → MVUE.

Relative efficiency

If θ̂₁ and θ̂₂ are both unbiased estimators of θ, and Var(θ̂₁) < Var(θ̂₂), then θ̂₁ is more efficient than θ̂₂.

\[\text{Relative efficiency} = \frac{\text{Var}(\hat{\theta}_2)}{\text{Var}(\hat{\theta}_1)}\]

A relative efficiency > 1 means θ̂₁ is more efficient.

Unbiasedness is not invariant under transformations

This is a subtle but frequently tested point.

If θ̂ is unbiased for θ, it does NOT follow that θ̂² is unbiased for θ².

\[E(\hat{\theta}^2) = \text{Var}(\hat{\theta}) + [E(\hat{\theta})]^2 = \text{Var}(\hat{\theta}) + \theta^2 > \theta^2\]

Since Var(θ̂) > 0, E(θ̂²) > θ², so θ̂² overestimates θ².

Practical implication

S² is unbiased for σ². But S = √S² is NOT unbiased for σ. The transformation breaks unbiasedness. Similarly, X̄ is unbiased for μ, but 1/X̄ is generally not unbiased for 1/μ.

Exam methodology — comparing estimators
  1. Find E(T) for each estimator to determine bias: Bias = E(T) − θ.
  2. Find Var(T) for each estimator.
  3. Compute MSE = Var(T) + [Bias(T)]².
  4. Choose the estimator with the smallest MSE.
  5. If both are unbiased, choose the one with smaller variance (= smaller MSE).

Common exam traps

  • A biased estimator can be preferred if its variance is much smaller (lower MSE).
  • Do not confuse "unbiased" with "good" — a high-variance unbiased estimator can be worse than a low-bias biased one.
  • T = X̄ uses all n observations. Estimators using only a few observations (like T₂ above) are generally less efficient.
MSE decomposition — visual intuition
Variance component
How widely the estimator spreads around its expected value. Captured by Var(θ̂). Reduced by larger sample size.
Bias² component
How far the estimator's expected value is from the true θ. Captured by [Bias]². Not reduced by larger n (systematic error).

Think of a target: variance is how scattered the shots are; bias is how far from the bullseye the centre of the shots is. MSE captures both simultaneously.

07 · Interval Estimation

Confidence intervals; t-distribution; CI for mean; CI for proportion; sample size determination; difference between two means; difference between two proportions; paired samples.
Confidence interval — structure
\[\hat{\theta} \pm z_{\alpha/2} \times \text{S.E.}(\hat{\theta}) \quad \text{(σ known)}\]
\[\hat{\theta} \pm t_{\alpha/2,\,\nu} \times \text{E.S.E.}(\hat{\theta}) \quad \text{(σ unknown)}\]

Common multipliers

Confidence level | z multiplier
90% | 1.645
95% | 1.96
99% | 2.576

Higher confidence → wider interval. Lower confidence → narrower but less reliable interval.

Correct interpretation

A 95% CI does NOT mean there is 95% probability that this specific interval contains θ. Once calculated, θ is either in it or not. Correct: if we repeated the sampling many times and built a CI each time, about 95% of those intervals would contain θ.

All confidence interval formulas

Single mean, σ known

\[\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\]

Single mean, σ unknown (use t_{n-1})

\[\bar{x} \pm t_{\alpha/2,\,n-1} \frac{s}{\sqrt{n}}\]

Single proportion

\[p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}\]

Difference in means, σ known

\[(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}\]

Difference in means, equal variance pooled (t_{n1+n2-2})

\[s_p^2 = \frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}\]
\[(\bar{x}_1-\bar{x}_2) \pm t_{\alpha/2,n_1+n_2-2}\sqrt{s_p^2\!\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}\]

Paired samples (reduce to single mean)

\[d_i = x_i - y_i, \quad \bar{d} \pm t_{\alpha/2,\,n-1}\frac{s_d}{\sqrt{n}}\]
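
All of these intervals follow the same pattern: estimate ± multiplier × (estimated) standard error. A sketch of the single-mean case with σ unknown, using a hypothetical sample and scipy for the t critical value:

```python
import numpy as np
from scipy import stats

x = np.array([12.1, 11.4, 13.0, 12.7, 11.9, 12.4])  # hypothetical data
n = len(x)
xbar, s = x.mean(), x.std(ddof=1)      # ddof=1 gives the n-1 divisor

t_crit = stats.t.ppf(0.975, df=n - 1)  # t_{0.025, n-1} for 95% confidence
half_width = t_crit * s / np.sqrt(n)
print(f"95% CI: ({xbar - half_width:.3f}, {xbar + half_width:.3f})")
```
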
Sample size determination

For estimating a mean

\[e = z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \implies n \ge \frac{z_{\alpha/2}^2\,\sigma^2}{e^2}\]

For estimating a proportion

\[n \ge \frac{z_{\alpha/2}^2\,p(1-p)}{e^2}\]

If no pilot estimate for p: use p = 0.5 (maximises p(1−p) = 0.25, giving the most conservative/largest sample size).

Always round UP — even 96.04 becomes 97. Rounding down would make the margin of error too large.

Example

95% CI, σ = 0.05, error ≤ 0.01:

\[n \ge \frac{1.96^2 \times 0.05^2}{0.01^2} = \frac{0.009604}{0.0001} = 96.04 \implies n = 97\]
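
The same calculation in Python, rounding up with math.ceil as the rule above requires (using the exact z value rather than the rounded 1.96):

```python
import math
from scipy.stats import norm

sigma, e = 0.05, 0.01
z = norm.ppf(0.975)                  # ~= 1.96 for 95% confidence
n = math.ceil((z * sigma / e) ** 2)  # round UP, never down
print(n)                             # 97
```
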
Student's t-distribution

When σ is unknown, we estimate it with s, and use:

\[\frac{\hat{\theta} - \theta}{\text{E.S.E.}(\hat{\theta})} \sim t_\nu\]

The t distribution is bell-shaped and symmetric like the normal, but has fatter tails. As ν → ∞, t_ν → N(0,1).

Degrees of freedom

Situation | Degrees of freedom ν
Single mean | n − 1
Paired samples | n − 1
Two means, equal variance | n₁ + n₂ − 2
Regression slope/intercept | n − 2

Each estimated parameter costs one degree of freedom.

Difference between two proportions
\[\text{E.S.E.}(P_1-P_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\]
\[(p_1 - p_2) \pm z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\]

Interpreting the interval

Interval contains 0
No significant evidence that π₁ ≠ π₂. The difference is consistent with zero.
Interval excludes 0
Evidence that the two proportions differ significantly.

Paired vs independent samples

Use paired when the same subject is measured twice (e.g. before/after treatment). Use independent when different subjects are in each group. Pairing reduces variability by removing between-subject differences.

z vs t — when to use which
Situation | Distribution | Why
σ known | z (standard normal) | Exact result from sampling theory
σ unknown, estimate with s | t_{n−1} | Extra uncertainty from estimating σ
Large n, σ unknown | z (approximately) | t_ν → N(0,1) as ν → ∞
Two means, equal variances | t_{n₁+n₂−2} | Pooled variance estimated
Common exam traps — Ch.7
  • Confidence interval interpretation: do NOT say "95% probability that μ is in this interval." The interval is either right or wrong — say "if we repeated this procedure, 95% of intervals constructed would contain μ."
  • Round sample size UP: n = 96.04 requires n = 97, not 96.
  • Use conservative p = 0.5 for sample size when no pilot estimate is available.
  • z vs t: use t when σ is estimated from data. Use z only when σ is known exactly.
  • Degrees of freedom: single mean → n−1; two means (equal variance) → n₁+n₂−2; paired → n−1 (treat as single sample of differences).
  • Wider confidence → larger z multiplier → wider interval. 99% CI is always wider than 95% CI.
08 · Hypothesis Testing

H₀ and H₁; Type I and II errors; critical region; p-values; tests for mean; tests for proportion; two-sample tests; paired tests.
Hypothesis testing framework

Hypothesis testing chooses between two competing statements about a population parameter.

H₀ — null hypothesis
Always contains equality (=, ≤, ≥). Represents "no effect", "no difference", "no improvement". Assumed true initially.
H₁ — alternative hypothesis
What you are looking for evidence of. Determines the type of test. Does NOT contain equality.

Types of test

H₁ wording | Test type | Critical region
"Different from" (≠) | Two-tailed | Both tails
"Greater than" (>) | Upper-tailed | Right tail only
"Less than" (<) | Lower-tailed | Left tail only

If direction is unspecified, default to two-tailed.

All hypothesis test formulas

Single mean, σ known

\[Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0,1)\]

Single mean, σ unknown

\[T = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} \sim t_{n-1}\]

Single proportion

\[Z = \frac{P - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} \sim N(0,1)\]

Two proportions (pooled under H₀)

\[\hat{p} = \frac{r_1+r_2}{n_1+n_2}, \quad Z = \frac{p_1-p_2}{\sqrt{\hat{p}(1-\hat{p})(1/n_1+1/n_2)}}\]

Two means, variances known

\[Z = \frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)}{\sqrt{\sigma_1^2/n_1+\sigma_2^2/n_2}}\]

Paired samples

\[d_i = x_i - y_i, \quad T = \frac{\bar{d}}{s_d/\sqrt{n}} \sim t_{n-1}\]
Type I and Type II errors
True state | Decision | Outcome
H₀ true | Don't reject H₀ | ✓ Correct
H₀ true | Reject H₀ | ✗ Type I error (α)
H₁ true | Don't reject H₀ | ✗ Type II error (β)
H₁ true | Reject H₀ | ✓ Correct (Power = 1−β)
\[\alpha = P(\text{Type I error}) = P(\text{reject } H_0 \mid H_0 \text{ true})\]
\[\text{Power} = 1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ false})\]

Decreasing α (making it harder to reject H₀) increases β. There is a trade-off between the two error types.

Critical values table

Two-tailed (z_{α/2})

α | Critical values
10% | ±1.645
5% | ±1.96
1% | ±2.576

Upper-tailed (z_α)

α | Critical value
10% | 1.282
5% | 1.645
1% | 2.326

Lower-tailed

Mirror the upper-tailed values with a sign change: e.g., the 5% lower-tail critical value is −1.645.

P-value method

Test type | P-value formula
Two-tailed | \(2P(Z \ge |z_{obs}|)\)
Upper-tailed | \(P(Z \ge z_{obs})\)
Lower-tailed | \(P(Z \le z_{obs})\)

Reject H₀ if p-value < α. Do not reject if p-value ≥ α.

Proportion test — key distinction

For testing H₀: π = π₀, the test statistic uses π₀ in the denominator, not p:

\[Z = \frac{P - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}\]

Why? Because under H₀, we assume π = π₀ is true, so we use the null value π₀ in the standard error. This differs from the confidence interval, which uses p in the ESE.

Two-proportion test — pooled proportion

Under H₀: π₁ = π₂, both samples estimate the same common proportion π. Estimate it using:

\[\hat{p} = \frac{r_1 + r_2}{n_1 + n_2}\]

Use this pooled estimate in the test statistic standard error. Do NOT use p₁ and p₂ separately in the denominator when testing H₀: π₁ = π₂.
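
A sketch of the full two-proportion test as a Python function; the counts in the usage line are hypothetical:

```python
import math
from scipy.stats import norm

def two_prop_z(r1, n1, r2, n2):
    """Z test of H0: pi1 = pi2 against a two-sided alternative."""
    p1, p2 = r1 / n1, r2 / n2
    p_pool = (r1 + r2) / (n1 + n2)  # pooled proportion under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * norm.sf(abs(z))   # two-tailed: 2 * P(Z >= |z|)
    return z, p_value

print(two_prop_z(45, 200, 30, 220)) # hypothetical success counts
```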

Hypothesis testing — step-by-step method
  1. State H₀ and H₁ clearly. H₀ always has equality.
  2. Choose significance level α (usually 5%).
  3. Identify the appropriate test statistic and its distribution under H₀.
  4. Calculate the observed test statistic from the data.
  5. Find the critical value(s) or compute the p-value.
  6. Make the decision: reject or fail to reject H₀.
  7. State the conclusion in context — never just say "reject H₀" without interpretation.
Common exam traps — Ch.8
  • Never "accept H₀": the correct phrase is "fail to reject H₀" or "there is insufficient evidence to reject H₀." Failure to reject is not the same as proving H₀ is true.
  • H₀ always has equality. Never write H₀: μ ≠ μ₀. The null is always μ = μ₀ (or ≤ or ≥).
  • Proportion test: use π₀ in SE. Using p in the test statistic denominator is wrong — use π₀.
  • Two-proportion: use pooled p̂. Under H₀: π₁ = π₂, pool both samples for the SE.
  • P-value is not α. The p-value is the probability of the observed (or more extreme) result under H₀. If p < α, reject.
  • Two-tailed p-value = 2 × one-tailed. For a two-tailed test, multiply the one-tail probability by 2.
09 · Contingency Tables & Chi-Squared Tests

Association vs correlation; observed and expected frequencies; chi-squared test statistic; goodness-of-fit; degrees of freedom; interpreting results.
Correlation vs association
Correlation
Both variables are measurable (continuous or discrete). Use Pearson's r. Example: height and weight.
Association
Both variables are categorical. Use chi-squared test. Example: hair colour and eye colour. If one variable is measurable, convert it into categories first (e.g., age → age group).

Hypotheses for association test

\[H_0: \text{the two factors are independent (not associated)}\]
\[H_1: \text{the two factors are associated (not independent)}\]
Chi-squared test formulas

Expected frequencies (under H₀)

\[E_{ij} = \frac{\text{row } i \text{ total} \times \text{column } j \text{ total}}{\text{grand total}}\]

Test statistic

\[\chi^2 = \sum_{i=1}^r \sum_{j=1}^c \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]

Degrees of freedom

\[\nu = (r-1)(c-1) \quad \text{(association test)}\]
\[\nu = k - 1 \quad \text{(goodness-of-fit, } k \text{ categories)}\]
\[\nu = k - 1 - m \quad \text{(if } m \text{ parameters estimated from data)}\]

Decision rule

\[\text{Reject } H_0 \text{ if } \chi^2 > \chi^2_{\alpha,\nu} \quad \text{(always upper-tailed)}\]
Contingency table — worked structure
Group | Cat. 1 | Cat. 2 | Cat. 3 | Row total
Group A | O₁₁ | O₁₂ | O₁₃ | R₁
Group B | O₂₁ | O₂₂ | O₂₃ | R₂
Col. total | C₁ | C₂ | C₃ | N (grand)
\[E_{11} = \frac{R_1 \times C_1}{N}\]

Condition for validity

All expected frequencies E_ij must be ≥ 5 for the chi-squared approximation to be reliable. If some cells have E < 5, merge adjacent categories or use Fisher's exact test.

Chi-squared logic — why it works

Under H₀ (independence), observed and expected frequencies should be similar. The test statistic measures the total discrepancy:

  • If O_ij ≈ E_ij for all cells → χ² is small → no evidence against H₀
  • If O_ij differs substantially from E_ij → χ² is large → evidence against H₀

Chi-squared values are always ≥ 0 because they square the differences. So the test is always upper-tailed — large values reject H₀.

Identifying which cells drive the association

After rejecting H₀, look at per-cell contributions to χ²:

\[\text{Cell contribution} = \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]

Cells with large contributions explain the nature of the association. Example: if area A has much higher burglary than expected, burglary is the main problem in area A.
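
The whole association test, including per-cell contributions, fits in a few lines of numpy; the 2×3 table below is hypothetical:

```python
import numpy as np
from scipy.stats import chi2

O = np.array([[30, 20, 10],               # hypothetical observed counts
              [20, 30, 40]])

row, col, N = O.sum(axis=1), O.sum(axis=0), O.sum()
E = np.outer(row, col) / N                # E_ij = R_i * C_j / N (all >= 5 here)
contrib = (O - E) ** 2 / E                # per-cell contributions to chi-squared
chi2_stat = contrib.sum()
df = (O.shape[0] - 1) * (O.shape[1] - 1)  # (r-1)(c-1)

print(chi2_stat, chi2.sf(chi2_stat, df))  # statistic and upper-tail p-value
print(contrib.round(2))                   # which cells drive the association
```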

Goodness-of-fit test

Tests whether observed data follow a specified distribution.

\[H_0: \text{data follow stated distribution}\]
\[H_1: \text{data do not follow stated distribution}\]

Procedure

  1. Specify the hypothesised distribution and calculate expected frequencies E_i = n × P(category i under H₀).
  2. Calculate χ² = Σ(O_i − E_i)² / E_i.
  3. Degrees of freedom = k − 1 − (number of parameters estimated from data).
  4. Compare to χ²_{α,ν}, the upper-tail critical value. Reject if χ² exceeds this.

Example: fair die

H₀: die is fair → E_i = n/6 for each face. ν = 6 − 1 = 5.
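
A sketch of this fair-die test with scipy.stats.chisquare (observed counts hypothetical; by default it uses k − 1 degrees of freedom, which matches this case since no parameters are estimated):

```python
from scipy.stats import chisquare

observed = [18, 23, 16, 21, 18, 24]      # hypothetical rolls, n = 120
expected = [sum(observed) / 6] * 6       # E_i = n/6 under H0: fair die

stat, p = chisquare(observed, expected)  # chi2 = 2.5 here, df = 5
print(stat, p)                           # large p: fail to reject H0
```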

Common exam traps — Ch.9
  • Chi-squared is always upper-tailed. Large χ² = evidence against H₀. There is no lower-tailed chi-squared test for association.
  • Expected frequencies must be ≥ 5. If not, merge categories before computing the test statistic.
  • Use the correct degrees of freedom: (r−1)(c−1) for association; k−1 for goodness-of-fit (minus any estimated parameters).
  • Correlation uses r, association uses χ². Using Pearson's r for categorical data is wrong.
  • After rejection, interpret which cells drive the result — don't just say "there is association" without identifying where.
Chi-squared distribution properties
\[E(\chi^2_k) = k, \quad \text{Var}(\chi^2_k) = 2k\]
  • Always non-negative (chi-squared cannot be negative)
  • Right-skewed (especially for small k)
  • Approaches normality as k increases
  • Depends only on degrees of freedom k

For the association test, ν = (r−1)(c−1). For a 2×2 table: ν = 1. For a 3×3 table: ν = 4. Larger tables need larger χ² to reject H₀.

10 · Correlation & Linear Regression

Scatterplots; Pearson's r; covariance; simple linear regression; least squares; R²; residual variance; inference for slope; prediction; residual analysis.
Correlation and sum-of-squares

Shortcut formulas

\[S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}\]
\[S_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n}\]
\[S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}\]

Sample correlation coefficient

\[r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}, \quad -1 \le r \le 1\]
Value of r | Interpretation
Near +1 | Strong positive linear relationship
Near −1 | Strong negative linear relationship
Near 0 | Weak / no linear relationship
Simple linear regression — model and estimation
\[Y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]

Least squares estimators

\[\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}\]
\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]

Fitted value and residual

\[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \qquad \hat{\epsilon}_i = y_i - \hat{y}_i\]

Interpreting coefficients

Slope \(\hat{\beta}_1\)
For each 1-unit increase in x, predicted y changes by \(\hat{\beta}_1\) on average. Sign indicates direction of relationship.
Intercept \(\hat{\beta}_0\)
Predicted y when x = 0. May not be meaningful if x = 0 is outside the data range.
Variation decomposition and R²
\[TSS = S_{yy} \quad \text{(total sum of squares)}\]
\[ESS = \frac{S_{xy}^2}{S_{xx}} \quad \text{(explained / regression SS)}\]
\[RSS = S_{yy} - \frac{S_{xy}^2}{S_{xx}} \quad \text{(residual SS)}\]
\[TSS = ESS + RSS\]
\[R^2 = \frac{ESS}{TSS} = \frac{S_{xy}^2/S_{xx}}{S_{yy}}\]

In simple linear regression only: \(R^2 = r^2\). This equivalence does NOT hold in multiple regression.

Interpretation: R² = 0.72 means 72% of the variation in y is explained by the linear relationship with x. The remaining 28% is unexplained (random noise).

Residual variance and inference
\[s^2 = \frac{RSS}{n-2} \quad \text{(residual variance)}\]

Uses n−2 degrees of freedom because two parameters (β₀ and β₁) are estimated.

Standard errors of estimators

\[\text{E.S.E.}(\hat{\beta}_1) = \frac{s}{\sqrt{S_{xx}}}\]
\[\text{E.S.E.}(\hat{\beta}_0) = s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}\]

Hypothesis test for slope

\[H_0: \beta_1 = 0 \quad H_1: \beta_1 \ne 0\]
\[T = \frac{\hat{\beta}_1}{s/\sqrt{S_{xx}}} \sim t_{n-2}\]

Rejecting H₀ means evidence of a significant linear relationship between x and y.

Confidence interval for slope

\[\hat{\beta}_1 \pm t_{\alpha/2,\,n-2}\,\frac{s}{\sqrt{S_{xx}}}\]

If the CI contains 0, the slope may not differ significantly from zero.

Prediction and extrapolation
\[\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_0\]

Prediction becomes less reliable as x₀ moves further from x̄. Prediction variance increases with distance from x̄.

Avoid extrapolation: do not make predictions outside the observed range of x. The linear relationship may not continue, and the regression model was not built to describe those regions.

Residual analysis

\[\hat{\epsilon}_i = y_i - \hat{y}_i\]

Standardised residuals above 2 in absolute value are suspicious; values above 3 strongly suggest an outlier. Residual plots help detect:

  • Outliers — unusual individual points
  • Non-linearity — curved pattern in residuals
  • Heteroscedasticity — increasing/decreasing spread of residuals
Correlation vs regression — key distinctions
Feature | Correlation | Regression
Purpose | Measure strength of linear relationship | Model and predict y from x
Symmetric? | Yes (r is same for x,y or y,x) | No (regress y on x ≠ regress x on y)
Units? | Dimensionless (−1 to 1) | β₁ in units of y per unit of x
Scale invariant? | Yes | No — rescaling x changes β₁

Correlation does not imply causation. Ice cream sales and sunscreen sales are positively correlated — both driven by warm weather. A significant slope in regression also does not prove causation.

Common exam traps — Ch.10
  • R² measures explained variation, not causation. A high R² doesn't mean x causes y.
  • In SLR only: R² = r². This equivalence fails in multiple regression.
  • Zero correlation ≠ no relationship. There could be a non-linear relationship that r misses entirely.
  • Avoid extrapolation: predictions outside the observed x-range are unreliable.
  • Regression df = n−2: two parameters estimated (β₀ and β₁) → n−2 degrees of freedom for residual variance.
  • Intercept interpretation: only meaningful if x = 0 is within the data range. Often just a mathematical necessity with no real interpretation.
Exam methodology — regression question
  1. Compute S_xx, S_yy, S_xy using the shortcut formulas. Needed: Σx, Σy, Σx², Σy², Σxy, and n.
  2. Slope: β̂₁ = S_xy / S_xx. Intercept: β̂₀ = ȳ − β̂₁x̄.
  3. Compute RSS = S_yy − S_xy²/S_xx. Then s² = RSS/(n−2).
  4. R² = ESS/TSS = (S_xy²/S_xx) / S_yy.
  5. For inference on the slope: compute ESE(β̂₁) = s/√S_xx, then T = β̂₁/ESE, compare to t_{n−2}.
  6. For prediction: substitute x₀ into the fitted equation (see the sketch below). State limitations if x₀ is far from x̄.
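
A sketch of this workflow in Python, starting from hypothetical exam-style summary sums (Σx, Σy, Σx², Σy², Σxy, n):

```python
import math
from scipy.stats import t as t_dist

n = 10
sum_x, sum_y = 55.0, 80.2                     # hypothetical totals
sum_x2, sum_y2, sum_xy = 385.0, 723.5, 520.3  # hypothetical sums of squares

Sxx = sum_x2 - sum_x**2 / n
Syy = sum_y2 - sum_y**2 / n
Sxy = sum_xy - sum_x * sum_y / n

b1 = Sxy / Sxx                   # slope
b0 = sum_y / n - b1 * sum_x / n  # intercept: y-bar minus b1 * x-bar
RSS = Syy - Sxy**2 / Sxx
s2 = RSS / (n - 2)               # residual variance, df = n - 2
R2 = (Sxy**2 / Sxx) / Syy

ese_b1 = math.sqrt(s2 / Sxx)     # estimated standard error of the slope
T = b1 / ese_b1                  # test statistic for H0: beta1 = 0
p = 2 * t_dist.sf(abs(T), df=n - 2)
print(b1, b0, R2, T, p)

x0 = 5.0                         # hypothetical prediction point
print(b0 + b1 * x0)              # fitted value; unreliable far from x-bar
```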

Final Exam Map — Everything to Know Cold

All key equations; chapter-by-chapter distinctions; critical values; 5-step template.
Essential equations
\[\bar{x}=\frac{\sum x_i}{n},\quad S_{xx}=\sum x_i^2-n\bar{x}^2,\quad s^2=\frac{S_{xx}}{n-1}\]
\[\text{Freq.density}=\frac{\text{Freq}}{\text{Width}},\quad IQR=Q_3-Q_1\]
\[P(A|B)=\frac{P(A\cap B)}{P(B)},\quad P(A|B)=\frac{P(B|A)P(A)}{P(B)}\]
\[{}^nC_r=\frac{n!}{r!(n-r)!},\quad {}^nP_r=\frac{n!}{(n-r)!}\]
\[P(X=x)=\binom{n}{x}\pi^x(1-\pi)^{n-x},\quad P(X=x)=\frac{e^{-\lambda}\lambda^x}{x!}\]
\[\bar{X}\sim N\!\left(\mu,\frac{\sigma^2}{n}\right),\quad P\sim N\!\left(\pi,\frac{\pi(1-\pi)}{n}\right),\quad \frac{(n-1)S^2}{\sigma^2}\sim\chi^2_{n-1}\]
\[\text{MSE}(\hat\theta)=\text{Var}(\hat\theta)+[\text{Bias}(\hat\theta)]^2\]
\[\bar{x}\pm t_{\alpha/2,n-1}\frac{s}{\sqrt{n}},\quad p\pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}},\quad n\ge\frac{z_{\alpha/2}^2\sigma^2}{e^2}\]
\[Z=\frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}},\quad T=\frac{\bar{X}-\mu_0}{s/\sqrt{n}},\quad Z=\frac{P-\pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}\]
\[E_{ij}=\frac{R_i\times C_j}{N},\quad \chi^2=\sum\frac{(O-E)^2}{E},\quad \nu=(r-1)(c-1)\]
\[r=\frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}},\quad \hat\beta_1=\frac{S_{xy}}{S_{xx}},\quad \hat\beta_0=\bar{y}-\hat\beta_1\bar{x}\]
\[RSS=S_{yy}-\frac{S_{xy}^2}{S_{xx}},\quad s^2=\frac{RSS}{n-2},\quad R^2=r^2\text{ (SLR only)}\]
Key distinctions — chapter by chapter
  • C1: Freq. density on histogram y-axis when widths unequal. Mean > median = right skew. Outlier > 1.5×IQR from box.
  • C2: Mutually exclusive ≠ independent. P(A|B) denominator = P(B). At least one → complement.
  • C3: "At least x" = 1−F(x−1). Binomial: n+1 values. Scale λ to units. Poisson: mean=variance=λ.
  • C4: P(X=x)=0 for continuous. f(x) is density not probability. Standardise X̄ with σ/√n. CLT → X̄ only.
  • C5: S.E.(X̄)=σ/√n. Use π not p in Var(P). S² needs chi-squared.
  • C6: MSE=Var+Bias². Biased can be better if MSE lower. Unbiasedness not preserved under transforms.
  • C7: CI = long-run coverage. Round n UP. Use p=0.5 if no pilot. z (σ known); t_{n-1} (σ unknown).
  • C8: H₀ has equality. Never "accept H₀." Proportion: use π₀ in SE. Two-proportion: pool p̂.
  • C9: Chi-squared always upper-tailed. E_ij ≥ 5. df=(r−1)(c−1).
  • C10: r=0 ≠ no relationship. Corr ≠ causation. R²=r² SLR only. df=n−2. No extrapolation.
Critical values
Level | z two-tail | z upper
10% | ±1.645 | 1.282
5% | ±1.96 | 1.645
1% | ±2.576 | 2.326
5-step answer template
  1. State H₀, H₁ and the distribution under H₀.
  2. Check all conditions for the chosen method.
  3. Calculate the test statistic or CI — show full working.
  4. Compare to the critical value, or compare the p-value with α.
  5. Conclude in context — state what the result means.
Example — proportion test conclusion

"H₀: π=0.3, H₁: π≠0.3. Use π₀=0.3 in SE (not p). Z=(p−0.3)/√(0.3×0.7/n). α=5% two-tailed → reject if |Z|>1.96. Conclude: at 5% significance there is [sufficient/insufficient] evidence that the true proportion differs from 0.3."

11 · Key Terms Glossary

Every important term from ST107, organised by chapter.
Frequency density
Frequency divided by class width. Must be used on a histogram y-axis when class widths differ. In a histogram, area (not height) represents frequency. A wider bar that looks tall does not contain more data — frequency density corrects for this.
Descriptive Stats
Corrected sum of squares S_xx
S_xx = Σ(x_i − x̄)² = Σx_i² − nx̄². The shortcut avoids computing each deviation individually. Foundation for sample variance s², and for regression statistics S_xy and S_yy.
Descriptive Stats
Sample variance s²
s² = S_xx/(n−1). Divides by n−1 (not n) to be an unbiased estimator of σ². Units are squared — use s = √s² for interpretation in original units. Why n−1? Because deviations are measured from x̄ (estimated), not from the true μ.
Descriptive Stats
IQR and outlier rule
IQR = Q₃ − Q₁, the spread of the middle 50%. Outlier if > 1.5×IQR below Q₁ or above Q₃. More robust than range because it ignores the extremes. Extreme outlier threshold: > 3×IQR from box.
Descriptive Stats
Skewness — positive and negative
Positive/right skew: mean > median, long tail to right. Negative/left skew: mean < median, long tail to left. Symmetric: mean = median. The mean is pulled toward the long tail by extreme values, while the median is resistant to outliers.
Descriptive Stats
Boxplot
Shows five key features: median (centre line), Q₁ (bottom of box), Q₃ (top of box), whiskers (furthest non-outlier values), and outliers (separate points more than 1.5×IQR from the box). Enables rapid visual reading of location, spread, skewness, and outliers.
Descriptive Stats
Modal class vs modal value
Modal value: the single most frequently occurring raw observation (ungrouped data). Modal class: the class interval with the highest frequency (grouped data). For grouped data, only the modal class can be identified — the exact modal value is unknown.
Descriptive Stats
Sample space S
The complete set of all possible outcomes of an experiment. P(S) = 1. An event is any subset of S. Every probability calculation is relative to S — the probability of an event can never exceed 1.
Probability
Mutually exclusive events
Events that cannot both occur in the same trial: P(A ∩ B) = 0. Example: rolling even AND odd on one die. If A occurs, B definitely did not. Mutually exclusive events with P(A)>0 and P(B)>0 are NOT independent — they are actually dependent.
Probability
Independent events
One event gives no information about the other: P(A|B) = P(A), equivalently P(A∩B) = P(A)P(B). Example: two separate die rolls. Independence is about information content, not about whether events can co-occur.
Probability
Conditional probability P(A|B)
The probability of A given that B has occurred: P(A|B) = P(A∩B)/P(B). The conditioning event B becomes the restricted sample space. Denominator is ALWAYS P(B) — the most common error is using P(A) in the denominator instead.
Probability
Bayes' theorem
Reverses conditional probability: P(A|B) = P(B|A)P(A)/P(B). The denominator P(B) is usually found using the total probability formula first. Classic example: P(disease|positive test) can be surprisingly small when disease prevalence (base rate) is low.
Probability
Permutation ⁿPᵣ
An ordered arrangement of r objects from n: ⁿPᵣ = n!/(n−r)!. Use when order/position/rank matters: arrange, queue, first/second/third. Example: 3 prizes from 10 people → ¹⁰P₃ = 720.
Probability
Combination ⁿCᵣ
A selection of r from n where order is irrelevant: ⁿCᵣ = n!/[r!(n−r)!]. Use when only who is selected matters: choose, committee, group, hand. Example: committee of 4 from 9 → ⁹C₄ = 126.
Probability
Total probability formula
P(A) = Σ P(A|Bᵢ)P(Bᵢ) for a partition B₁,…,Bₙ (mutually exclusive and collectively exhaustive). Sums over all routes that lead to A. Usually the first step before applying Bayes' theorem.
Probability
Binomial distribution
X ~ Bin(n,π): number of successes in n independent Bernoulli trials with constant success probability π. Four conditions: fixed n, fixed π, two outcomes, independence. Mean=nπ, Var=nπ(1−π). X takes n+1 values: 0,1,…,n. P(X=x) = C(n,x)πˣ(1−π)ⁿ⁻ˣ.
Discrete Dists
Poisson distribution
X ~ Pois(λ): random events in a continuous medium (time, area, volume). P(X=x) = e⁻λλˣ/x!. Mean = variance = λ — this equality is the defining characteristic. No upper bound on x. Always scale λ to match the unit in the question before computing.
Discrete Dists
CDF and probability wording
F(x) = P(X≤x). Key translations: "at most x" = F(x); "fewer than x" = F(x−1); "at least x" = 1−F(x−1); "more than x" = 1−F(x). The "at least x" case uses x−1 in the CDF — this is the most commonly confused. Recover: P(X=x) = F(x)−F(x−1).
Discrete Dists
Probability density function (PDF)
f(x): describes the relative likelihood of continuous outcomes. NOT a probability — f(x) can exceed 1. Probability = area under f(x): P(a<X<b) = ∫f(x)dx over (a,b). Since P(X=x)=0 at any single point, endpoint inclusion never matters.
Continuous Dists
Standard normal Z ~ N(0,1)
All normal probabilities found by standardising: Z=(X−μ)/σ for individual X; Z=(X̄−μ)/(σ/√n) for sample mean. Tables give Φ(z)=P(Z≤z). Symmetry: P(Z<−z)=1−Φ(z). Most critical difference from individual standardisation: dividing by σ/√n, not σ.
Continuous Dists
Central Limit Theorem
For large n (≥30 rule of thumb), X̄ ≈ N(μ, σ²/n) regardless of the population distribution. Exact if population is normal. CLT applies to the SAMPLE MEAN X̄ — not to individual observations. More skewed populations need larger n for the approximation to be good.
Continuous Dists
Exponential distribution Exp(λ)
Models waiting times. f(x)=λe⁻λˣ, F(x)=1−e⁻λˣ. Mean=1/λ, Var=1/λ². Memoryless: P(X>s+t|X>s) = P(X>t). Paired with Poisson: if arrivals are Pois(λ) events per unit time, inter-arrival times are Exp(λ).
Continuous Dists
Standard error S.E.
The standard deviation of an estimator — its variability across repeated samples. S.E.(X̄)=σ/√n; S.E.(P)=√[π(1−π)/n]. Decreases with larger n. Distinct from E.S.E. (estimated S.E.) which substitutes s or p for unknown σ or π.
Sampling
Chi-squared distribution χ²_k
Non-negative, right-skewed. E(χ²_k)=k, Var(χ²_k)=2k. Key result: (n−1)S²/σ² ~ χ²_{n-1} when sampling from normal population. Also the test statistic distribution for association and goodness-of-fit tests. Critical values from UPPER tail only.
Sampling
Sampling distribution
The probability distribution of a statistic over all possible samples of size n. Describes how an estimator varies from sample to sample. Three key sampling distributions: X̄ ~ N(μ,σ²/n); P ~ N(π,π(1−π)/n); (n−1)S²/σ² ~ χ²_{n-1}.
Sampling
Bias
Bias(θ̂) = E(θ̂)−θ. Systematic error — how far the estimator's average is from the truth. Positive: overestimates. Zero: unbiased. Bias does not decrease with larger n (unlike variance). A biased estimator can be preferred if its MSE is smaller.
Estimation
Mean Squared Error (MSE)
MSE(θ̂) = Var(θ̂) + [Bias(θ̂)]². Captures both random variability and systematic error. For unbiased estimators, MSE = Var. The preferred criterion for comparing estimators. A low-variance biased estimator can have smaller MSE than a high-variance unbiased one.
Estimation
Confidence interval
An interval (L,U) from sample data such that, if the sampling procedure is repeated many times, a specified percentage (e.g. 95%) of constructed intervals would contain the true parameter. NOT "there is 95% probability this specific interval contains θ" — once constructed, it either does or doesn't.
Estimation
Student's t-distribution
Used when σ is unknown and replaced by s. Bell-shaped, symmetric, but fatter tails than N(0,1). As ν→∞, t_ν→N(0,1). Always use t_{n-1} (not z) when estimating σ from data. df = n−1 for single sample; n₁+n₂−2 for two-sample equal-variance.
Estimation
MVUE
Minimum Variance Unbiased Estimator: the unbiased estimator with the smallest variance (= smallest MSE). X̄ is MVUE for μ; S² for σ²; P for π (large n). More efficient than estimators using fewer observations (like the average of just first and last values).
Estimation
Null hypothesis H₀
The default "no effect" statement — always contains equality (=, ≤, ≥). Assumed true; evidence must be strong to overturn it. Test statistic distribution is derived under H₀. Examples: H₀: μ=50; H₀: π=0.3; H₀: π₁=π₂. Never "accept H₀" — only "fail to reject" it.
Hypothesis Tests
Type I / Type II errors
Type I error (α): rejecting H₀ when it is true (false positive). Type II error (β): failing to reject H₀ when it is false (false negative). Power = 1−β = P(correctly reject false H₀). Trade-off: reducing α increases β. Power increases with larger n.
Hypothesis Tests
P-value
P(observing result as extreme or more extreme | H₀ true). Reject H₀ if p-value < α. NOT the probability that H₀ is true. Two-tailed p = 2 × one-tail probability. Smaller p-value = stronger evidence against H₀.
Hypothesis Tests
Pooled proportion p̂ (two-proportion test)
Under H₀: π₁=π₂, estimate the common proportion using: p̂=(r₁+r₂)/(n₁+n₂). Use p̂ in the test statistic SE. Do NOT use p₁ and p₂ separately — the test assumes they are equal under H₀. Contrast: CI uses p₁ and p₂ separately in the ESE.
Hypothesis Tests
Paired samples test
When the same subject is measured twice, form dᵢ=xᵢ−yᵢ and apply a one-sample t-test on the differences. H₀: μ_d=0. T=d̄/(s_d/√n) ~ t_{n-1}. More powerful than independent two-sample test when pairing is appropriate, because between-subject variability is removed.
Hypothesis Tests
Pearson's r
r = S_xy/√(S_xx S_yy). Ranges from −1 to +1. Dimensionless and scale-invariant. r=0 does NOT mean no relationship — could be strong non-linear relationship. Measures linear association only. In simple linear regression: R²=r².
Regression
Least squares regression
Finds ŷ=β̂₀+β̂₁x minimising Σ(yᵢ−ŷᵢ)². Slope β̂₁=S_xy/S_xx. Intercept β̂₀=ȳ−β̂₁x̄. Line always passes through (x̄,ȳ). Slope interpretation: predicted y changes by β̂₁ for each 1-unit increase in x.
Regression
R² (coefficient of determination)
Proportion of total variation in y explained by the regression: R²=ESS/TSS. In SLR only: R²=r². R²=0.72 means 72% of variation in y is explained. Does NOT imply causation; does NOT capture non-linear fit quality.
Regression
Residual ε̂_i
Difference between observed and fitted y: ε̂ᵢ=yᵢ−ŷᵢ. Σε̂ᵢ=0 for OLS. Standardised residual > 2: suspicious; > 3: likely outlier. Residual plots detect: outliers, non-linearity, heteroscedasticity.
Regression
Chi-squared association test
Tests H₀: two categorical variables are independent. χ²=Σ(O−E)²/E, E_ij=(row_i total×col_j total)/N. df=(r−1)(c−1). Always upper-tailed. Requires all E_ij≥5. Large χ² → observed counts far from independence prediction → reject H₀.
Chi-Squared Tests
12

Key Equations & Formulas

Every formula from ST107, grouped by chapter, with worked examples and exam notes.
Chapter 1 — Descriptive Statistics
Sample mean
\[\bar{x}=\frac{\sum x_i}{n}\]
Frequency data: x̄=Σf_kx_k/Σf_k. Mean is pulled by outliers; median is not. Mean > median → right skew.
Corrected sum of squares
\[S_{xx}=\sum x_i^2-n\bar{x}^2\]
Shortcut avoids computing (xᵢ−x̄)² one by one. Used in s², regression slope, correlation coefficient.
Sample variance
\[s^2=\frac{S_{xx}}{n-1}\]
Divides by n−1 for unbiasedness. Units are squared. s=√s² is in original units for interpretation.
Frequency density
\[\text{Freq.density}=\frac{\text{Frequency}}{\text{Class width}}\]
REQUIRED on histogram y-axis when class widths differ. Area = frequency. Wider class not necessarily more data.
IQR and outlier boundary
\[\text{IQR}=Q_3-Q_1,\quad \text{outlier if more than }1.5\times\text{IQR beyond the box}\]
Lower fence: Q₁−1.5×IQR. Upper fence: Q₃+1.5×IQR. IQR captures middle 50%; range captures everything including extremes.
Grouped median
\[\text{Median}=L+w\times\frac{\text{needed}}{\text{class freq}}\]
L=lower bound, w=class width. Find which class contains the n/2-th observation using cumulative frequencies, then interpolate.
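A short sketch of the interpolation in Python; the class edges and frequencies below are invented purely to illustrate the steps:

```python
# Hypothetical grouped data: class edges [0,10), [10,20), [20,30)
bounds = [0, 10, 20, 30]
freqs = [5, 12, 8]
n = sum(freqs)                          # 25, so the median is the 12.5th value

cum = 0
for i, f in enumerate(freqs):
    if cum + f >= n / 2:                # median class found
        L = bounds[i]
        w = bounds[i + 1] - bounds[i]
        needed = n / 2 - cum            # observations still needed in this class
        median = L + w * needed / f
        break
    cum += f

print(median)                           # 10 + 10 * 7.5/12 = 16.25
```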
Chapter 2 — Probability
Equally likely P(A)
\[P(A)=\frac{n}{N}\]
Only valid when all outcomes are equally likely. n=favourable, N=total. Check this assumption first.
Permutations
\[{}^nP_r=\frac{n!}{(n-r)!}\]
Order matters: arrange, rank, queue. ⁶P₄=6×5×4×3=360. Always ≥ ⁿCᵣ.
Combinations
\[{}^nC_r=\frac{n!}{r!(n-r)!}\]
Order irrelevant: choose, committee, group. ⁶C₄=15. ⁿCᵣ = ⁿPᵣ/r! since r! arrangements of same r items all count as one.
Additive law
\[P(A\cup B)=P(A)+P(B)-P(A\cap B)\]
Subtract the intersection to avoid double-counting. If A and B are mutually exclusive, P(A∩B)=0 and the law simplifies to P(A)+P(B).
Complement
\[P(A^c)=1-P(A)\]
Powerful for "at least one": P(≥1 success)=1−P(no success). Often much simpler than computing directly.
Conditional probability
\[P(A|B)=\frac{P(A\cap B)}{P(B)}\]
Denominator = P(B) always. Restricts sample space to B. Common error: using P(A) in denominator.
Bayes' theorem
\[P(A|B)=\frac{P(B|A)\,P(A)}{P(B)}\]
Reverse conditional: "given test positive, what's P(disease)?" Use total probability for P(B) first. Base rate matters — even an accurate test gives low P(disease|+) when prevalence is low.
Total probability
\[P(A)=\sum_i P(A|B_i)P(B_i)\]
For a partition of S. On a tree: multiply along each branch to A, then sum across all branches leading to A.
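The classic base-rate illustration, sketched in Python; the prevalence, sensitivity and false-positive rate are hypothetical numbers, not from the course:

```python
# Hypothetical screening-test numbers to illustrate Bayes + total probability
p_disease = 0.01                 # prevalence, P(D)
p_pos_given_d = 0.95             # sensitivity, P(+|D)
p_pos_given_not_d = 0.05         # false-positive rate, P(+|D^c)

# Total probability: P(+) = P(+|D)P(D) + P(+|D^c)P(D^c)
p_pos = p_pos_given_d * p_disease + p_pos_given_not_d * (1 - p_disease)

# Bayes: P(D|+) = P(+|D)P(D) / P(+)
p_d_given_pos = p_pos_given_d * p_disease / p_pos
print(round(p_d_given_pos, 3))   # ≈ 0.161: low despite an accurate test
```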
Chapter 3 — Discrete Distributions
Binomial pmf
\[P(X=x)=\binom{n}{x}\pi^x(1-\pi)^{n-x}\]
X ~ Bin(n,π). Mean=nπ, Var=nπ(1−π). Four conditions: fixed n, fixed π, two outcomes, independence. X takes 0,1,…,n (n+1 values).
Poisson pmf
\[P(X=x)=\frac{e^{-\lambda}\lambda^x}{x!}\]
Mean=variance=λ (key property). Models counts of random events in a continuous medium (time, length, area). No upper bound. Always scale λ to match the question's units.
CDF wording translations
\[P(X\ge x)=1-F(x-1),\quad P(X>x)=1-F(x)\]
"At least x" uses x−1 in the CDF. "More than x" uses x. The off-by-one is the most common exam error in this chapter.
Poisson approx. to binomial
\[\text{Bin}(n,\pi)\approx\text{Pois}(\lambda=n\pi)\]
When n is large (n>30) and nπ<10, i.e. π extreme (near 0; for π near 1, count failures instead). Avoids large binomial coefficients. The approximation improves as π shrinks.
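To see the quality of the approximation, a small comparison sketch (n and π chosen to satisfy the conditions above):

```python
from scipy import stats

n, pi = 100, 0.03                      # large n, small pi, so lambda = n*pi = 3
lam = n * pi

for x in range(6):
    exact = stats.binom.pmf(x, n, pi)
    approx = stats.poisson.pmf(x, lam)
    print(x, round(exact, 4), round(approx, 4))   # pairs agree to ~2-3 d.p.
```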
Chapter 4 — Continuous Distributions
Probability from PDF
\[P(a<X\le b)=\int_a^b f(x)\,dx=F(b)-F(a)\]
P(X=x)=0. Endpoints don't matter. f(x) is density not probability. Use CDF difference to avoid integration when F is available.
Uniform U(a,b)
\[f(x)=\frac{1}{b-a},\quad E(X)=\frac{a+b}{2},\quad \text{Var}=\frac{(b-a)^2}{12}\]
P(c<X<d)=(d−c)/(b−a): probability is proportional to interval length, so no integration is needed.
Exponential Exp(λ)
\[P(X>x)=e^{-\lambda x},\quad E(X)=\frac{1}{\lambda},\quad \text{Var}=\frac{1}{\lambda^2}\]
Memoryless property. Paired with Poisson: if arrivals are Pois(λ), waiting times are Exp(λ).
Standardisation — X
\[Z=\frac{X-\mu}{\sigma}\sim N(0,1)\]
P(X≤x)=Φ((x−μ)/σ). Tables give Φ(z) for z≥0; use Φ(−z)=1−Φ(z) for negative arguments.
Standardisation — X̄ (CLT)
\[Z=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\sim N(0,1)\]
Divide by σ/√n NOT σ — the most common exam error. Exact for normal population; approximate (n≥30) otherwise.
Chapter 5 — Sampling Distributions
Sampling dist. of X̄
\[E(\bar{X})=\mu,\quad \text{Var}(\bar{X})=\frac{\sigma^2}{n},\quad \text{S.E.}=\frac{\sigma}{\sqrt{n}}\]
Var is in squared units; S.E. is in original units. Doubling n reduces S.E. by √2, not 2. Halving S.E. requires 4× larger n.
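A simulation sketch of the scaling rule (all parameter values invented): quadrupling n should halve the empirical standard error.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0, 2

for n in (25, 100):                              # 4x the sample size
    means = rng.normal(mu, sigma, size=(20_000, n)).mean(axis=1)
    print(n, means.std(), sigma / np.sqrt(n))    # empirical vs theoretical S.E.
```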
Sampling dist. of P
\[E(P)=\pi,\quad \text{Var}(P)=\frac{\pi(1-\pi)}{n},\quad Z=\frac{P-\pi}{\sqrt{\pi(1-\pi)/n}}\]
Use π (not p) in theoretical variance. Large n approximation. S.E.(P) = √[π(1−π)/n].
Chi-squared for S²
\[\frac{(n-1)S^2}{\sigma^2}\sim\chi^2_{n-1}\]
Valid only for normal populations. Transform inequality to get P(χ²_{n-1} > value) and use chi-squared table. E(χ²_k)=k.
Chapter 6 — Point Estimation
Bias
\[\text{Bias}(\hat\theta)=E(\hat\theta)-\theta\]
>0: overestimates. =0: unbiased. Does not decrease with n — systematic error. Standard estimators: E(X̄)=μ, E(P)=π, E(S²)=σ².
MSE decomposition
\[\text{MSE}(\hat\theta)=\text{Var}(\hat\theta)+[\text{Bias}(\hat\theta)]^2\]
Compare MSE (not just bias) when choosing estimators. Biased estimator can win if Var is much smaller. For unbiased: MSE=Var.
Non-linear transform warning
\[E(\hat\theta^2)=\text{Var}(\hat\theta)+\theta^2>\theta^2\quad(\hat\theta\text{ unbiased})\]
Unbiasedness is NOT preserved under non-linear transforms. S² unbiased for σ², but S NOT unbiased for σ. X̄ unbiased for μ, but 1/X̄ generally not unbiased for 1/μ.
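A quick simulation illustrating the warning (σ and n are arbitrary): the average of the sample variances lands on σ², but the average of the sample standard deviations falls short of σ.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, n, reps = 3, 5, 50_000
samples = rng.normal(0, sigma, size=(reps, n))

s2 = samples.var(axis=1, ddof=1)   # sample variances (ddof=1 divides by n-1)
print(s2.mean(), sigma**2)         # ≈ 9: S^2 is unbiased for sigma^2
print(np.sqrt(s2).mean(), sigma)   # < 3: S is biased low for sigma
```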
Chapter 7 — Interval Estimation
Single mean CI (σ known)
\[\bar{x}\pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\]
Use only when σ is given exactly. 95% → z=1.96; 99% → z=2.576; 90% → z=1.645.
Single mean CI (σ unknown)
\[\bar{x}\pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}\]
ν=n−1 df. t-distribution has fatter tails → wider CI than z for same confidence level.
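A minimal sketch of the t-based interval on a made-up sample:

```python
import numpy as np
from scipy import stats

x = np.array([4.9, 5.3, 5.1, 4.7, 5.4, 5.0])   # hypothetical sample
n, xbar, s = len(x), x.mean(), x.std(ddof=1)

t_crit = stats.t.ppf(0.975, df=n - 1)           # t, not z: sigma estimated by s
half = t_crit * s / np.sqrt(n)
print(xbar - half, xbar + half)                 # 95% CI for mu
```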
Single proportion CI
\[p\pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}\]
Uses p in ESE (contrast with hypothesis test which uses π₀). Large n approximation required.
Sample size — mean
\[n\ge\frac{z_{\alpha/2}^2\,\sigma^2}{e^2}\]
e=desired margin of error. ALWAYS ROUND UP. 95%, σ=0.05, e=0.01: n≥96.04 → n=97.
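The worked calculation above, reproduced as a two-line sketch (same illustrative values):

```python
import math
from scipy import stats

z = stats.norm.ppf(0.975)        # 1.96 for 95% confidence
sigma, e = 0.05, 0.01
n = (z * sigma / e) ** 2
print(n, math.ceil(n))           # 96.04 -> round UP to 97
```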
Sample size — proportion
\[n\ge\frac{z_{\alpha/2}^2\,p(1-p)}{e^2}\]
Use p=0.5 if no pilot estimate — maximises p(1−p)=0.25 giving largest (conservative) sample size. ROUND UP.
Pooled variance s²_p
\[s_p^2=\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}\]
df=n₁+n₂−2. Use when equal variances assumed. Weights by degrees of freedom for better estimate of common σ².
Paired samples CI
\[d_i=x_i-y_i,\quad \bar{d}\pm t_{\alpha/2,n-1}\frac{s_d}{\sqrt{n}}\]
Reduce to one-sample problem. df=n−1 (n=number of pairs). Use when same subject measured twice.
Chapter 8 — Hypothesis Testing
Single mean test (σ known)
\[Z=\frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}}\sim N(0,1)\]
Example: x̄=1570, μ₀=1600, σ=120, n=100: Z=(1570−1600)/12=−2.5. Two-tailed 5%: reject if |Z|>1.96. Yes, reject.
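The same worked example as a sketch, with a two-tailed p-value added for comparison against α=0.05:

```python
from scipy import stats

xbar, mu0, sigma, n = 1570, 1600, 120, 100
z = (xbar - mu0) / (sigma / n**0.5)        # (1570-1600)/12 = -2.5

p_value = 2 * stats.norm.cdf(-abs(z))      # two-tailed
print(z, p_value)                          # -2.5, p ≈ 0.012 < 0.05: reject H0
```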
Single mean test (σ unknown)
\[T=\frac{\bar{X}-\mu_0}{s/\sqrt{n}}\sim t_{n-1}\]
ν=n−1. Look up t critical value, not z. Use when σ is estimated from data.
Single proportion test
\[Z=\frac{P-\pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}\sim N(0,1)\]
CRITICAL: use π₀ (null value) in SE denominator — NOT p. Under H₀ we assume π=π₀ is true. Differs from CI ESE.
Two-proportion pooled test
\[\hat{p}=\frac{r_1+r_2}{n_1+n_2},\quad Z=\frac{p_1-p_2}{\sqrt{\hat{p}(1-\hat{p})(1/n_1+1/n_2)}}\]
Under H₀: π₁=π₂, pool both samples. Do NOT use p₁,p₂ separately in test SE. Pooled p̂ reflects equality assumption.
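A sketch with hypothetical counts r₁, r₂, n₁, n₂, showing the pooled SE in action:

```python
from scipy import stats

r1, n1 = 45, 200                 # hypothetical successes / sample sizes
r2, n2 = 30, 200
p1, p2 = r1 / n1, r2 / n2

p_hat = (r1 + r2) / (n1 + n2)    # pooled proportion under H0: pi1 = pi2
se = (p_hat * (1 - p_hat) * (1/n1 + 1/n2)) ** 0.5
z = (p1 - p2) / se

p_value = 2 * stats.norm.cdf(-abs(z))
print(z, p_value)
```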
Paired test
\[T=\frac{\bar{d}}{s_d/\sqrt{n}}\sim t_{n-1},\quad d_i=x_i-y_i\]
H₀: μ_d=0. Compute differences first, then apply one-sample t-test. ν=n−1 pairs.
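A sketch on invented before/after measurements; scipy.stats.ttest_rel should reproduce the hand-computed statistic:

```python
import numpy as np
from scipy import stats

before = np.array([12.1, 11.4, 13.0, 12.6, 11.9])   # hypothetical paired data
after  = np.array([11.5, 11.0, 12.2, 12.4, 11.3])

d = before - after
t = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))    # one-sample t on differences
print(t)

# scipy's paired test gives the same statistic plus a two-sided p-value
print(stats.ttest_rel(before, after))
```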
Chapter 9 — Chi-Squared Tests
Expected frequency
\[E_{ij}=\frac{R_i\times C_j}{N}\]
Computed under H₀ (independence). All E_ij must be ≥5 for chi-squared approximation to be valid.
Chi-squared statistic
\[\chi^2=\sum_{i,j}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}\]
Always ≥0. Large = evidence against H₀ (independence). Always upper-tailed. Cells with large contributions identify nature of association.
Degrees of freedom
\[\nu=(r-1)(c-1)\text{ (association)},\quad \nu=k-1-m\text{ (GOF)}\]
Association: r rows, c columns. GOF: k categories, m parameters estimated from data. 2×2 table: ν=1.
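Putting the three formulas together on a hypothetical 2×3 table (correction=False so scipy matches the hand formula; scipy otherwise applies Yates' correction, though only to 2×2 tables):

```python
import numpy as np
from scipy import stats

# Hypothetical 2x3 contingency table of observed counts
obs = np.array([[30, 20, 10],
                [20, 30, 40]])

chi2, p, df, expected = stats.chi2_contingency(obs, correction=False)
print(chi2, df, p)        # df = (2-1)(3-1) = 2; upper-tail p-value
print(expected)           # E_ij = row_i total * col_j total / N; check all >= 5
```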
Chapter 10 — Correlation and Regression
Sum-of-squares shortcut
\[S_{xy}=\sum x_iy_i-\frac{\sum x_i\sum y_i}{n}\]
Also: S_xx=Σx²−(Σx)²/n, S_yy=Σy²−(Σy)²/n. Compute all three first in any regression question.
Correlation coefficient
\[r=\frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}},\quad -1\le r\le 1\]
r=0 ≠ no relationship. Scale-invariant. Correlation ≠ causation. In SLR: R²=r².
Regression coefficients
\[\hat\beta_1=\frac{S_{xy}}{S_{xx}},\quad \hat\beta_0=\bar{y}-\hat\beta_1\bar{x}\]
Line passes through (x̄,ȳ). β̂₁ interpretation: predicted y changes by β̂₁ per unit increase in x.
Residual SS and R²
\[RSS=S_{yy}-\frac{S_{xy}^2}{S_{xx}},\quad R^2=1-\frac{RSS}{S_{yy}}\]
TSS=S_yy, ESS=S_xy²/S_xx. TSS=ESS+RSS. R²=r² in SLR only. Interpret as proportion of variation explained.
Residual variance + slope test
\[s^2=\frac{RSS}{n-2},\quad T=\frac{\hat\beta_1}{s/\sqrt{S_{xx}}}\sim t_{n-2}\]
df=n−2 (two parameters). H₀: β₁=0 (no linear relationship). CI: β̂₁±t_{α/2,n-2}×s/√S_xx. Avoid extrapolation.
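An end-to-end sketch on invented data, computing everything in this chapter from the three sum-of-squares shortcuts:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # hypothetical data
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])
n = len(x)

# Sum-of-squares shortcuts
Sxx = (x**2).sum() - x.sum()**2 / n
Syy = (y**2).sum() - y.sum()**2 / n
Sxy = (x*y).sum() - x.sum()*y.sum() / n

b1 = Sxy / Sxx                    # slope
b0 = y.mean() - b1 * x.mean()     # intercept; line passes through (xbar, ybar)
RSS = Syy - Sxy**2 / Sxx
R2 = 1 - RSS / Syy

s2 = RSS / (n - 2)                # residual variance, two parameters estimated
t = b1 / (s2**0.5 / Sxx**0.5)     # test H0: beta1 = 0
p = 2 * stats.t.sf(abs(t), df=n - 2)
print(b1, b0, R2, t, p)
```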