| Type | Meaning | Examples |
|---|---|---|
| Discrete | Counted — non-negative integers | Number of passengers, calls per day |
| Continuous | Measured — can take decimals | Height, weight, time |
| Nominal categorical | Categories with no natural order | Political party, eye colour |
| Ordinal categorical | Categories with a natural order | Dissatisfied / indifferent / satisfied |
Key distinction: measurable variables have a recognised measurement method; categorical variables classify observations into groups.
| Relationship | Skew type | Tail direction |
|---|---|---|
| Mean > median | Positive / right skew | Long tail to the right |
| Mean < median | Negative / left skew | Long tail to the left |
| Mean = median | Symmetric | Balanced |
Why n−1? Dividing by n−1 instead of n makes s² an unbiased estimator of σ². It corrects for the fact that sample deviations are measured around the sample mean, not the true population mean.
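A quick numerical check of the two divisors — Python's `statistics` module uses n−1 for `variance` and n for `pvariance`; the data are invented for illustration:

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
mean = sum(data) / n

# Divide by n: biased (matches statistics.pvariance)
var_n = sum((x - mean) ** 2 for x in data) / n
# Divide by n - 1: unbiased (matches statistics.variance)
var_n1 = sum((x - mean) ** 2 for x in data) / (n - 1)

print(var_n, var_n1)  # 4.0 vs ~4.571 — the n divisor is systematically smaller
```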
Outlier rule (boxplot): a value is an outlier if it lies more than 1.5×IQR below Q₁ or above Q₃.
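The 1.5×IQR rule can be sketched as below. Note that quartile conventions differ between textbooks and software, so the exact fences depend on the method chosen (here `statistics.quantiles` with `method="inclusive"`); the data are invented:

```python
import statistics

data = [300, 350, 354, 360, 364, 368, 370, 372, 375, 380, 500]

# "inclusive" interpolates quartiles within the observed data points
q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # [300, 500]
```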
When class widths are unequal, use frequency density on the y-axis: frequency density = frequency ÷ class width.
A wider class that looks tall on a raw-frequency y-axis doesn't contain more data — it's just wider. Frequency density corrects for this so area (not height) represents frequency.
| Class | Width | Frequency | Freq. density |
|---|---|---|---|
| [300, 360) | 60 | 6 | 6/60 = 0.100 |
| [360, 380) | 20 | 14 | 14/20 = 0.700 |
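The density column is just frequency divided by class width; a one-liner confirms the table:

```python
# (interval, frequency) pairs from the table above
classes = [((300, 360), 6), ((360, 380), 14)]

densities = [freq / (hi - lo) for (lo, hi), freq in classes]
print(densities)  # [0.1, 0.7]
```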
| Feature | What it shows |
|---|---|
| Middle line | Median |
| Bottom of box | Q₁ (lower quartile) |
| Top of box | Q₃ (upper quartile) |
| Box length | IQR |
| Whiskers | Furthest non-outlier values |
| Separate points | Outliers (>1.5×IQR from box) |
Boxplots allow quick visual reading of median, quartiles, IQR, range, outliers, and skewness. If the median is closer to Q₁, the distribution is right-skewed; closer to Q₃ implies left-skewed.
Stem-and-leaf diagrams preserve the raw data while showing the distribution. The stem is usually tens/hundreds; the leaf is units.
Data: 350, 354, 364, 368 → Stem 35 | 0 4 and Stem 36 | 4 8
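A minimal sketch of building a stem-and-leaf display (stem = tens, leaf = units), extending the example data with a few invented values:

```python
from collections import defaultdict

data = [350, 354, 364, 368, 371, 375, 375, 382]

stems = defaultdict(list)
for x in sorted(data):
    stems[x // 10].append(x % 10)  # stem = value without its units digit

for stem in sorted(stems):
    print(f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}")  # e.g. "35 | 0 4"
```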
| Diagram | Best for |
|---|---|
| Dot plot | Small datasets — shows clustering and gaps clearly |
| Histogram | Frequency distribution of discrete or continuous data |
| Stem-and-leaf | Preserving raw data while showing distribution |
| Boxplot | Comparing distributions; spotting outliers and skewness |
A good diagram is clear, honest, highlights patterns, and allows the reader to extract information quickly. A bad diagram confuses or misleads.
A dataset can be unimodal (one mode), bimodal (two modes), or multimodal (several modes).
An experiment is a process with uncertain outcomes. The sample space S is the set of all possible outcomes. An event is any subset of S.
| Experiment | Sample space S |
|---|---|
| Toss a coin | {H, T} |
| Roll a die | {1, 2, 3, 4, 5, 6} |
Probability rules: \(0 \le P(A) \le 1\). If \(P(A) = 0\) the event is impossible; if \(P(A) = 1\) it is certain.
Classical (equally likely outcomes) probability: \(P(A) = \dfrac{n}{N}\), where n = favourable outcomes, N = total equally likely outcomes.
| Symbol | Meaning |
|---|---|
| \(A \cup B\) | A or B (union) |
| \(A \cap B\) | A and B (intersection) |
| \(A^c\) | Not A (complement) |
| \(A \mid B\) | A given B (conditional) |
Mutual exclusivity and independence are NOT the same. Mutually exclusive events with positive probability are actually dependent — knowing one occurred tells you the other didn't. If A and B are mutually exclusive and P(A) > 0 and P(B) > 0, then \(P(A \cap B) = 0 \ne P(A)P(B)\), so they CANNOT be independent.
Arranging 4 people in a queue from 6: \({}^6P_4 = 6\times5\times4\times3 = 360\)
Choosing 4 people from 6: \({}^6C_4 = \frac{6!}{4!\,2!} = 15\)
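Both counts are available directly in Python's standard library:

```python
import math

arrangements = math.perm(6, 4)  # ordered selections: 6*5*4*3
selections = math.comb(6, 4)    # unordered: 6!/(4! 2!)
print(arrangements, selections)  # 360 15
```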
Suppose a disease has prevalence 1% (P(D) = 0.01). A test is 99% sensitive (P(+|D) = 0.99) and 95% specific (P(−|D^c) = 0.95, so P(+|D^c) = 0.05).
Despite the highly accurate test, a positive result only gives a ~17% probability of disease because the disease is rare (low base rate). False positives dominate when the disease prevalence is low. This is the classic base-rate fallacy.
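The base-rate calculation can be checked numerically from the prevalence, sensitivity and specificity quoted above:

```python
p_d = 0.01              # prevalence, P(D)
p_pos_given_d = 0.99    # sensitivity, P(+|D)
p_pos_given_not_d = 0.05  # 1 - specificity, P(+|D^c)

# Total probability of a positive test, then Bayes' theorem
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 4))  # 0.1667 — only ~17% despite the "accurate" test
```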
For any partition B₁, B₂, …, Bₙ of S (mutually exclusive and collectively exhaustive): \(P(A) = \sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)\).
This is most naturally read from a tree diagram: multiply probabilities along branches, then add across branches that lead to A.
| Distribution | Mean | Variance | P(X = x) |
|---|---|---|---|
| Discrete Uniform(k) | \(\dfrac{k+1}{2}\) | \(\dfrac{k^2-1}{12}\) | \(\dfrac{1}{k}\) |
| Bernoulli(π) | \(\pi\) | \(\pi(1-\pi)\) | \(\pi^x(1-\pi)^{1-x}\) |
| Binomial(n, π) | \(n\pi\) | \(n\pi(1-\pi)\) | \(\binom{n}{x}\pi^x(1-\pi)^{n-x}\) |
| Poisson(λ) | \(\lambda\) | \(\lambda\) | \(\dfrac{e^{-\lambda}\lambda^x}{x!}\) |
Poisson key feature: mean = variance = λ. Checking whether the sample mean and sample variance are approximately equal is the standard diagnostic for a Poisson model.
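A numerical sanity check that the Poisson pmf has mean = variance = λ, using λ = 3.2 from the breakdowns example below and truncating the infinite support far into the tail:

```python
import math

lam = 3.2
# Poisson pmf; terms beyond x = 60 are negligible for lambda = 3.2
pmf = [math.exp(-lam) * lam ** x / math.factorial(x) for x in range(60)]

mean = sum(x * p for x, p in enumerate(pmf))
var = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))
print(round(mean, 4), round(var, 4))  # 3.2 3.2
```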
X ~ Bin(n, π) applies when ALL four conditions hold:
| Condition | Meaning |
|---|---|
| Two outcomes | Success or failure only |
| Fixed probability | Same π every trial |
| Fixed number of trials | n is known before the experiment |
| Independent trials | Each trial does not affect the next |
A binomial random variable has n+1 possible values (0 through n), not n.
Models the number of random events in a continuous medium (time, area, distance, volume).
| Context | λ |
|---|---|
| Average 3.2 breakdowns per week; question about 1 week | λ = 3.2 |
| Same; question about 2 weeks | λ = 6.4 |
| Same; question about half a week | λ = 1.6 |
| Wording | Mathematical form |
|---|---|
| Exactly x | \(P(X = x)\) |
| At most x | \(P(X \le x) = F(x)\) |
| Fewer than x | \(P(X < x) = P(X \le x-1) = F(x-1)\) |
| More than x | \(P(X > x) = 1 - F(x)\) |
| At least x | \(P(X \ge x) = 1 - F(x-1)\) |
Most common error: "at least x" is \(1 - F(x-1)\), NOT \(1 - F(x)\). The "at least" boundary is included.
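The boundary conventions can be verified against a hand-rolled binomial CDF (the parameters n = 10, π = 0.3, x = 4 are illustrative):

```python
import math

def binom_cdf(x, n, pi):
    """F(x) = P(X <= x) for X ~ Bin(n, pi)."""
    return sum(math.comb(n, k) * pi ** k * (1 - pi) ** (n - k) for k in range(x + 1))

n, pi, x = 10, 0.3, 4
at_most = binom_cdf(x, n, pi)            # P(X <= 4) = F(4)
fewer_than = binom_cdf(x - 1, n, pi)     # P(X < 4)  = F(3)
at_least = 1 - binom_cdf(x - 1, n, pi)   # P(X >= 4) = 1 - F(3), boundary INCLUDED
more_than = 1 - binom_cdf(x, n, pi)      # P(X > 4)  = 1 - F(4)
print(round(at_least, 4), round(more_than, 4))  # at_least > more_than
```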
If X ~ Bin(n, π), approximate with X ~ Pois(λ) where λ = nπ when n is large and π is small (rule of thumb: n > 30 and nπ < 10).
Example: n = 100, π = 0.02 → λ = nπ = 2, so use Pois(2).
Poisson is simpler here because you don't need to compute \(\binom{100}{x}\) for each x.
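A side-by-side comparison of the exact binomial probabilities and the Poisson approximation for n = 100, π = 0.02:

```python
import math

n, pi = 100, 0.02
lam = n * pi  # 2.0

exact = [math.comb(n, x) * pi ** x * (1 - pi) ** (n - x) for x in range(5)]
approx = [math.exp(-lam) * lam ** x / math.factorial(x) for x in range(5)]

for x in range(5):
    # e.g. x = 0: exact ~0.1326 vs approx ~0.1353 — close agreement
    print(x, round(exact[x], 4), round(approx[x], 4))
```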
The expected value is the probability-weighted average — NOT a simple average of all possible values, since unlikely values get less weight.
| Scenario | Distribution |
|---|---|
| Fixed n trials, two outcomes, fixed π, independence | Binomial(n, π) |
| Random events in continuous medium (time/area), independence | Poisson(λ) |
| Single trial with success/failure | Bernoulli(π) |
| k equally likely outcomes | Discrete Uniform(k) |
| Binomial with n > 30, nπ < 10, small π | Approximate with Poisson(nπ) |
Is there a fixed upper bound on the number of events? If yes → Binomial. If events could theoretically continue indefinitely → Poisson. Example: number of calls in one hour has no fixed upper bound → Poisson.
| Distribution | Mean | Variance |
|---|---|---|
| Uniform U(a, b) | \(\dfrac{a+b}{2}\) | \(\dfrac{(b-a)^2}{12}\) |
| Exponential Exp(λ) | \(\dfrac{1}{\lambda}\) | \(\dfrac{1}{\lambda^2}\) |
| Normal N(μ, σ²) | \(\mu\) | \(\sigma^2\) |
For a continuous random variable: \(P(X = x) = 0\) for every single value x.
This is not because the value is impossible — it's because there are infinitely many possible decimal values, so any single point has zero probability mass. Therefore \(P(a \le X \le b) = P(a < X < b) = \int_a^b f(x)\,dx\).
Endpoints make no difference. Probability comes from area under the PDF, not height.
The median m satisfies F(m) = 0.5. The mode is where f(x) is maximised.
| Given | To find | Operation |
|---|---|---|
| PDF f(x) | CDF F(x) | Integrate |
| CDF F(x) | PDF f(x) | Differentiate |
The most important distribution in statistics. Bell-shaped, symmetric.
Used for waiting times and times between events in a Poisson process.
If arrivals follow a Poisson process with rate λ events per unit time, then the time between consecutive events follows Exp(λ). The two distributions are naturally paired.
The exponential distribution has no memory: P(X > s+t | X > s) = P(X > t). If a machine has lasted s hours, the probability it lasts another t hours is the same as it was at the start. Past survival gives no information about future survival.
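Memorylessness can be checked directly from the exponential survival function \(P(X > x) = e^{-\lambda x}\); the values of λ, s and t below are arbitrary:

```python
import math

lam = 0.5  # arbitrary rate

def survival(x):
    """P(X > x) for X ~ Exp(lam)."""
    return math.exp(-lam * x)

s, t = 3.0, 2.0
conditional = survival(s + t) / survival(s)  # P(X > s+t | X > s)
print(round(conditional, 6), round(survival(t), 6))  # equal: memorylessness
```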
One of the most powerful results in statistics: for i.i.d. observations with mean μ and variance σ², the sample mean satisfies \(\bar{X} \approx N\!\left(\mu, \dfrac{\sigma^2}{n}\right)\) for large n.
Rule of thumb: n ≥ 30 is usually sufficient. This holds regardless of the shape of the original population distribution.
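A small simulation sketch of the CLT: sample means of a heavily right-skewed Exponential(1) population (μ = 1, σ = 1) already behave like N(μ, σ²/n) at n = 30. The seed and repetition count are arbitrary choices:

```python
import random
import statistics

random.seed(42)
n, reps = 30, 20000

# Each entry is the mean of n draws from a skewed Exponential(1) population
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

print(round(statistics.fmean(means), 3))  # close to mu = 1
print(round(statistics.stdev(means), 3))  # close to sigma/sqrt(n) ~ 0.183
```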
| Feature | Discrete | Continuous |
|---|---|---|
| P(X = x) | Can be positive | Always 0 |
| Probability tool | pmf p(x) | pdf f(x) — density, not probability |
| Interval prob. | Sum p(x) | Integrate f(x) — area under curve |
| CDF shape | Step function (jumps) | Smooth increasing function |
| Endpoint inclusion | Matters (< vs ≤) | Doesn't matter (zero prob at points) |
| Population quantity | Symbol | Sample counterpart | Symbol |
|---|---|---|---|
| Mean | μ | Sample mean | \(\bar{x}\) |
| Variance | σ² | Sample variance | s² |
| Standard deviation | σ | Sample s.d. | s |
| Proportion | π | Sample proportion | p |
The sampling distribution describes how the estimator varies across all possible samples.
The result \(\bar{X} \sim N(\mu, \sigma^2/n)\) is exact if the population is normal; approximate for large n by the CLT.
Standard error = standard deviation of the estimator. Decreases as n increases — larger samples give more precise estimates.
Critical exam point: standardise with \(Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}\) — divide by σ/√n, NOT by σ. The standard error accounts for averaging over n observations.
Let R = number of successes in n trials, so R ~ Bin(n, π). The sample proportion is P = R/n.
Use π (the true population proportion) in the theoretical variance, not p (the sample proportion). When π is unknown in practice, p is substituted, but always use the correct notation in theoretical derivations.
If data come from N(μ, σ²): \(\dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\). Equivalently, since \(S_{xx} = \sum_i (x_i - \bar{x})^2 = (n-1)S^2\): \(\dfrac{S_{xx}}{\sigma^2} \sim \chi^2_{n-1}\).
The chi-squared distribution is always non-negative and right-skewed. Different values of degrees of freedom k give different shapes — as k increases, the distribution approaches normality.
n = 15, σ² = 2, find P(S² > 4): since \(\dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\), \(P(S^2 > 4) = P\!\left(\chi^2_{14} > \dfrac{14 \times 4}{2}\right) = P(\chi^2_{14} > 28)\), which tables place between 0.01 and 0.025.
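This tail probability can also be computed without tables. For even degrees of freedom k = 2m, the chi-squared survival function has a closed form via the Gamma(m, 2) (Erlang) distribution, which a short script can evaluate:

```python
import math

def chi2_sf_even(x, k):
    """P(X > x) for X ~ chi-squared with EVEN k degrees of freedom,
    using the Erlang/Gamma(k/2, 2) closed form."""
    m = k // 2
    return math.exp(-x / 2) * sum((x / 2) ** i / math.factorial(i) for i in range(m))

n, sigma2 = 15, 2
threshold = (n - 1) * 4 / sigma2  # (n-1)*4/sigma^2 = 28.0
p = chi2_sf_even(threshold, n - 1)
print(round(p, 4))  # 0.0142 — indeed between 0.01 and 0.025
```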
| Estimator | Standard error | Decreases with |
|---|---|---|
| \(\bar{X}\) | \(\sigma/\sqrt{n}\) | Larger n |
| P | \(\sqrt{\pi(1-\pi)/n}\) | Larger n |
Doubling n reduces the standard error by a factor of √2, not 2. To halve the standard error, you need to quadruple n.
These three results underpin all of Chapters 6–8 (estimation and hypothesis testing). Master them before moving on.
| Bias value | Type | Meaning |
|---|---|---|
| > 0 | Positive bias | Estimator systematically overestimates θ |
| = 0 | Unbiased | Correct on average over repeated samples |
| < 0 | Negative bias | Estimator systematically underestimates θ |
Bias is the systematic component of error; variance is the random component. MSE captures both: \(\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2\).
| Parameter | Unbiased estimator | Result |
|---|---|---|
| μ (mean) | \(\bar{X}\) | \(E(\bar{X}) = \mu\) |
| π (proportion) | P = R/n | E(P) = π |
| σ² (variance) | S² | E(S²) = σ² |
This is why we divide by n−1 in the sample variance formula — it makes S² unbiased for σ². Dividing by n would give a biased (downward) estimator.
The sampling error from one specific sample. Unknown in practice (since θ is unknown), but its distribution is the sampling distribution studied in Chapter 5.
For estimating μ with three estimators from a sample of size n:
| Estimator | Bias | Variance | MSE |
|---|---|---|---|
| T₁ = X̄ | 0 | σ²/n | σ²/n |
| T₂ = (X₁+Xₙ)/2 | 0 | σ²/2 | σ²/2 |
| T₃ = X̄ + 3 | 3 | σ²/n | σ²/n + 9 |
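The MSE column in the table follows from MSE = Var + Bias²; plugging in concrete (illustrative) values σ² = 4, n = 10:

```python
sigma2, n = 4.0, 10  # illustrative values

# estimator name -> (bias, variance), as in the table above
estimators = {
    "T1 = Xbar":      (0.0, sigma2 / n),
    "T2 = (X1+Xn)/2": (0.0, sigma2 / 2),
    "T3 = Xbar + 3":  (3.0, sigma2 / n),
}

# MSE = variance + bias^2
mse = {name: var + bias ** 2 for name, (bias, var) in estimators.items()}
print(mse)  # T1: 0.4, T2: 2.0, T3: 9.4 — T1 wins
```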
Among all unbiased estimators of θ, the MVUE has the smallest variance.
So among unbiased estimators, smallest variance → smallest MSE → MVUE.
If θ̂₁ and θ̂₂ are both unbiased estimators of θ, and Var(θ̂₁) < Var(θ̂₂), then θ̂₁ is more efficient than θ̂₂.
A relative efficiency (defined as \(\text{Var}(\hat{\theta}_2)/\text{Var}(\hat{\theta}_1)\)) greater than 1 means θ̂₁ is more efficient.
If θ̂ is unbiased for θ, it does NOT follow that θ̂² is unbiased for θ² — a subtle but frequently tested point.
Since \(E(\hat{\theta}^2) = \text{Var}(\hat{\theta}) + [E(\hat{\theta})]^2 = \text{Var}(\hat{\theta}) + \theta^2\) and Var(θ̂) > 0, we get E(θ̂²) > θ², so θ̂² overestimates θ².
S² is unbiased for σ². But S = √S² is NOT unbiased for σ. The transformation breaks unbiasedness. Similarly, X̄ is unbiased for μ, but 1/X̄ is generally not unbiased for 1/μ.
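A simulation sketch of the bias of S for σ (seed, sample size and repetition count are arbitrary): over many small normal samples, the average of S falls systematically below σ even though the average of S² matches σ².

```python
import random
import statistics

random.seed(1)
sigma, n, reps = 1.0, 5, 20000

# Sample standard deviation S of many small normal samples
s_values = [statistics.stdev(random.gauss(0, sigma) for _ in range(n))
            for _ in range(reps)]

print(round(statistics.fmean(s_values), 3))  # ~0.94, clearly below sigma = 1
```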
Think of a target: variance is how scattered the shots are; bias is how far from the bullseye the centre of the shots is. MSE captures both simultaneously.
| Confidence level | z multiplier |
|---|---|
| 90% | 1.645 |
| 95% | 1.96 |
| 99% | 2.576 |
Higher confidence → wider interval. Lower confidence → narrower but less reliable interval.
A 95% CI does NOT mean there is 95% probability that this specific interval contains θ. Once calculated, θ is either in it or not. Correct: if we repeated the sampling many times and built a CI each time, about 95% of those intervals would contain θ.
If no pilot estimate for p: use p = 0.5 (maximises p(1−p) = 0.25, giving the most conservative/largest sample size).
Always round UP — even 96.04 becomes 97. Rounding down would make the margin of error too large.
Example: 95% CI, σ = 0.05, error ≤ 0.01: \(n \ge \left(\dfrac{1.96 \times 0.05}{0.01}\right)^2 = 96.04 \Rightarrow n = 97\).
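The round-up step in code, using the numbers from the example above:

```python
import math

z = 1.96      # 95% multiplier
sigma = 0.05
error = 0.01  # required bound on the margin of error

n_exact = (z * sigma / error) ** 2  # 96.04
n = math.ceil(n_exact)              # always round UP, never down
print(round(n_exact, 2), n)         # 96.04 97
```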
When σ is unknown, we estimate it with s and use the interval \(\bar{x} \pm t_{n-1}\,\dfrac{s}{\sqrt{n}}\), with the t multiplier taken at the appropriate α/2 point.
The t distribution is bell-shaped and symmetric like the normal, but has fatter tails. As ν → ∞, t_ν → N(0,1).
| Situation | Degrees of freedom ν |
|---|---|
| Single mean | n − 1 |
| Paired samples | n − 1 |
| Two means, equal variance | n₁ + n₂ − 2 |
| Regression slope/intercept | n − 2 |
Each estimated parameter costs one degree of freedom.
Use paired when the same subject is measured twice (e.g. before/after treatment). Use independent when different subjects are in each group. Pairing reduces variability by removing between-subject differences.
| Situation | Distribution | Why |
|---|---|---|
| σ known | z (standard normal) | Exact result from sampling theory |
| σ unknown, estimate with s | t_{n-1} | Extra uncertainty from estimating σ |
| Large n, σ unknown | z (approximately) | t_ν → N(0,1) as ν → ∞ |
| Two means, equal variances | t_{n1+n2-2} | Pooled variance estimated |
Hypothesis testing chooses between two competing statements about a population parameter.
| H₁ wording | Test type | Critical region |
|---|---|---|
| "Different from" (≠) | Two-tailed | Both tails |
| "Greater than" (>) | Upper-tailed | Right tail only |
| "Less than" (<) | Lower-tailed | Left tail only |
If direction is unspecified, default to two-tailed.
| True state | Decision | Outcome |
|---|---|---|
| H₀ true | Don't reject H₀ | ✓ Correct |
| H₀ true | Reject H₀ | ✗ Type I error (α) |
| H₁ true | Don't reject H₀ | ✗ Type II error (β) |
| H₁ true | Reject H₀ | ✓ Correct (Power = 1−β) |
Decreasing α (making it harder to reject H₀) increases β. There is a trade-off between the two error types.
| α | Critical values |
|---|---|
| 10% | ±1.645 |
| 5% | ±1.96 |
| 1% | ±2.576 |
| α | Critical value |
|---|---|
| 10% | 1.282 |
| 5% | 1.645 |
| 1% | 2.326 |
Lower-tailed critical values are the mirror image of the upper-tailed ones: multiply by −1 (e.g., 5% lower-tail → −1.645).
| Test type | P-value formula |
|---|---|
| Two-tailed | \(2P(Z \ge |z_{obs}|)\) |
| Upper-tailed | \(P(Z \ge z_{obs})\) |
| Lower-tailed | \(P(Z \le z_{obs})\) |
Reject H₀ if p-value < α. Do not reject if p-value ≥ α.
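P-values for z tests can be computed from the standard normal CDF, which the standard library gives via the error function (z_obs = 2.1 is an illustrative observed statistic):

```python
import math

def phi(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z_obs = 2.1  # illustrative observed test statistic

p_two = 2 * (1 - phi(abs(z_obs)))  # two-tailed
p_upper = 1 - phi(z_obs)           # upper-tailed
p_lower = phi(z_obs)               # lower-tailed
print(round(p_two, 4))             # 0.0357 — reject at 5%, do not reject at 1%
```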
For testing H₀: π = π₀, the test statistic uses π₀ in the denominator, not p: \(Z = \dfrac{p - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}\).
Why? Because under H₀, we assume π = π₀ is true, so we use the null value π₀ in the standard error. This differs from the confidence interval, which uses p in the ESE.
Under H₀: π₁ = π₂, both samples estimate the same common proportion π. Estimate it with the pooled proportion \(\hat{p} = \dfrac{r_1 + r_2}{n_1 + n_2}\) (total successes over total trials).
Use this pooled estimate in the test statistic standard error. Do NOT use p₁ and p₂ separately in the denominator when testing H₀: π₁ = π₂.
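A sketch of the pooled two-proportion z test with invented counts:

```python
import math

# Hypothetical counts: r successes out of n trials in each sample
r1, n1 = 45, 100
r2, n2 = 30, 100
p1, p2 = r1 / n1, r2 / n2

# Pooled estimate of the common proportion under H0: pi1 = pi2
p_pool = (r1 + r2) / (n1 + n2)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))  # pooled SE, not p1/p2 separately
z = (p1 - p2) / se
print(round(z, 3))  # 2.191 > 1.96, so reject H0 at the 5% level (two-tailed)
```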
| | Cat. 1 | Cat. 2 | Cat. 3 | Row total |
|---|---|---|---|---|
| Group A | O₁₁ | O₁₂ | O₁₃ | R₁ |
| Group B | O₂₁ | O₂₂ | O₂₃ | R₂ |
| Col. total | C₁ | C₂ | C₃ | N (grand) |
All expected frequencies E_ij must be ≥ 5 for the chi-squared approximation to be reliable. If some cells have E < 5, merge adjacent categories or use Fisher's exact test.
Under H₀ (independence), observed and expected frequencies should be similar. The test statistic measures the total discrepancy: \(\chi^2 = \sum_i \sum_j \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}\), where \(E_{ij} = \dfrac{R_i C_j}{N}\).
Chi-squared values are always ≥ 0 because they square the differences. So the test is always upper-tailed — large values reject H₀.
After rejecting H₀, look at the per-cell contributions \((O_{ij} - E_{ij})^2 / E_{ij}\) to χ².
Cells with large contributions explain the nature of the association. Example: if area A has much higher burglary than expected, burglary is the main problem in area A.
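A worked sketch on an invented 2×3 table: expected counts from the margins, per-cell contributions, and the total χ² statistic:

```python
# Hypothetical 2x3 contingency table of observed counts
observed = [[30, 20, 10],
            [20, 30, 40]]

row_totals = [sum(row) for row in observed]        # R_i
col_totals = [sum(col) for col in zip(*observed)]  # C_j
grand = sum(row_totals)                            # N

chi2 = 0.0
contributions = []
for i, row in enumerate(observed):
    contrib_row = []
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand  # E_ij = R_i * C_j / N
        contrib_row.append((o - e) ** 2 / e)       # per-cell contribution
    contributions.append(contrib_row)
    chi2 += sum(contrib_row)

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r-1)(c-1)
print(round(chi2, 3), df)  # 16.667 2 — large contributions flag the key cells
```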
Tests whether observed data follow a specified distribution.
H₀: die is fair → E_i = n/6 for each face. ν = 6 − 1 = 5.
For the association test, ν = (r−1)(c−1). For a 2×2 table: ν = 1. For a 3×3 table: ν = 4. Larger tables need larger χ² to reject H₀.
| Value of r | Interpretation |
|---|---|
| Near +1 | Strong positive linear relationship |
| Near −1 | Strong negative linear relationship |
| Near 0 | Weak / no linear relationship |
In simple linear regression only: \(R^2 = r^2\). This equivalence does NOT hold in multiple regression.
Interpretation: R² = 0.72 means 72% of the variation in y is explained by the linear relationship with x. The remaining 28% is unexplained (random noise).
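The quantities r, R² = r² and the least-squares slope all follow from the summary sums S_xx, S_yy, S_xy; a sketch with invented data:

```python
import math

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

r = sxy / math.sqrt(sxx * syy)  # correlation coefficient
r_squared = r ** 2              # equals R^2 in SIMPLE linear regression only
slope = sxy / sxx               # least-squares slope b1
print(round(r, 4), round(r_squared, 4), round(slope, 4))
```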
Uses n−2 degrees of freedom because two parameters (β₀ and β₁) are estimated.
Rejecting H₀ means evidence of a significant linear relationship between x and y.
If the CI contains 0, the slope may not differ significantly from zero.
Prediction becomes less reliable as x₀ moves further from x̄. Prediction variance increases with distance from x̄.
Avoid extrapolation: do not make predictions outside the observed range of x. The linear relationship may not continue, and the regression model was not built to describe those regions.
Standardised residuals above 2 are suspicious. Values above 3 strongly suggest an outlier. Residual plots help detect curvature (non-linearity), a funnel shape (non-constant variance), and isolated extreme points (outliers).
| Feature | Correlation | Regression |
|---|---|---|
| Purpose | Measure strength of linear relationship | Model and predict y from x |
| Symmetric? | Yes (r is same for x,y or y,x) | No (regress y on x ≠ regress x on y) |
| Units? | Dimensionless (−1 to 1) | β₁ in units of y per unit of x |
| Scale invariant? | Yes | No — rescaling x changes β₁ |
Correlation does not imply causation. Ice cream sales and sunscreen sales are positively correlated — both driven by warm weather. A significant slope in regression also does not prove causation.
| Level | z two-tail | z upper |
|---|---|---|
| 10% | ±1.645 | 1.282 |
| 5% | ±1.96 | 1.645 |
| 1% | ±2.576 | 2.326 |
"H₀: π=0.3, H₁: π≠0.3. Use π₀=0.3 in SE (not p). Z=(p−0.3)/√(0.3×0.7/n). α=5% two-tailed → reject if |Z|>1.96. Conclude: at 5% significance there is [sufficient/insufficient] evidence that the true proportion differs from 0.3."