Reading Notes for Regression and Other Stories (GHV)
1 Overview
1.1 The three challenges of statistics
1.2 Why learn regression?
1.3 Some examples of regression
1.3.1 A randomized experiment on the effect of an educational television program
1.3.2 Comparing the peacekeeping and gun-control studies
1.4 Challenges in building, understanding, and interpreting regressions
1.4.1 Regression to estimate a relationship of interest
1.4.2 Regression to adjust for differences between treatment and control groups
1.4.3 Interpreting coefficients in a predictive model
1.4.4 Building, interpreting, and checking regression models
1.5 Classical and Bayesian inference
1.5.1 Information
1.5.2 Assumptions
1.5.3 Classical inference
1.5.4 Bayesian inference
1.6 Computing least squares and Bayesian regression
2 Data and Measurement
2.1 Examining where data come from
2.1.1 Details of measurement can be important
2.2 Validity and reliability
2.2.1 Validity
2.2.2 Reliability
2.2.3 Sample selection
2.3 All graphs are comparisons
2.3.1 Simple scatterplots
2.3.2 Displaying more information on a graph
2.3.3 Multiple plots
2.3.4 Grids of plots
2.3.5 Applying graphical principles to numerical displays and communication more generally
2.3.6 Graphics for understanding statistical models
2.3.7 Graphs as comparisons
2.3.8 Graphs of fitted models
2.4 Data and adjustment: trends in mortality rates
3 Some Basic Methods in Mathematics and Probability
3.1 Weighted averages
3.2 Vectors and matrices
3.3 Graphing a line
3.4 Exponential and power-law growth and decline; logarithmic and log-log relationships
3.5 Probability distributions
3.5.1 Mean and standard deviation of a probability distribution
3.5.2 Normal distribution; mean and standard deviation
3.5.3 Linear transformations
3.5.4 Mean and standard deviation of the sum of correlated random variables
3.5.5 Lognormal distribution
3.5.6 Binomial distribution
3.5.7 Poisson distribution
3.5.8 Unclassified probability distribution
3.5.9 Probability distributions for error
3.5.10 Comparing distributions
3.6 Probability modeling
3.6.1 Using an empirical forecast
3.6.2 Using a reasonable-seeming but inappropriate probability model
3.6.3 General lessons for probability modeling
4 Statistical Inference
4.1 Sampling distributions and generative models
4.1.1 Sampling, measurement error, and model error
4.1.2 The sampling distribution
4.2 Estimates, standard errors, and confidence intervals
4.2.1 Parameters, estimands, and estimates
4.2.2 Standard errors, inferential uncertainty, and confidence intervals
4.2.3 Standard errors and confidence intervals for averages and proportions
4.2.4 Standard error and confidence interval for a proportion when \(y = 0\) or \(y = n\)
4.2.5 Standard error for a comparison
4.2.6 Sampling distribution of the sample mean and standard deviation; normal and \(\chi^2\) distributions
4.2.7 Degrees of freedom
4.2.8 Confidence intervals from the \(t\) distribution
4.2.9 Inference for discrete data
4.2.10 Linear transformations
4.2.11 Comparisons, visual and numerical
4.2.12 Weighted averages
4.3 Bias and unmodeled uncertainty
4.3.1 Bias in estimation
4.3.2 Adjusting inferences to account for bias and unmodeled uncertainty
4.4 Statistical significance, hypothesis testing, and statistical errors
4.4.1 Statistical significance
4.4.2 Hypothesis testing for simple comparisons
4.4.3 Hypothesis testing: general formulation
4.4.4 Comparisons of parameters to fixed values and each other: interpreting confidence intervals as hypothesis tests
4.4.5 Type 1 and type 2 errors and why we don’t like talking about them
4.4.6 Type M (magnitude) and type S (sign) errors
4.4.7 Hypothesis testing and statistical practice
4.5 Problems with the concept of statistical significance
4.5.1 Statistical significance is not the same as practical importance
4.5.2 Non-significance is not the same as zero
4.5.3 The difference between “significant” and “not significant” is not itself statistically significant
4.5.4 Researcher degrees of freedom, \(p\)-hacking, and forking paths
4.5.5 The statistical significance filter
4.6 Moving beyond hypothesis testing
5 Simulation
5.1 Simulation of discrete probability models
5.1.1 How many girls in 400 births?
5.1.2 Accounting for twins
5.2 Simulation of continuous and mixed discrete/continuous models
5.2.1 Simulation in R using custom-made functions
5.3 Summarizing a set of simulations using median and median absolute deviation
5.4 Bootstrapping to simulate a sampling distribution
5.4.1 Choices in defining the bootstrap distribution
5.4.2 Limitations of bootstrapping
5.5 Fake-data simulation as a way of life
6 Background on regression modeling
6.1 Regression models
6.2 Fitting a simple regression to fake data
6.2.1 Fitting a regression and displaying the results
6.2.2 Comparing estimates to assumed parameter values
6.3 Interpret coefficients as comparisons, not effects
6.4 Historical origins of regression
6.4.1 Daughters’ heights “regressing” to the mean
6.4.2 Fitting the model in R
6.5 The paradox of regression to the mean
6.5.1 How regression to the mean can confuse people about causal inference; demonstration using fake data
6.5.2 Relation of “regression to the mean” to the larger themes of the book
7 Linear regression with a single predictor
7.1 Example: predicting presidential vote share from the economy
7.1.1 Fitting a linear model to data
7.1.2 Understanding the fitted model
7.1.3 Graphing the fitted regression line
7.1.4 Using the model to predict
7.2 Checking the model-fitting procedure using fake-data simulation
7.2.1 Step 1: Creating the pretend world
7.2.2 Step 2: Simulating fake data
7.2.3 Step 3: Fitting the model and comparing fitted to assumed values
7.2.4 Step 4: Embedding the simulation in a loop
7.3 Formulating comparisons as regression models
7.3.1 Estimating the mean is the same as regressing on a constant term
7.3.2 Estimating a difference is the same as regressing on an indicator variable
8 Fitting regression models
8.1 Least squares, maximum likelihood, and Bayesian inference
8.1.1 Least squares
8.1.2 Estimation of residual standard deviation \(\sigma\)
8.1.3 Computing the sum of squares directly
8.1.4 Maximum likelihood
8.1.5 Where do the standard errors come from? Using the likelihood surface to assess uncertainty in the parameter estimates
8.1.6 Bayesian inference
8.1.7 Point estimate, mode-based approximation, and posterior simulations
8.2 Influence of individual points in a fitted regression
8.3 Least squares slope as a weighted average of slopes of pairs
8.4 Comparing two fitting functions: lm and stan_glm
8.4.1 Reproducing maximum likelihood using stan_glm with flat priors and optimization
8.4.2 Running lm
8.4.3 Confidence intervals, uncertainty intervals, compatibility intervals
9 Prediction and Bayesian inference
9.1 Propagating uncertainty in inference using posterior simulations
9.1.1 Uncertainty in the regression coefficients and implied uncertainty in the regression line
9.1.2 Using the matrix of posterior simulations to express uncertainty about a parameter estimate or function of parameter estimates
9.2 Prediction and uncertainty: predict, posterior_linpred, and posterior_predict
9.2.1 Point prediction using predict
9.2.2 Linear predictor with uncertainty using posterior_linpred or posterior_epred
9.2.3 Predictive distribution for a new observation using posterior_predict
9.2.4 Prediction given a range of input values
9.2.5 Propagating uncertainty
9.2.6 Simulating uncertainty for the linear predictor and new observations
9.3 Prior information and Bayesian synthesis
9.3.1 Expressing data and prior information on the same scale
9.3.2 Bayesian information aggregation
9.3.3 Different ways of assigning prior distributions and performing Bayesian calculations
9.4 Example of Bayesian inference: beauty and sex ratio
9.4.1 Prior information
9.4.2 Prior estimate and standard error
9.4.3 Data estimate and standard error
9.4.4 Bayes estimate
9.4.5 Understanding the Bayes estimate
9.5 Uniform, weakly informative, and informative priors in regression
9.5.1 Uniform prior distribution
9.5.2 Default prior distribution
9.5.3 Weakly informative prior distribution based on subject-matter knowledge
9.5.4 Example where an informative prior makes a difference: Beauty and sex ratio
10 Linear regression with multiple predictors
10.1 Adding predictors to a model
10.1.1 Starting with a binary predictor
10.1.2 A single continuous predictor
10.1.3 Including both predictors
10.1.4 Understanding the fitted model
10.2 Interpreting regression coefficients
10.2.1 It’s not always possible to change one predictor while holding all others constant
10.2.2 Counterfactual and predictive interpretations
10.3 Interactions
10.3.1 When should we look for interactions?
10.3.2 Interpreting regression coefficients in the presence of interactions
10.4 Indicator variables
10.4.1 Centering a predictor
10.4.2 Including a binary variable in a regression
10.4.3 Using indicator variables for multiple levels of a categorical predictor
10.4.4 Changing the baseline factor level
10.4.5 Using an index variable to access a group-level predictor
10.5 Formulating paired or blocked designs as a regression problem
10.5.1 Completely randomized experiment
10.5.2 Paired design
10.5.3 Block design
10.6 Example: uncertainty in predicting congressional elections
10.6.1 Background
10.6.2 Data issues
10.6.3 Fitting the model
10.6.4 Simulation for inferences and predictions of new data points
10.6.5 Predictive simulation for a nonlinear function of new data
10.6.6 Combining simulation and analytic calculations
10.7 Mathematical notation and statistical inference
10.7.1 Predictors
10.7.2 Regression in vector-matrix notation
10.7.3 Two ways of writing the model
10.7.4 Nonidentified parameters, collinearity, and the likelihood function
10.7.5 Hypothesis testing: why we do not like \(t\) tests and \(F\) tests
10.8 Weighted regression
10.8.1 Three models leading to weighted regression
10.8.2 Using a matrix of weights to account for correlated errors
10.9 Fitting the same model to many datasets
10.9.1 Predicting party identification
11 Assumptions, diagnostics, and model evaluation
11.1 Assumptions of regression analysis
11.1.1 Failures of the assumptions
11.1.2 Causal inference
11.2 Plotting the data and fitted model
11.2.1 Displaying a regression line as a function of one input variable
11.2.2 Displaying two fitted regression lines
11.2.3 Displaying uncertainty in the fitted regression
11.2.4 Displaying using one plot for each input variable
11.2.5 Plotting the outcome vs. a continuous predictor
11.2.6 Forming a linear predictor from a multiple regression
11.3 Residual plots
11.3.1 Using fake-data simulation to understand residual plots
11.3.2 A confusing choice: plot residuals vs. predicted values, or residuals vs. observed values?
11.3.3 Understanding the choice using fake-data simulation
11.4 Comparing data to replications from a fitted model
11.4.1 Example: simulation-based checking of a fitted normal distribution
11.5 Example: predictive simulation to check the fit of a time-series model
11.5.1 Fitting a first-order autoregression to the unemployment series
11.5.2 Simulating replicated datasets
11.5.3 Visual and numerical comparisons of replicated to actual data
11.6 Residual standard deviation \(\sigma\) and explained variance \(R^2\)
11.6.1 Difficulties in interpreting residual standard deviation and explained variance
11.6.2 Bayesian \(R^2\)
11.7 External validation: checking fitted model on new data
11.8 Cross validation
11.8.1 Leave-one-out cross validation
11.8.2 Fast leave-one-out cross validation
11.8.3 Summarizing prediction error using the log score and deviance
11.8.4 Overfitting and AIC
11.8.5 Interpreting differences in log scores
11.8.6 Demonstration of adding pure noise predictors to a model
11.8.7 \(K\)-fold cross validation
11.8.8 Demonstration of \(K\)-fold cross validation using simulated data
11.8.9 Concerns about model selection
12 Transformations and regression
12.1 Linear transformations
12.1.1 Scaling of predictors and regression coefficients
12.1.2 Standardization using z-scores
12.1.3 Standardization using an externally specified population distribution
12.1.4 Standardization using reasonable scales
12.2 Centering and standardizing for models with interactions
12.2.1 Centering by subtracting the mean of the data
12.2.2 Using a conventional centering point
12.2.3 Standardizing by subtracting the mean and dividing by 2 standard deviations
12.2.4 Why scale by 2 standard deviations?
12.2.5 Multiplying each regression coefficient by 2 standard deviations of its predictor
12.3 Correlation and “regression to the mean”
12.3.1 The principal component line and the regression line
12.3.2 Regression to the mean
12.4 Logarithmic transformations
12.4.1 Earnings and height example
12.4.2 Why we use natural log rather than log base 10
12.4.3 Building a regression model on the log scale
12.4.4 Further difficulties in interpretation
12.4.5 Log-log model: transforming the input and outcome variables
12.4.6 Taking logarithms even when not necessary
12.5 Other transformations
12.5.1 Square root transformations
12.5.2 Idiosyncratic transformations
12.5.3 Using continuous rather than discrete predictors
12.5.4 Using discrete rather than continuous predictors
12.5.5 Index and indicator variables
12.5.6 Indicator variables, identifiability, and the baseline condition
12.6 Building and comparing regression models for prediction
12.6.1 General principles
12.6.2 Example: predicting the yields of mesquite bushes
12.6.3 Using the Jacobian to adjust the predictive comparison after a transformation
12.6.4 Constructing a simpler model
12.7 Models for regression coefficients
12.7.1 Other models for regression coefficients
13 Logistic regression
13.1 Logistic regression with a single predictor
13.1.1 Example: modeling political preference given income
13.1.2 The logistic regression model
13.1.3 Fitting the model using stan_glm and displaying uncertainty in the fitted model
13.2 Interpreting logistic regression coefficients and the divide-by-4 rule
13.2.1 Evaluation at and near the mean of the data
13.2.2 The divide-by-4 rule
13.2.3 Interpretation of coefficients as odds ratios
13.2.4 Coefficient estimates and standard errors
13.2.5 Statistical significance
13.2.6 Displaying the results of several logistic regressions
13.3 Predictions and comparisons
13.3.1 Point prediction using predict
13.3.2 Linear predictor with uncertainty using posterior_linpred
13.3.3 Expected outcome with uncertainty using posterior_epred
13.3.4 Predictive distribution for a new observation using posterior_predict
13.3.5 Prediction given a range of input values
13.3.6 Logistic regression with just an intercept
13.3.7 Logistic regression with a single binary predictor
13.4 Latent-data formulation
13.4.1 Interpretation of the latent variables
13.4.2 Nonidentifiability of the latent scale parameter
13.5 Maximum likelihood and Bayesian inference for logistic regression
13.5.1 Maximum likelihood using iteratively weighted least squares
13.5.2 Bayesian inference with a uniform prior distribution
13.5.3 Default prior in stan_glm
13.5.4 Bayesian inference with some prior information
13.5.5 Comparing maximum likelihood and Bayesian inference using a simulation study
13.6 Cross validation and log score for logistic regression
13.6.1 Understanding the log score for discrete predictions
13.6.2 Log score for logistic regression
23 Exercises from Regression and Other Stories
23.1 Overview
23.1.1 From design to decision
23.1.2 Sketching a regression model and data
23.1.3 Goals of regression
23.1.4 Problems of statistics
23.1.5 Goals of regression
23.2 Causal inference
23.3 Statistics as generalization
23.4 Statistics as generalization
23.5 A problem with linear models
23.5.1 Working through your own example
23.6 Some basic methods in mathematics and probability
23.6.1 Weighted averages
23.6.2 Weighted averages
23.6.3 Probability distributions
23.6.4 Probability distributions
23.6.5 Probability distributions
23.6.6 Linear transformations
23.6.7 Linear transformations
23.6.8 Correlated random variables
23.6.9 Comparison of distributions
23.6.10 Working through your own example
References
Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. Cambridge University Press.