Statistical Rethinking: Chapter 5 Practice Answers

We use some of the code from the pymc-devs Python/pymc3 port of the Statistical Rethinking code examples.

5E1

(2) and (4). (3) is not a multiple linear regression because it has only one slope parameter, $\beta$. In the case of (2), the value of $\alpha$ is simply fixed at zero.

5E2

Since multiple regression gives you the relationship between a predictor and the outcome once we condition on the other predictors, it serves as a natural way to control for variables. As a result, we can simply define a model with latitude and plant diversity as predictors. I'm unsure how to define a prior for $\alpha$ because I don't know how animal or plant diversity is measured or what its bounds are. The $\beta$ parameters would be normally distributed, since in principle the correlations could be negative; in fact, my limited knowledge of ecology suggests that the closer latitude is to zero, the more plant diversity there is.

$$ A_i \sim \text{Normal}(\mu, \sigma)$$$$ \mu = \alpha + \beta_L L_i + \beta_P P_i$$$$ \alpha \sim ? $$$$ \beta_L \sim \text{Normal}(?, ?)$$$$ \beta_P \sim \text{Normal}(?, ?)$$$$ \sigma \sim \text{Exponential}(1)$$

5E3

To answer this question, we need 3 models. Two are simple linear regressions using just one predictor variable each (funding or size of laboratory). The third is the multiple regression model that uses both of those variables as predictors. I'll leave the priors out of the definitions. If the variables are positively associated, then the slope parameters will be positive.

Model 1

$$ T_i \sim \text{Normal}(\mu, \sigma)$$$$ \mu = \alpha + \beta_F F_i $$

Model 2

$$ T_i \sim \text{Normal}(\mu, \sigma)$$$$ \mu = \alpha + \beta_L L_i $$

Model 3

$$ T_i \sim \text{Normal}(\mu, \sigma)$$$$ \mu = \alpha + \beta_F F_i + \beta_L L_i $$

5E4

I'm unsure about this. (4) and (5) are definitely equivalent, but I'm not sure about the inferential equivalence of the other models.

5M1

This spurious correlation would occur when one of the predictors tells you a lot about the other one. For example, your height might be a predictor of how many points you score per game in basketball. Dollars earned from basketball might also be a predictor of the number of points you score per game. The idea is that once you use dollars earned and height as predictors together, the correlation between the outcome and one of the predictors should vanish. Let's imagine that negative points means conceding points.

Now let's define some models and do inference. First, let's predict dollars earned from points.
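As a sketch, we can simulate hypothetical data under this story (height as the common cause of both points and dollars; none of these numbers come from real basketball data) and fit the bivariate regression with ordinary least squares standing in for the full pymc3 model:

```python
import numpy as np

# Hypothetical simulation: height drives both points per game and
# dollars earned, so points and dollars end up correlated even though
# points play no causal role in dollars.
rng = np.random.default_rng(1)
n = 500
height = rng.normal(0, 1, n)      # standardized height
points = rng.normal(height, 1)    # height -> points
dollars = rng.normal(height, 1)   # height -> dollars

# Bivariate regression dollars ~ points via ordinary least squares
X = np.column_stack([np.ones(n), points])
beta_points, *_ = np.linalg.lstsq(X, dollars, rcond=None)
print(beta_points[1])  # clearly positive: the spurious association
```

The slope comes out clearly positive even though points have no direct effect on dollars in the simulation.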

Now let's predict dollars earned from height.

Finally, let's predict dollars earned from height and points together.
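A self-contained sketch of the comparison, again with hypothetical simulated data (height as the common cause of points and dollars) and ordinary least squares standing in for the pymc3 models:

```python
import numpy as np

# Hypothetical generative story: height is the common cause of both
# points scored and dollars earned; points do not cause dollars.
rng = np.random.default_rng(1)
n = 500
height = rng.normal(0, 1, n)
points = rng.normal(height, 1)
dollars = rng.normal(height, 1)

def ols(y, *predictors):
    """Least-squares coefficients with an intercept prepended."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b_single = ols(dollars, points)          # dollars ~ points
b_multi = ols(dollars, height, points)   # dollars ~ height + points
print(b_single[1], b_multi[2])  # the points slope shrinks toward zero
```

Once height is included, the coefficient on points collapses toward zero, which is exactly the vanishing correlation described above.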

Adapt the plots to use the second beta coefficient, which corresponds to points scored.

Note how the posterior ranges are much wider in the multiple regression, even admitting negative relationships between predictors and outcome, whereas the single-predictor linear regressions were far less equivocal.

5M2

As an example of a masked relationship, we will do multiple regression to predict climbing grades. The predictor variables will be finger strength and weight. Finger strength is positively associated with the grade someone can climb. Weight is negatively associated. But finger strength is positively associated with weight.

What we expect is that the bivariate relationship between finger strength and climbing grade will be unclear, but when multiple regression is used, the coefficient for finger strength will become strongly positive.
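We can first check this logic on hypothetical simulated data (none of these numbers come from real climbers), with ordinary least squares standing in for the full pymc3 models:

```python
import numpy as np

# Hypothetical masked relationship: finger strength helps climbing
# grade, weight hurts it, and strength rises with weight, so the two
# direct effects cancel in the bivariate view.
rng = np.random.default_rng(2)
n = 500
weight = rng.normal(0, 1, n)
strength = rng.normal(weight, 0.5)          # strength correlated with weight
grade = rng.normal(strength - weight, 0.5)  # opposing direct effects

def ols(y, *predictors):
    """Least-squares coefficients with an intercept prepended."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b_bivariate = ols(grade, strength)[1]          # muted slope
b_multiple = ols(grade, strength, weight)[1]   # strongly positive slope
print(b_bivariate, b_multiple)
```

The bivariate slope is muted because strength carries weight's negative effect along with it; conditioning on weight unmasks the strong positive effect of strength.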

I found an actual small dataset on this! Sure enough, the analysis bore out the expectation.

5M3

A high divorce rate might cause a higher marriage rate because once you divorce, you're back in the pool of people who can marry again. People are perhaps not so likely to divorce and then stay single forever; they probably divorce and remarry. Thus higher divorce rates might cause higher marriage rates in that sense.

Predict marriage rate using both divorce rate and age at marriage as predictors. If the divorce-rate coefficient is large, it suggests divorces play a role in marriages even when controlling for age at marriage.
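Written in the notation used earlier (priors omitted, as in 5E3), the model is:

$$ M_i \sim \text{Normal}(\mu, \sigma)$$$$ \mu = \alpha + \beta_D D_i + \beta_A A_i $$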

5M4
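
The model regresses divorce rate $D$ on marriage rate $M$, median age at marriage $A$, and the standardized proportion of LDS members $L$ (priors omitted, as in 5E3):

$$ D_i \sim \text{Normal}(\mu, \sigma)$$$$ \mu = \alpha + \beta_M M_i + \beta_A A_i + \beta_L L_i $$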

So we see that marriage rate is not a predictor of divorce rate once you control for median age at marriage and the proportion of LDS members. The LDS proportion and median age at marriage are quite predictive.

5M5

The two mechanisms:

- A higher price of petrol means less driving and thus more exercise.
- A higher price of petrol means less eating out and thus less food consumed.

We want to examine the role of each of these mechanisms. Multiple regression could do this by including two other predictor variables: frequency of eating out and amount of exercise. Ideally the exercise predictor would be the number of hours spent walking or cycling to commute, since you could still drive a lot and exercise a lot. It would also be good to include the amount of driving itself as a predictor: both of the mechanisms above are predicated on the assumption that the price of petrol actually decreases the amount of driving, and if that isn't the case, we'd need to generate other hypotheses.

5H1

The implied conditional independency is just $M \perp D \mid A$. To test whether the data are consistent with it, we create a multiple regression model predicting D using M and A as predictors, and we can also create one without A as a predictor. If the conditional independency holds, there will be no relationship between M and D once you control for A; otherwise there will be.
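As a sketch of the logic, here is the same check on hypothetical data simulated from the DAG D <- A -> M (no M -> D edge), with ordinary least squares standing in for the pymc3 models:

```python
import numpy as np

# Hypothetical data from D <- A -> M: age at marriage drives both
# marriage rate and divorce rate, with no direct M -> D edge.
rng = np.random.default_rng(3)
n = 500
A = rng.normal(0, 1, n)
M = rng.normal(-A, 0.5)   # younger marriage age -> higher marriage rate
D = rng.normal(-A, 0.5)   # younger marriage age -> higher divorce rate

def ols(y, *predictors):
    """Least-squares coefficients with an intercept prepended."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b_without_A = ols(D, M)[1]    # without A: strong positive association
b_with_A = ols(D, M, A)[1]    # with A: association collapses toward zero
print(b_without_A, b_with_A)
```

Under the DAG, the M coefficient is large without A but collapses once A is included, which is the signature of the conditional independency.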

So we see that there is a slight negative association between marriage and divorce rates when you control for A. Perhaps they're not quite conditionally independent then, but we can also conclude that marriage in and of itself doesn't cause divorce. What about when we don't include A as a predictor?

Strictly speaking, we didn't need to do this second regression to test the conditional independence claim, but if the real question was to see how A affects the relationship, then this tells us that.

5H2

Counterfactual plot to see the effect of halving a state's marriage rate. First we do the counterfactual prediction in terms of standardized variables. Then we work out what halving the marriage rate means in terms of the standardized variables, and thus get the answer to our problem.
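The second step, converting a halved marriage rate into standardized units, is simple arithmetic. A sketch with hypothetical summary statistics (the real mean and sd come from the WaffleDivorce data):

```python
# Hypothetical summary statistics, labelled as such: the real mean and
# sd of the marriage rate come from the WaffleDivorce data.
m_mean, m_sd = 20.0, 4.0   # marriages per 1000; made-up illustration values

def standardize(raw):
    """Convert a raw marriage rate to standardized units."""
    return (raw - m_mean) / m_sd

# A state sitting at the average marriage rate, before and after halving:
m_before = standardize(m_mean)        # 0.0 by construction
m_after = standardize(m_mean / 2.0)   # the halved rate on the standardized scale
print(m_before, m_after)  # → 0.0 -2.5
```

The standardized counterfactual value is then fed through the model to get the implied divorce rate.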

The value above is what the divorce rate would be if we halved the marriage rate. Compare it to the mean divorce rate:

5H3

5H4

Before we bring in Southernness, let's first acknowledge that once age at marriage was controlled for, the relationship between M and D became very mild, possibly non-existent. So it might simply be the case that the DAG is D <- A -> M, with no edge M -> D. Whether or not we keep that edge, the effect of S on D is probably mediated through age at marriage. The implication is that once we control for age, the relationship between S and D will disappear.

So what we will do next is first regress D on S to see the correlation. Then we will do a multiple regression of D on S and A. If the coefficient for S disappears in the latter model, that will lend some support to the hypothesis that the only edge out of S goes to A.

So there's clearly a difference between the groups: Southernness is strongly associated with divorce rates. Now let's see what happens when we control for age at marriage.

Q: Do we want to have a separate slope that depends on Southernness?

This seems to be a case where an indicator variable is actually more useful than an index variable, because it directly tells us how much Southernness changes the mean.

So Southernness is clearly associated with higher divorce rates. Now let's control for age.
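As a sketch of what this regression does, here is a hypothetical simulation in which Southernness both lowers age at marriage and keeps a direct effect on divorce (mirroring what the real data turn out to show), with ordinary least squares standing in for the pymc3 model:

```python
import numpy as np

# Hypothetical simulation (not the real WaffleDivorce data): Southern
# states (S = 1) marry younger, and Southernness also has a direct
# effect on divorce that age at marriage does not explain.
rng = np.random.default_rng(4)
n = 500
S = rng.integers(0, 2, n)             # indicator: 1 = Southern
A = rng.normal(-0.5 * S, 1)           # South -> younger age at marriage
D = rng.normal(-A + 0.4 * S, 0.5)     # age plus a direct S effect drive D

def ols(y, *predictors):
    """Least-squares coefficients with an intercept prepended."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b_s_alone = ols(D, S)[1]     # S alone: strong positive association
b_s_adj = ols(D, S, A)[1]    # controlling for A: smaller but still positive
print(b_s_alone, b_s_adj)
```

The indicator coefficient shrinks once age is included but does not vanish, which is the pattern consistent with a remaining direct pathway.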

Southernness is still quite associated with divorce rates, even when controlling for age at marriage. So there's something else at play! We'll do one more regression, controlling for marriage rate as well, just to be sure.

Yeah, it looks like there's another pathway by which Southernness is associated with divorce rates, one that runs through neither age at marriage nor marriage rate.