Reading and standardizing the data
library(rethinking)
## Loading required package: rstan
## Loading required package: StanHeaders
## Loading required package: ggplot2
## rstan (Version 2.21.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## Loading required package: parallel
## rethinking (Version 2.13)
##
## Attaching package: 'rethinking'
## The following object is masked from 'package:stats':
##
## rstudent
data(Howell1)
d <- Howell1
d2 <- d[ d$age >= 18 , ]
head(d2)
## height weight age male
## 1 151.765 47.82561 63 1
## 2 139.700 36.48581 63 0
## 3 136.525 31.86484 65 0
## 4 156.845 53.04191 41 1
## 5 145.415 41.27687 51 0
## 6 163.830 62.99259 35 1
plot(height~weight, data=d2)
# Scaling the predictor variable weight and the response variable height,
# both to mean = 0, sd = 1.
d3<-d4<-d2
d3[,'weight']<-scale(d3[,'weight'])
d3[,'height']<-scale(d3[,'height'])
### Standardizing only the predictor weight
### I would probably not standardize the response variable height, but
### I proceed using d3 here since I did so in the previous examples.
d4[,'weight']<-scale(d4[,'weight'])
Plotting the data; note that this code requires the ggplot2 library.
library(ggplot2)
ggplot(d3,aes(y=height, x=weight, color=factor(male))) + geom_point()
Using lm()
out<-lm(height~weight+male,data=d3)
summary(out, correlation=TRUE) # without standardizing, the intercept-weight correlation would be about -0.99
##
## Call:
## lm(formula = height ~ weight + male, data = d3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.81387 -0.32943 0.06031 0.33941 1.82744
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.39355 0.04379 -8.987 <2e-16 ***
## weight 0.53471 0.03459 15.458 <2e-16 ***
## male 0.83958 0.06922 12.129 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5518 on 349 degrees of freedom
## Multiple R-squared: 0.6973, Adjusted R-squared: 0.6955
## F-statistic: 401.9 on 2 and 349 DF, p-value: < 2.2e-16
##
## Correlation of Coefficients:
## (Intercept) weight
## weight 0.39
## male -0.74 -0.52
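The comment above about a correlation of roughly -0.99 presumably refers to the unstandardized data; with weight and height standardized, the coefficient correlations shown here are much smaller. A quick check of that (my own addition, refitting on the raw adult data d2):
# Refit on the unstandardized data d2: the intercept/weight correlation is
# close to -0.99 there, which is why standardizing the predictor helps.
out_raw <- lm(height ~ weight + male, data = d2)
summary(out_raw, correlation = TRUE)$correlation["weight", "(Intercept)"]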
library(stats)
sigma(out)
## [1] 0.5517915
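For comparison, the coefficient table below appears to come from d4, where only weight is standardized and height stays on its original centimetre scale. It was presumably produced by something like the following (the call is my reconstruction; only d4 itself is defined above):
# Fit the same model on d4: weight standardized, height in centimetres.
out4 <- lm(height ~ weight + male, data = d4)
summary(out4)
sigma(out4)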
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 151.5501     0.3391  446.98   <2e-16 ***
weight        4.1399     0.2678   15.46   <2e-16 ***
male          6.5003     0.5359   12.13   <2e-16 ***

sigma: 4.272153
Using our own likelihood function. Here is the negative log-likelihood function:
nllNormReg <- function(parVec, data=d3){
  wt <- data$weight
  ht <- data$height
  male <- data$male
  intercept <- parVec[1]
  beta <- parVec[2]
  maleEffect <- parVec[3]
  mySd <- parVec[4]
  # linear predictor for height
  htPred <- intercept + beta*wt + maleEffect*male
  # negative log-likelihood under a normal error model
  nllik <- -sum(dnorm(x=ht, mean=htPred, sd=mySd, log=TRUE))
  return(nllik)
}
Minimizing the negative log likelihood:
parVec<-c(1.0,2.0,2.0,2.0) # Need some starting parameters for gradient climbing/descending methods
outLM<-optim(par=parVec,fn=nllNormReg,method="L-BFGS-B",
lower=c(-Inf,-Inf,-Inf,0.0001),upper=c(Inf,Inf,Inf,Inf))
outLM$par # maximum-likelihood estimates: intercept, beta, maleEffect, sigma
## [1] -0.3935526 0.5347038 0.8395813 0.5494369
These are the same as the estimates from lm() above (the maximum-likelihood sigma, about 0.549, is slightly smaller than lm()'s residual standard error of 0.552, because lm() divides by n - p rather than n).
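As a quick side-by-side check (my own addition), we can stack the optim() estimates against lm()'s coefficients and residual standard error:
# Compare the maximum-likelihood estimates from optim() with lm()'s
# coefficients and residual standard error; the sigmas differ slightly
# because lm() divides the residual sum of squares by n - p, not n.
rbind( optim = outLM$par ,
       lm = c( coef(out) , sigma(out) ) )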
Using quap:
library(rethinking)
modelQUAP <- quap(
alist(
height ~ dnorm( mu , sigma ) ,
mu <- intercept + beta*weight + maleEffect*male,
intercept ~ dnorm( 10 , 20 ) ,
beta ~ dnorm( 0 , 5 ) ,
maleEffect ~ dnorm( 0 , 5 ) ,
sigma ~ dunif( 0 , 50 )
) , data=d3 )
precis( modelQUAP )
## mean sd 5.5% 94.5%
## intercept -0.3934411 0.04360295 -0.4631271 -0.3237552
## beta 0.5347373 0.03444200 0.4796923 0.5897822
## maleEffect 0.8393879 0.06891719 0.7292449 0.9495308
## sigma 0.5494365 0.02070744 0.5163420 0.5825310
These results are the same as for lm() above.
The mechanisms outlined in the chapter (and the previous one) were: (1) confounding (a third variable that influences both the exposure and the outcome), (2) multicollinearity, (3) conditioning on post-treatment variables, and (4) collider bias.
This is specific to your own experience and preference and so there is no single correct answer.
The biased sample in the introduction to the chapter consists of those proposals that were good enough to pass a threshold and be selected for funding. A proposal can pass the threshold if either criterion (e.g., trustworthiness or newsworthiness) is good enough. Therefore, among those selected, there tends to be a negative association between the criteria. The association is also negative among those that aren't selected, but we don't get to see those.
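A minimal simulation of this selection effect (my own sketch; the sample size and threshold are arbitrary, not taken from the chapter):
# Newsworthiness and trustworthiness are generated independently; proposals
# are funded when their combined score clears a threshold. The funded subset
# then shows a negative association between the two criteria.
set.seed(1914)
N <- 500
nw <- rnorm(N) # newsworthiness
tw <- rnorm(N) # trustworthiness
score <- nw + tw
funded <- score >= quantile(score, 0.9) # top 10% get funded
cor(nw, tw) # roughly zero in the full sample
cor(nw[funded], tw[funded]) # clearly negative among the funded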
Conditioning on a collider similarly divides the sample into sub-samples, and the association within a sub-sample can differ from the association in the total sample. The difference is that we create the sub-samples ourselves by conditioning, yet this can produce statistical consequences similar to selection bias.
Think about a lamp. It can be either on or off. The lamp being on is caused by the switch being on and the electricity working, so the lamp's state is a collider. If we condition on the lamp's state, there are two sub-samples: lamps that are on and lamps that are off. Among those that are on, both the switch and the electricity must be on. Among those that are off, at least one of the two is off. Therefore there can be a negative association between the switch and the electricity in the sub-sample of lamps that are off, even when there is no association between the switch and the electricity in the entire sample.
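A small simulation of the lamp example (my own sketch; the probabilities are arbitrary):
# Switch and electricity are independent; the lamp is on only when both are
# on, so the lamp's state is a collider. Conditioning on the lamp being off
# induces a negative association between switch and electricity.
set.seed(6)
N <- 1e4
switch_on <- rbinom(N, 1, 0.5)
power_on <- rbinom(N, 1, 0.8)
lamp_on <- switch_on * power_on
cor(switch_on, power_on) # about zero in the full sample
off <- lamp_on == 0
cor(switch_on[off], power_on[off]) # negative among lamps that are off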
library(dagitty)
dag_6M1<-dagitty("dag{
U[unobserved]
V[unobserved]
X->Y
X<-U->B<-C->Y
U<-A->C
C<-V->Y }")
coordinates(dag_6M1)<-list(x=c(X=0,Y=2,U=0,A=1,B=1,C=2,V=2.5),
y=c(X=2,Y=2,U=1,A=0.5,B=1.5,C=1,V=1.5))
drawdag(dag_6M1)
There are 5 paths connecting X and Y: (1) X → Y (the causal path), (2) X ← U → B ← C → Y, (3) X ← U → B ← C ← V → Y, (4) X ← U ← A → C → Y, (5) X ← U ← A → C ← V → Y.
We want to leave path 1 open and make sure all of the others are closed, because they are non-causal paths that will confound inference. As before, the paths through B are already closed, since B is a collider, so we don't condition on B. The new paths through V are likewise closed, because C is a collider on those paths. That leaves path (4), X ← U ← A → C → Y, which is open; conditioning on A closes it. So it's enough to condition on A to close all non-causal paths. If you like, you can check using dagitty:
adjustmentSets( dag_6M1 , exposure="X" , outcome="Y" )
## { A }
In each case, the procedure is the same: List all paths between X and Y and figure out which non-causal paths are open and which variables you can use to close them. We’ll consider each in turn.
Top left. There are three paths between X and Y: (1) X → Y, (2) X ← Z → Y, (3) X ← Z ← A → Y. Both (2) and (3) are open, non-causal paths. Conditioning on Z is sufficient to close both.
Top right. Again three paths: (1) X → Y, (2) X → Z → Y, (3) X → Z ← A → Y. Paths (1) and (2) are both causal. We want both open. Path (3) is non-causal, but it is also closed already because Z is a collider on that path. So no conditioning.
Bottom left. Paths: (1) X → Y, (2) X → Z ← Y, (3) X ← A → Z ← Y. Only path (1) is causal. The other paths are both closed, because both contain a collider at Z. Conditioning on Z would be a disaster. So no conditioning.
Bottom right. Paths: (1) X → Y, (2) X → Z → Y, (3) X ← A → Z → Y. Paths (1) and (2) are causal. Path (3) is an open backdoor path. To close it, we must condition on A or Z, but conditioning on Z would close a causal path. So condition on A.
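We can check all four answers with dagitty. The edge lists below are my own reading of the four DAGs from the path lists above, so treat this as a sketch rather than the book's code:
library(dagitty)
# Each DAG is written from its path list; variable names match the text.
dag_tl <- dagitty("dag{ X -> Y ; Z -> X ; Z -> Y ; A -> Z ; A -> Y }") # top left
dag_tr <- dagitty("dag{ X -> Y ; X -> Z ; Z -> Y ; A -> Z ; A -> Y }") # top right
dag_bl <- dagitty("dag{ X -> Y ; X -> Z ; Y -> Z ; A -> X ; A -> Z }") # bottom left
dag_br <- dagitty("dag{ X -> Y ; X -> Z ; Z -> Y ; A -> X ; A -> Z }") # bottom right
adjustmentSets( dag_tl , exposure="X" , outcome="Y" ) # expect { Z }
adjustmentSets( dag_tr , exposure="X" , outcome="Y" ) # expect the empty set
adjustmentSets( dag_bl , exposure="X" , outcome="Y" ) # expect the empty set
adjustmentSets( dag_br , exposure="X" , outcome="Y" ) # expect { A }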
How hard this is depends upon the DAG you draw. Here is one possibility.
data(WaffleDivorce)
head(WaffleDivorce)
## Location Loc Population MedianAgeMarriage Marriage Marriage.SE Divorce
## 1 Alabama AL 4.78 25.3 20.2 1.27 12.7
## 2 Alaska AK 0.71 25.2 26.0 2.93 12.5
## 3 Arizona AZ 6.33 25.8 20.3 0.98 10.8
## 4 Arkansas AR 2.92 24.3 26.4 1.70 13.5
## 5 California CA 37.25 26.8 19.1 0.39 8.0
## 6 Colorado CO 5.03 25.7 23.5 1.24 11.6
## Divorce.SE WaffleHouses South Slaves1860 Population1860 PropSlaves1860
## 1 0.79 128 1 435080 964201 0.45
## 2 2.05 0 0 0 0 0.00
## 3 0.74 18 0 0 0 0.00
## 4 1.22 41 1 111115 435450 0.26
## 5 0.24 0 0 0 379994 0.00
## 6 0.94 11 0 0 34277 0.00
library(dagitty)
dag_6H1 <- dagitty("dag{A -> D; A -> M -> D; A <- S -> D; S -> W -> D}")
coordinates(dag_6H1) <- list(
x=c(A=0,M=1,D=2,S=1,W=2),
y=c(A=0.75,M=0.5,D=1,S=0,W=0) )
drawdag(dag_6H1)
Given this DAG, you can probably see that all we need to control for is S, an indicator for Southern states. This is needed to block the backdoor paths through the other variables.
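We can confirm this with dagitty (my own addition, using dag_6H1 defined above):
# With Waffle Houses (W) as the exposure and divorce rate (D) as the outcome,
# the only minimal adjustment set should be { S }.
adjustmentSets( dag_6H1 , exposure="W" , outcome="D" )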
It's very likely that there are unobserved variables that confound this inference. For example, if Waffle House tends to build in places where it is cheap to operate, then there are economic confounds that might influence both the presence of Waffle Houses and the divorce rate. In that case, we'd need to measure those confounds to get a valid inference. Many social scientists believe that it is usually impossible to measure all the probable confounds in these studies, so they look for natural or quasi-experiments instead.
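As a sketch of that concern, we can add a hypothetical unobserved economic confound U to the DAG and ask dagitty for an adjustment set again; with U unmeasured, no valid set remains:
# Hypothetical extension of dag_6H1: an unobserved economic variable U that
# influences both where Waffle House builds (W) and the divorce rate (D).
dag_unobs <- dagitty("dag{
  U [unobserved]
  A -> D; A -> M -> D; A <- S -> D; S -> W -> D; W <- U -> D }")
adjustmentSets( dag_unobs , exposure="W" , outcome="D" )
# should return no sets: the effect of W on D is no longer identified by adjustment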