Reading and standardizing the data
library(rethinking)
## Loading required package: rstan
## Loading required package: StanHeaders
## Loading required package: ggplot2
## rstan (Version 2.21.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## Loading required package: parallel
## rethinking (Version 2.13)
##
## Attaching package: 'rethinking'
## The following object is masked from 'package:stats':
##
## rstudent
data(Howell1)
d <- Howell1
d2 <- d[ d$age >= 18 , ]
head(d2)
## height weight age male
## 1 151.765 47.82561 63 1
## 2 139.700 36.48581 63 0
## 3 136.525 31.86484 65 0
## 4 156.845 53.04191 41 1
## 5 145.415 41.27687 51 0
## 6 163.830 62.99259 35 1
plot(height~weight, data=d2)
# Scaling the predictor variable weight and the response variable height,
# both to mean = 0, sd = 1.
d3<-d4<-d2
d3[,'weight']<-scale(d3[,'weight'])
d3[,'height']<-scale(d3[,'height'])
### Standardizing only the predictor weight
### I would probably not standardize the response variable height, but
### I proceed using d3 here since I did so in the previous examples.
d4[,'weight']<-scale(d4[,'weight'])
Plotting the data; note that this code requires the ggplot2 library.
library(ggplot2)
ggplot(d3,aes(y=height, x=weight, color=factor(male))) + geom_point()
Using lm()
out<-lm(height~weight+male,data=d3)
summary(out, correlation=TRUE) # without standardizing, the intercept-weight correlation would be about -0.99
##
## Call:
## lm(formula = height ~ weight + male, data = d3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.81387 -0.32943 0.06031 0.33941 1.82744
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.39355 0.04379 -8.987 <2e-16 ***
## weight 0.53471 0.03459 15.458 <2e-16 ***
## male 0.83958 0.06922 12.129 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5518 on 349 degrees of freedom
## Multiple R-squared: 0.6973, Adjusted R-squared: 0.6955
## F-statistic: 401.9 on 2 and 349 DF, p-value: < 2.2e-16
##
## Correlation of Coefficients:
## (Intercept) weight
## weight 0.39
## male -0.74 -0.52
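The comment above about a correlation of roughly -0.99 presumably refers to the unstandardized data; with weight and height standardized, the coefficient correlations shown here are much smaller. A quick check of that (my own addition, refitting on the raw adult data d2):
# Refit on the unstandardized data d2: the intercept/weight correlation is
# close to -0.99 there, which is why standardizing the predictor helps.
out_raw <- lm(height ~ weight + male, data = d2)
summary(out_raw, correlation = TRUE)$correlation["weight", "(Intercept)"]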
library(stats)
sigma(out)
## [1] 0.5517915
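For comparison, the coefficient table below appears to come from d4, where only weight is standardized and height stays on its original centimetre scale. It was presumably produced by something like the following (the call is my reconstruction; only d4 itself is defined above):
# Fit the same model on d4: weight standardized, height in centimetres.
out4 <- lm(height ~ weight + male, data = d4)
summary(out4)
sigma(out4)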
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 151.5501     0.3391  446.98   <2e-16 ***
weight        4.1399     0.2678   15.46   <2e-16 ***
male          6.5003     0.5359   12.13   <2e-16 ***

sigma: 4.272153
Using our own likelihood function. Here is the negative log-likelihood function:
nllNormReg <- function(parVec, data=d3){
  wt <- data$weight
  ht <- data$height
  male <- data$male
  intercept <- parVec[1]
  beta <- parVec[2]
  maleEffect <- parVec[3]
  mySd <- parVec[4]
  # linear predictor for height
  htPred <- intercept + beta*wt + maleEffect*male
  # negative log-likelihood under a normal error model
  nllik <- -sum(dnorm(x=ht, mean=htPred, sd=mySd, log=TRUE))
  return(nllik)
}
Minimizing the negative log likelihood:
parVec<-c(1.0,2.0,2.0,2.0) # Need some starting parameters for gradient climbing/descending methods
outLM<-optim(par=parVec,fn=nllNormReg,method="L-BFGS-B",
lower=c(-Inf,-Inf,-Inf,0.0001),upper=c(Inf,Inf,Inf,Inf))
outLM$par # maximum-likelihood estimates: intercept, beta, maleEffect, sigma
## [1] -0.3935526 0.5347038 0.8395813 0.5494369
These are the same as the estimates from lm() above (the maximum-likelihood sigma, about 0.549, is slightly smaller than lm()'s residual standard error of 0.552, because lm() divides by n - p rather than n).
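As a quick side-by-side check (my own addition), we can stack the optim() estimates against lm()'s coefficients and residual standard error:
# Compare the maximum-likelihood estimates from optim() with lm()'s
# coefficients and residual standard error; the sigmas differ slightly
# because lm() divides the residual sum of squares by n - p, not n.
rbind( optim = outLM$par ,
       lm = c( coef(out) , sigma(out) ) )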
Using quap:
library(rethinking)
modelQUAP <- quap(
alist(
height ~ dnorm( mu , sigma ) ,
mu <- intercept + beta*weight + maleEffect*male,
intercept ~ dnorm( 10 , 20 ) ,
beta ~ dnorm( 0 , 5 ) ,
maleEffect ~ dnorm( 0 , 5 ) ,
sigma ~ dunif( 0 , 50 )
) , data=d3 )
precis( modelQUAP )
## mean sd 5.5% 94.5%
## intercept -0.3934411 0.04360295 -0.4631271 -0.3237552
## beta 0.5347373 0.03444200 0.4796923 0.5897822
## maleEffect 0.8393879 0.06891719 0.7292449 0.9495308
## sigma 0.5494365 0.02070744 0.5163420 0.5825310
These results are the same as for lm() above.
The mechanisms outlined in the chapter (and the previous one) were: (1) confounding (a third variable that influences both the exposure and the outcome), (2) multicollinearity, (3) conditioning on post-treatment variables, and (4) collider bias.
This is specific to your own experience and preference and so there is no single correct answer.
The biased sample in the introduction to the chapter consists of those proposals that were good enough to pass a threshold and be selected for funding. A proposal can pass the threshold if either criterion (e.g., trustworthiness or newsworthiness) is good enough. Therefore, among those selected, there tends to be a negative association between the criteria. The association is also negative among those that aren't selected, but we don't get to see those.
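A minimal simulation of this selection effect (my own sketch; the sample size and threshold are arbitrary, not taken from the chapter):
# Newsworthiness and trustworthiness are generated independently; proposals
# are funded when their combined score clears a threshold. The funded subset
# then shows a negative association between the two criteria.
set.seed(1914)
N <- 500
nw <- rnorm(N) # newsworthiness
tw <- rnorm(N) # trustworthiness
score <- nw + tw
funded <- score >= quantile(score, 0.9) # top 10% get funded
cor(nw, tw) # roughly zero in the full sample
cor(nw[funded], tw[funded]) # clearly negative among the funded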
Conditioning on a collider similarly divides the sample into sub-samples, and the association within a sub-sample can differ from the association in the total sample. The difference is that we create the sub-samples ourselves by conditioning, yet this can produce statistical consequences similar to selection bias.
Think about a lamp. It can be either on or off. The lamp being on is caused by the switch being on and the electricity working, so the lamp's state is a collider. If we condition on the lamp's state, there are two sub-samples: lamps that are on and lamps that are off. Among those that are on, both the switch and the electricity must be on. Among those that are off, at least one of the two is off. Therefore there can be a negative association between the switch and the electricity in the sub-sample of lamps that are off, even when there is no association between the switch and the electricity in the entire sample.
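A small simulation of the lamp example (my own sketch; the probabilities are arbitrary):
# Switch and electricity are independent; the lamp is on only when both are
# on, so the lamp's state is a collider. Conditioning on the lamp being off
# induces a negative association between switch and electricity.
set.seed(6)
N <- 1e4
switch_on <- rbinom(N, 1, 0.5)
power_on <- rbinom(N, 1, 0.8)
lamp_on <- switch_on * power_on
cor(switch_on, power_on) # about zero in the full sample
off <- lamp_on == 0
cor(switch_on[off], power_on[off]) # negative among lamps that are off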
library(dagitty)
dag_6M1<-dagitty("dag{
U[unobserved]
V[unobserved]
X->Y
X<-U->B<-C->Y
U<-A->C
C<-V->Y }")
coordinates(dag_6M1)<-list(x=c(X=0,Y=2,U=0,A=1,B=1,C=2,V=2.5),
y=c(X=2,Y=2,U=1,A=0.5,B=1.5,C=1,V=1.5))
drawdag(dag_6M1)
There are 5 paths connecting X and Y: (1) X → Y (the causal path), (2) X ← U → B ← C → Y, (3) X ← U → B ← C ← V → Y, (4) X ← U ← A → C → Y, (5) X ← U ← A → C ← V → Y.
We want to leave path 1 open and make sure all of the others are closed, because they are non-causal paths that will confound inference. As before, the paths through B are already closed, since B is a collider, so we don't condition on B. The new paths through V are likewise closed, because C is a collider on those paths. That leaves path (4), X ← U ← A → C → Y, which is open; conditioning on A closes it. So it's enough to condition on A to close all non-causal paths. If you like, you can check using dagitty:
adjustmentSets( dag_6M1 , exposure="X" , outcome="Y" )
## { A }
In each case, the procedure is the same: List all paths between X and Y and figure out which non-causal paths are open and which variables you can use to close them. We’ll consider each in turn.
Top left. There are three paths between X and Y: (1) X → Y, (2) X ← Z → Y, (3) X ← Z ← A → Y. Both (2) and (3) are open, non-causal paths. Conditioning on Z is sufficient to close both.
Top right. Again three paths: (1) X → Y, (2) X → Z → Y, (3) X → Z ← A → Y. Paths (1) and (2) are both causal. We want both open. Path (3) is non-causal, but it is also closed already because Z is a collider on that path. So no conditioning.
Bottom left. Paths: (1) X → Y, (2) X → Z ← Y, (3) X ← A → Z ← Y. Only path (1) is causal. The other paths are both closed, because both contain a collider at Z. Conditioning on Z would be a disaster. So no conditioning.
Bottom right. Paths: (1) X → Y, (2) X → Z → Y, (3) X ← A → Z → Y. Paths (1) and (2) are causal. Path (3) is an open backdoor path. To close it, we must condition on A or Z, but conditioning on Z would close a causal path. So condition on A.
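We can check all four answers with dagitty. The edge lists below are my own reading of the four DAGs from the path lists above, so treat this as a sketch rather than the book's code:
library(dagitty)
# Each DAG is written from its path list; variable names match the text.
dag_tl <- dagitty("dag{ X -> Y ; Z -> X ; Z -> Y ; A -> Z ; A -> Y }") # top left
dag_tr <- dagitty("dag{ X -> Y ; X -> Z ; Z -> Y ; A -> Z ; A -> Y }") # top right
dag_bl <- dagitty("dag{ X -> Y ; X -> Z ; Y -> Z ; A -> X ; A -> Z }") # bottom left
dag_br <- dagitty("dag{ X -> Y ; X -> Z ; Z -> Y ; A -> X ; A -> Z }") # bottom right
adjustmentSets( dag_tl , exposure="X" , outcome="Y" ) # expect { Z }
adjustmentSets( dag_tr , exposure="X" , outcome="Y" ) # expect the empty set
adjustmentSets( dag_bl , exposure="X" , outcome="Y" ) # expect the empty set
adjustmentSets( dag_br , exposure="X" , outcome="Y" ) # expect { A }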
How hard this is depends upon the DAG you draw. Here is one possibility.
data(WaffleDivorce)
head(WaffleDivorce)
## Location Loc Population MedianAgeMarriage Marriage Marriage.SE Divorce
## 1 Alabama AL 4.78 25.3 20.2 1.27 12.7
## 2 Alaska AK 0.71 25.2 26.0 2.93 12.5
## 3 Arizona AZ 6.33 25.8 20.3 0.98 10.8
## 4 Arkansas AR 2.92 24.3 26.4 1.70 13.5
## 5 California CA 37.25 26.8 19.1 0.39 8.0
## 6 Colorado CO 5.03 25.7 23.5 1.24 11.6
## Divorce.SE WaffleHouses South Slaves1860 Population1860 PropSlaves1860
## 1 0.79 128 1 435080 964201 0.45
## 2 2.05 0 0 0 0 0.00
## 3 0.74 18 0 0 0 0.00
## 4 1.22 41 1 111115 435450 0.26
## 5 0.24 0 0 0 379994 0.00
## 6 0.94 11 0 0 34277 0.00
library(dagitty)
dag_6H1 <- dagitty("dag{A -> D; A -> M -> D; A <- S -> D; S -> W -> D}")
coordinates(dag_6H1) <- list(
x=c(A=0,M=1,D=2,S=1,W=2),
y=c(A=0.75,M=0.5,D=1,S=0,W=0) )
drawdag(dag_6H1)
Given this DAG, you can probably see that all we need to control for is S, an indicator for Southern states. This is needed to block the backdoor paths through the other variables.
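We can confirm this with dagitty (my own addition, using dag_6H1 defined above):
# With Waffle Houses (W) as the exposure and divorce rate (D) as the outcome,
# the only minimal adjustment set should be { S }.
adjustmentSets( dag_6H1 , exposure="W" , outcome="D" )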
It's very likely that there are unobserved variables that confound this inference. For example, if Waffle House tends to build in places where it is cheap to operate, then there are economic confounds that might influence both the presence of Waffle Houses and the divorce rate. In that case, we'd need to measure those confounds to get a valid inference. Many social scientists believe that it is usually impossible to measure all the probable confounds in these studies, so they look for natural or quasi-experiments instead.
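As a sketch of that concern, we can add a hypothetical unobserved economic confound U to the DAG and ask dagitty for an adjustment set again; with U unmeasured, no valid set remains:
# Hypothetical extension of dag_6H1: an unobserved economic variable U that
# influences both where Waffle House builds (W) and the divorce rate (D).
dag_unobs <- dagitty("dag{
  U [unobserved]
  A -> D; A -> M -> D; A <- S -> D; S -> W -> D; W <- U -> D }")
adjustmentSets( dag_unobs , exposure="W" , outcome="D" )
# should return no sets: the effect of W on D is no longer identified by adjustment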