Import data
draft_unreal <- read.csv("C:/Users/GrandProf/Downloads/Repos_4cleanup/Repositories_AP7/On_GitHub/BinomReg/data/draft_unreal.csv")
Test for Correlation
draft_unreal$initial_prob <- ifelse(draft_unreal$draft==1,0.6,0.4)
draft_unreal$initial_logit <- ifelse(draft_unreal$draft==1, log(0.6/0.4), log(0.4/0.6)) # logit(p) = log(p/(1-p))
draft_cormatrix <- cor(draft_unreal)
library(corrplot) # I prefer corrplot over ggcorrplot
## corrplot 0.92 loaded
corrplot(cor(draft_unreal))
Why Binomial Regression?
- Poor correlation between Bernoulli observations and predictors
- Poor correlation between derived or calculated probabilities and
predictors
- Poor correlation between the log(OR) derived from initial
probabilities and predictors
- Unable to fit the predictor variables to commonly known models, which
makes it difficult to judge whether the predictors should be included
in the binomial regression equation
- Unable to fit the response variable to commonly known models (normal,
etc.), which leads to difficulties in interpreting the results of the
binomial regression
Why the use of a Logit function?
- Constraints on using the probability directly: a linear model for the
probability can produce fitted values outside [0, 1], including
negative probabilities
- Other mathematical constraints exist when taking log(prob), which
can still produce probability values out of range (above 1)
- The solution is the logit link, which ensures that the probability
values obtained are within (0, 1)
Logit function
\[π_i = \frac{e^{β_0+β_1 x_i
}}{1 + e^{β_0+β_1 x_i}}, \qquad \frac{π_i}{1-π_i} = e^{β_0+β_1 x_i}\]
\[θ_{i}=\log\left(\frac{π_i}{1-π_i}\right)=β_0+β_1 x_i\]
Hypothesis
Mathematical
- NULL: \(Log(\frac{π_i}{1-π_i}) =
0\)
- ALTERNATE: \(Log(\frac{π_i}{1-π_i}) \neq
0\)
Procedure
- Guess initial coefficients (\(β_o\), \(β_1\) etc.) of \(θ_i\)
- Estimate \(θ_i\)
- Exponentiation of \(θ_i\)
- Calculate the probability of success \(π_i\) vs failure \(1 - π_i\)
- Log-likelihood contribution = \(Log(π_i)\) for 1 vs \(Log(1-π_i)\) for 0
- Use a maximization function to determine the final coefficients
- Remember that \(θ_i\) is the log-odds; the actual probability is \(π_i = \frac{e^{θ_i}}{1+e^{θ_i}}\)
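The steps above can be sketched directly in R. This is a minimal illustration on simulated Bernoulli data (the variable names and coefficient values are made up), maximizing the log-likelihood with `optim()` and comparing the result against `glm()`.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
true_beta <- c(-0.5, 1.2)                       # assumed coefficients for the simulation
y <- rbinom(n, size = 1, prob = plogis(true_beta[1] + true_beta[2] * x))

# Bernoulli log-likelihood as a function of the coefficients
loglik <- function(beta) {
  theta <- beta[1] + beta[2] * x                # linear predictor (log-odds)
  pi_i  <- plogis(theta)                        # exponentiate and normalize
  sum(y * log(pi_i) + (1 - y) * log(1 - pi_i))  # log(pi) for 1s, log(1-pi) for 0s
}

# Guess initial coefficients, then maximize (optim minimizes, so negate)
fit <- optim(c(0, 0), function(b) -loglik(b))
fit$par                                         # close to glm's estimates
coef(glm(y ~ x, family = binomial))
```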
Procedure in R
m1 <- glm(draft ~ pts + rebs + ast, family = binomial, data = draft_unreal)
summary(m1)
##
## Call:
## glm(formula = draft ~ pts + rebs + ast, family = binomial, data = draft_unreal)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.10071 2.95905 -0.710 0.4777
## pts -0.01430 0.09372 -0.153 0.8788
## rebs 0.61073 0.34982 1.746 0.0808 .
## ast -0.12747 0.22178 -0.575 0.5654
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26.920 on 19 degrees of freedom
## Residual deviance: 20.431 on 16 degrees of freedom
## AIC: 28.431
##
## Number of Fisher Scoring iterations: 4
coef(m1)[1] #just testing manipulation of the model
## (Intercept)
## -2.100706
Rebounds is the only predictor significant at the 0.1 level. (This is
not real data, so one should not conclude anything about whether points
and assists, i.e., making teammates better, will get one into the NBA.)
Test the model fit
Null: The model with p parameters fits well enough
Alternate: The model with p parameters does not fit well
Interpretation: a p-value > 0.05 means we fail to reject the null hypothesis, i.e., the fit is adequate
plot(density(residuals(m1)))
pchisq(20.431, 16, lower = FALSE)
## [1] 0.2014327
Beta Binomial Practice
- This accounts for overdispersion
library(aod)
#m2 <- betabin( draft ~ pts + rebs + ast, ~ pts + rebs + ast, data=draft_unreal)
#summary(m2)
# A beta-binomial regression cannot be fit to Bernoulli (n = 1) observations: overdispersion is not identifiable
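On grouped counts, however, `aod::betabin()` does run. Below is a hedged sketch on simulated grouped data (all variable names and parameter values are invented for illustration): each of 30 groups has 25 trials, and group-level success probabilities are drawn from a beta distribution to induce overdispersion.

```r
library(aod)   # provides betabin()

set.seed(2)
groups <- 30
n_i <- rep(25, groups)                  # 25 trials per group
x   <- rnorm(groups)
mu  <- plogis(0.5 * x)                  # mean success probability per group
# Draw group-level probabilities from a beta to induce overdispersion
p_i <- rbeta(groups, shape1 = 10 * mu, shape2 = 10 * (1 - mu))
k   <- rbinom(groups, size = n_i, prob = p_i)
d   <- data.frame(k = k, n = n_i, x = x)

# cbind(successes, failures) on the left; ~ 1 models a common overdispersion
m2 <- betabin(cbind(k, n - k) ~ x, ~ 1, data = d)
summary(m2)
```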
Coefficient Determination
- Marginal PMF (a function of \(π_i\),
with k = 0, 1, 2, …, n)
- Joint PMF (a function of the binomial distribution)
- Likelihood function (\(L(β)\)): like
the PMF but a function of \(β\)
- Log-likelihood function (\(l(β)\))
- A non-linear function
- Maximize
- Maximize
- Log likelihood function:
\[\mathbf{l(β)}=\sum_{i=1}^{n} \left[{k_i}{θ_i}
- {n_i}\log(1+e^{θ_i}) + \log{\binom{n_i}{k_i}}\right], \quad θ_i =
β_0 + β_1 x_i \]
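The log-likelihood formula can be checked numerically against R's `dbinom()`, term by term. The coefficient and data values below are made up purely for the check.

```r
beta  <- c(-0.3, 0.8)                  # assumed coefficients
x     <- c(-1, 0, 1.5)
n_i   <- c(10, 12, 8)                  # trials per observation
k_i   <- c(3, 7, 6)                    # successes per observation

theta <- beta[1] + beta[2] * x         # linear predictor (log-odds)
pi_i  <- plogis(theta)                 # implied success probabilities

# l(beta) = sum[ k_i*theta_i - n_i*log(1 + e^theta_i) + log choose(n_i, k_i) ]
ll_formula <- sum(k_i * theta - n_i * log(1 + exp(theta)) + lchoose(n_i, k_i))
ll_dbinom  <- sum(dbinom(k_i, size = n_i, prob = pi_i, log = TRUE))
all.equal(ll_formula, ll_dbinom)       # the two agree
```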
Interpretation
- The coefficient estimates are interpreted using a normal
approximation, because the binomial PMF is approximately bell-shaped
at certain p values (moderate p, large n)
- The shape of the PMF should inform how the results are interpreted:
at other p values the PMF is skewed, resembling a geometric
distribution, and the normal approximation is less reliable
- Under the normal approximation, z values are calculated and compared
to standard values \[
z= \frac{β_j-0}{\sqrt{var(β_j)}}\]
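The z values reported by `summary()` can be reproduced by hand from the coefficient vector and the variance-covariance matrix, exactly as in the formula above. The simulated data here stand in for the draft data.

```r
set.seed(3)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(0.4 + 0.9 * x))
m <- glm(y ~ x, family = binomial)

# z_j = (beta_j - 0) / sqrt(var(beta_j)), variances from the vcov matrix
z_by_hand <- coef(m) / sqrt(diag(vcov(m)))
z_summary <- summary(m)$coefficients[, "z value"]
all.equal(unname(z_by_hand), unname(z_summary))   # identical
```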
Binomial vs Poisson
- Poisson is best suited to counts with no fixed upper bound on the
number of trials
- Moreover, n should be large and p small when using a Poisson
distribution to approximate a binomial distribution
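A small numerical check of the approximation: with n large and p small, `dpois(k, n * p)` tracks `dbinom(k, n, p)` closely, while with n small and p large it does not.

```r
n <- 1000; p <- 0.002                  # many trials, rare success
k <- 0:10
binom_pmf <- dbinom(k, size = n, prob = p)
pois_pmf  <- dpois(k, lambda = n * p)
max(abs(binom_pmf - pois_pmf))         # tiny: the approximation is close

# With small n and large p the approximation degrades:
max(abs(dbinom(0:5, size = 5, prob = 0.5) - dpois(0:5, lambda = 5 * 0.5)))
```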
GLM Pitfalls
- Unable to maximize the likelihood when it has no finite maximum, which
shows up when analyzing skewed groups (complete separation: all
successes are confined to one group and all failures to the adjacent
group), so the coefficient estimates diverge
- A possible phenomenon in biological analyses (such as methylation
coverage)
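The separation pitfall is easy to reproduce with a toy dataset (values invented): every success lies above the cutoff and every failure below it, so `glm()` warns that fitted probabilities are numerically 0 or 1 and the slope estimate blows up.

```r
# Complete separation: successes and failures split cleanly on x
x <- c(1, 2, 3, 4, 6, 7, 8, 9)
y <- c(0, 0, 0, 0, 1, 1, 1, 1)

# glm warns ("fitted probabilities numerically 0 or 1 occurred");
# the likelihood has no finite maximum, so the slope keeps growing
m_sep <- suppressWarnings(glm(y ~ x, family = binomial))
coef(m_sep)["x"]          # very large slope
sqrt(diag(vcov(m_sep)))   # enormous standard errors
```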
Conclusion
- This traces the evolution of a linear regression model into a
binomial logistic regression model (closely related to Naive Bayes).
- The Primed Bayes is the next step in this evolution.