Import data
draft_unreal <- read.csv("C:/Users/GrandProf/Downloads/Repos_4cleanup/Repositories_AP7/On_GitHub/BinomReg/data/draft_unreal.csv")
Test for Correlation
draft_unreal$initial_prob <- ifelse(draft_unreal$draft==1,0.6,0.4)
draft_unreal$initial_logit <- ifelse(draft_unreal$draft==1, log(0.6/0.4), log(0.4/0.6)) # logit(p) = log(p/(1-p))
draft_cormatrix <- cor(draft_unreal)
library(corrplot) # I prefer corrplot over ggcorrplot
## corrplot 0.92 loaded
corrplot(cor(draft_unreal))
Why Binomial Regression?
- Poor correlation between Bernoulli observations and predictors
- Poor correlation between derived or calculated probabilities and
predictors
- Poor correlation between the log(OR) derived from initial
probabilities and predictors
- Unable to fit the predictor variables to commonly known models, which
makes it difficult to judge whether the predictors should be included
in the binomial regression equation
- Unable to fit the response variable to commonly known models (normal,
etc.), which leads to difficulties in interpreting the results of the
binomial regression
Why the use of a Logit function?
- Constraints on using the probability directly: a linear model for the
probability can produce fitted values outside [0, 1], including
negative probabilities
- Other mathematical constraints exist when taking log(prob), which
can still produce probability values out of range (above 1)
- The solution is the logit link, which ensures that the probability
values obtained are within (0, 1)
Logit function
\[π_i = \frac{e^{β_0+β_1 x_i
}}{1 + e^{β_0+β_1 x_i}}, \qquad \frac{π_i}{1-π_i} = e^{β_0+β_1 x_i}\]
\[θ_{i}=\log\left(\frac{π_i}{1-π_i}\right)=β_0+β_1 x_i\]
Hypothesis
Mathematical
- NULL: \(Log(\frac{π_i}{1-π_i}) =
0\)
- ALTERNATE: \(Log(\frac{π_i}{1-π_i}) \neq
0\)
Procedure
- Guess initial coefficients (\(β_o\), \(β_1\) etc.) of \(θ_i\)
- Estimate \(θ_i\)
- Exponentiation of \(θ_i\)
- Calculate the probability of success \(π_i\) vs failure \(1 - π_i\)
- Log-likelihood contribution = \(Log(π_i)\) for 1 vs \(Log(1-π_i)\) for 0
- Use a maximization function to determine the final coefficients
- Remember that \(θ_i\) is the log-odds; the actual probability is \(π_i = \frac{e^{θ_i}}{1+e^{θ_i}}\)
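The steps above can be sketched directly in R. This is a minimal illustration on simulated Bernoulli data (the variable names and coefficient values are made up), maximizing the log-likelihood with `optim()` and comparing the result against `glm()`.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
true_beta <- c(-0.5, 1.2)                       # assumed coefficients for the simulation
y <- rbinom(n, size = 1, prob = plogis(true_beta[1] + true_beta[2] * x))

# Bernoulli log-likelihood as a function of the coefficients
loglik <- function(beta) {
  theta <- beta[1] + beta[2] * x                # linear predictor (log-odds)
  pi_i  <- plogis(theta)                        # exponentiate and normalize
  sum(y * log(pi_i) + (1 - y) * log(1 - pi_i))  # log(pi) for 1s, log(1-pi) for 0s
}

# Guess initial coefficients, then maximize (optim minimizes, so negate)
fit <- optim(c(0, 0), function(b) -loglik(b))
fit$par                                         # close to glm's estimates
coef(glm(y ~ x, family = binomial))
```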
Procedure in R
m1 <- glm(draft ~ pts + rebs + ast, family = binomial, data = draft_unreal)
summary(m1)
##
## Call:
## glm(formula = draft ~ pts + rebs + ast, family = binomial, data = draft_unreal)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.10071 2.95905 -0.710 0.4777
## pts -0.01430 0.09372 -0.153 0.8788
## rebs 0.61073 0.34982 1.746 0.0808 .
## ast -0.12747 0.22178 -0.575 0.5654
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26.920 on 19 degrees of freedom
## Residual deviance: 20.431 on 16 degrees of freedom
## AIC: 28.431
##
## Number of Fisher Scoring iterations: 4
coef(m1)[1] #just testing manipulation of the model
## (Intercept)
## -2.100706
Rebounds is the only predictor significant at the 0.1 level. (This is
not real data, so one should not conclude anything about whether points
and assists, i.e., making teammates better, will get one into the NBA.)
Test the model fit
Null: The model with p parameters fits well enough
Alternate: The model with p parameters does not fit well
Interpretation: a p-value > 0.05 means we fail to reject the null hypothesis, i.e., the fit is adequate
plot(density(residuals(m1)))
pchisq(20.431, 16, lower = FALSE)
## [1] 0.2014327
Beta Binomial Practice
- This accounts for overdispersion
library(aod)
#m2 <- betabin( draft ~ pts + rebs + ast, ~ pts + rebs + ast, data=draft_unreal)
#summary(m2)
# A beta-binomial regression cannot be fit to Bernoulli (n = 1) observations: overdispersion is not identifiable
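On grouped counts, however, `aod::betabin()` does run. Below is a hedged sketch on simulated grouped data (all variable names and parameter values are invented for illustration): each of 30 groups has 25 trials, and group-level success probabilities are drawn from a beta distribution to induce overdispersion.

```r
library(aod)   # provides betabin()

set.seed(2)
groups <- 30
n_i <- rep(25, groups)                  # 25 trials per group
x   <- rnorm(groups)
mu  <- plogis(0.5 * x)                  # mean success probability per group
# Draw group-level probabilities from a beta to induce overdispersion
p_i <- rbeta(groups, shape1 = 10 * mu, shape2 = 10 * (1 - mu))
k   <- rbinom(groups, size = n_i, prob = p_i)
d   <- data.frame(k = k, n = n_i, x = x)

# cbind(successes, failures) on the left; ~ 1 models a common overdispersion
m2 <- betabin(cbind(k, n - k) ~ x, ~ 1, data = d)
summary(m2)
```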
Coefficient Determination
- Marginal PMF (a function of \(π_i\),
with k = 0, 1, 2, …, n)
- Joint PMF (a function of the binomial distribution)
- Likelihood function (\(L(β)\)): like
the PMF but a function of \(β\)
- Log-likelihood function (\(l(β)\))
- A non-linear function
- Maximize
- Maximize
- Log likelihood function:
\[\mathbf{l(β)}=\sum_{i=1}^{n} \left[{k_i}{θ_i}
- {n_i}\log(1+e^{θ_i}) + \log{\binom{n_i}{k_i}}\right], \quad θ_i =
β_0 + β_1 x_i \]
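The log-likelihood formula can be checked numerically against R's `dbinom()`, term by term. The coefficient and data values below are made up purely for the check.

```r
beta  <- c(-0.3, 0.8)                  # assumed coefficients
x     <- c(-1, 0, 1.5)
n_i   <- c(10, 12, 8)                  # trials per observation
k_i   <- c(3, 7, 6)                    # successes per observation

theta <- beta[1] + beta[2] * x         # linear predictor (log-odds)
pi_i  <- plogis(theta)                 # implied success probabilities

# l(beta) = sum[ k_i*theta_i - n_i*log(1 + e^theta_i) + log choose(n_i, k_i) ]
ll_formula <- sum(k_i * theta - n_i * log(1 + exp(theta)) + lchoose(n_i, k_i))
ll_dbinom  <- sum(dbinom(k_i, size = n_i, prob = pi_i, log = TRUE))
all.equal(ll_formula, ll_dbinom)       # the two agree
```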
Interpretation
- The coefficient estimates are interpreted using a normal
approximation, because the binomial PMF is approximately bell-shaped
at certain p values (moderate p, large n)
- The shape of the PMF should inform how the results are interpreted:
at other p values the PMF is skewed, resembling a geometric
distribution, and the normal approximation is less reliable
- Under the normal approximation, z values are calculated and compared
to standard values \[
z= \frac{β_j-0}{\sqrt{var(β_j)}}\]
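The z values reported by `summary()` can be reproduced by hand from the coefficient vector and the variance-covariance matrix, exactly as in the formula above. The simulated data here stand in for the draft data.

```r
set.seed(3)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(0.4 + 0.9 * x))
m <- glm(y ~ x, family = binomial)

# z_j = (beta_j - 0) / sqrt(var(beta_j)), variances from the vcov matrix
z_by_hand <- coef(m) / sqrt(diag(vcov(m)))
z_summary <- summary(m)$coefficients[, "z value"]
all.equal(unname(z_by_hand), unname(z_summary))   # identical
```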
Binomial vs Poisson
- Poisson is best suited to counts with no fixed upper bound on the
number of trials
- Moreover, n should be large and p small when using a Poisson
distribution to approximate a binomial distribution
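A small numerical check of the approximation: with n large and p small, `dpois(k, n * p)` tracks `dbinom(k, n, p)` closely, while with n small and p large it does not.

```r
n <- 1000; p <- 0.002                  # many trials, rare success
k <- 0:10
binom_pmf <- dbinom(k, size = n, prob = p)
pois_pmf  <- dpois(k, lambda = n * p)
max(abs(binom_pmf - pois_pmf))         # tiny: the approximation is close

# With small n and large p the approximation degrades:
max(abs(dbinom(0:5, size = 5, prob = 0.5) - dpois(0:5, lambda = 5 * 0.5)))
```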
GLM Pitfalls
- Unable to maximize the likelihood when it has no finite maximum, which
shows up when analyzing skewed groups (complete separation: all
successes are confined to one group and all failures to the adjacent
group), so the coefficient estimates diverge
- A possible phenomenon in biological analyses (such as methylation
coverage)
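The separation pitfall is easy to reproduce with a toy dataset (values invented): every success lies above the cutoff and every failure below it, so `glm()` warns that fitted probabilities are numerically 0 or 1 and the slope estimate blows up.

```r
# Complete separation: successes and failures split cleanly on x
x <- c(1, 2, 3, 4, 6, 7, 8, 9)
y <- c(0, 0, 0, 0, 1, 1, 1, 1)

# glm warns ("fitted probabilities numerically 0 or 1 occurred");
# the likelihood has no finite maximum, so the slope keeps growing
m_sep <- suppressWarnings(glm(y ~ x, family = binomial))
coef(m_sep)["x"]          # very large slope
sqrt(diag(vcov(m_sep)))   # enormous standard errors
```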
Conclusion
- This traces the evolution of a linear regression model into a
binomial logistic regression model (closely related to Naive Bayes).
- The Primed Bayes is the next step in this evolution.