
Import data

draft_unreal <- read.csv("C:/Users/GrandProf/Downloads/Repos_4cleanup/Repositories_AP7/On_GitHub/BinomReg/data/draft_unreal.csv") # adjust the path to the local copy of data/draft_unreal.csv

Test for Correlation

draft_unreal$initial_prob <- ifelse(draft_unreal$draft==1, 0.6, 0.4) # assumed starting probabilities
draft_unreal$initial_logit <- log(draft_unreal$initial_prob/(1-draft_unreal$initial_prob)) # logit = log(p/(1-p))
draft_cormatrix <- cor(draft_unreal)
library(corrplot) # preferred here over ggcorrplot
## corrplot 0.92 loaded
corrplot(draft_cormatrix)

Why Binomial Regression?

  • Poor correlation between the Bernoulli observations and the predictors
  • Poor correlation between the derived (calculated) probabilities and the predictors
  • Poor correlation between the log(OR) derived from the initial probabilities and the predictors
  • The predictor variables cannot be fit to a commonly known model, which makes it difficult to judge whether a given predictor should be included in the binomial regression equation
  • The response variable cannot be fit to a commonly known model (normal, etc.), which makes the results of the binomial regression harder to interpret

Why use a Logit function?

  • Modelling the probability directly with a linear predictor is constrained: fitted probabilities can fall below 0 (or above 1).
  • Taking log(prob) instead brings other mathematical constraints and can still yield fitted probabilities out of range.
  • The solution is the logit link, which guarantees that fitted probabilities stay within (0, 1) (see the sketch below).
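
As a quick check of this point, one can compare a naive linear fit on the 0/1 response with the logistic fit used later (a sketch assuming draft_unreal is already loaded; m1 is the glm fitted in the Procedure in R section):

lpm <- lm(draft ~ pts + rebs + ast, data = draft_unreal) # linear probability model on the 0/1 response
range(fitted(lpm)) # fitted values are unconstrained and may fall outside [0, 1]
# the logit-link fit m1 (see Procedure in R) keeps every fitted probability inside (0, 1):
# range(fitted(m1))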

Logit function

\[π_i = \frac{e^{β_0+β_1 x_i}}{1 + e^{β_0+β_1 x_i}}\]

\[θ_i = Log\left(\frac{π_i}{1-π_i}\right) = β_0+β_1 x_i\]
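
In R, plogis() and qlogis() implement exactly this pair of transformations, which makes a quick sanity check easy (illustrative, not part of the original analysis):

p <- 0.6
qlogis(p) # logit: Log(p / (1 - p)) ≈ 0.405
plogis(qlogis(p)) # inverse logit maps back to the probability, 0.6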

Hypothesis

Mathematical

  • NULL: \(β_1 = 0\) (the predictor has no effect on \(Log(\frac{π_i}{1-π_i})\))
  • ALTERNATE: \(β_1 \neq 0\) (the predictor shifts the log-odds)

Procedure

  1. Guess initial coefficients (\(β_0\), \(β_1\), etc.) of \(θ_i\)
  2. Estimate \(θ_i\)
  3. Exponentiate \(θ_i\)
  4. Calculate the probability of success \(π_i\) vs failure \(1 - π_i\)
  5. Each observation contributes \(Log(π_i)\) to the log-likelihood if it is a 1 and \(Log(1-π_i)\) if it is a 0
  6. Use a maximization function to determine the final coefficients (see the R sketch below)
  7. Remember that \(θ_i\) is the log-odds (linear predictor), not a probability; the fitted probability is \(π_i = \frac{e^{θ_i}}{1+e^{θ_i}}\)
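
The steps above can be sketched directly in R with optim(). This is only an illustration, assuming draft takes 0/1 values and using pts as a single predictor; the glm() call in the next section is the proper way to fit the model.

# negative log-likelihood of a single-predictor logistic model (steps 1-5)
neg_loglik <- function(beta, x, y) {
  theta <- beta[1] + beta[2] * x # step 2: linear predictor
  p <- exp(theta) / (1 + exp(theta)) # steps 3-4: probability of success
  -sum(ifelse(y == 1, log(p), log(1 - p))) # step 5: log-likelihood, negated for minimization
}
fit <- optim(c(0, 0), neg_loglik, x = draft_unreal$pts, y = draft_unreal$draft) # step 6
fit$par # final coefficients; should be close to coef(glm(draft ~ pts, family = binomial, data = draft_unreal))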

Procedure in R

m1 <- glm(draft ~ pts + rebs + ast, family = binomial, data = draft_unreal)
summary(m1)
## 
## Call:
## glm(formula = draft ~ pts + rebs + ast, family = binomial, data = draft_unreal)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -2.10071    2.95905  -0.710   0.4777  
## pts         -0.01430    0.09372  -0.153   0.8788  
## rebs         0.61073    0.34982   1.746   0.0808 .
## ast         -0.12747    0.22178  -0.575   0.5654  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26.920  on 19  degrees of freedom
## Residual deviance: 20.431  on 16  degrees of freedom
## AIC: 28.431
## 
## Number of Fisher Scoring iterations: 4
coef(m1)[1] # extracting a single coefficient to test manipulating the model object
## (Intercept) 
##   -2.100706
  • Rebounds (rebs) is significant at the 0.1 level. (These are not real data; with real data one would expect points and assists, i.e. making teammates better, to be what gets a player into the NBA.)

  • Test the model fit (residual deviance goodness-of-fit test)

    Null: the model with p parameters fits well enough
    Alternate: the model with p parameters does not fit well
    Interpretation: a p-value > 0.05 means we fail to reject the null hypothesis, i.e. the fit is adequate
plot(density(residuals(m1))) # density of the deviance residuals

pchisq(20.431, 16, lower.tail = FALSE) # p-value for the residual deviance (20.431 on 16 df); 0.201 > 0.05, so the fit is adequate
## [1] 0.2014327

Beta Binomial Practice

  • The beta-binomial model accounts for overdispersion
library(aod)
#m2 <- betabin( draft ~ pts + rebs + ast, ~ pts + rebs + ast, data=draft_unreal)
#summary(m2)
# A beta-binomial regression cannot be fit here: overdispersion is not identifiable
# from Bernoulli (0/1) observations; grouped counts (k_i successes out of n_i trials) are needed
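
For reference, aod::betabin() expects grouped binomial counts (successes out of trials) on the left-hand side. A hypothetical call on made-up grouped data (illustrative only; argument details should be checked against the aod documentation):

# hypothetical grouped data: y successes out of n trials per group
grouped <- data.frame(y = c(3, 5, 8, 11, 14, 17), n = rep(20, 6), x = 1:6)
m_bb <- betabin(cbind(y, n - y) ~ x, ~ 1, data = grouped) # aod already loaded above
summary(m_bb)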

Coefficient Determination

  • Marginal PMF (a function of \(π_i\) and \(k = 0, 1, 2, …, n\))
  • Joint PMF (a function of the binomial distribution)
  • Likelihood function \(L(β)\): like the joint PMF, but viewed as a function of \(β\)
  • Log-likelihood function \(l(β)\): a non-linear function of \(β\)
  • Maximize \(l(β)\)
  • Log-likelihood function:

\[\mathbf{l(β)}=\sum_{i=1}^{n} \left[k_i θ_i - n_i\,Log(1+e^{θ_i}) + Log\binom{n_i}{k_i}\right], \qquad θ_i = β_0 + β_1 x_i\]
where \(n_i\) is the number of trials and \(k_i\) the number of successes for observation \(i\).
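
A direct transcription of this log-likelihood in R (a sketch for grouped data with k_i successes out of n_i trials and a single predictor; variable names are illustrative):

binom_loglik <- function(beta, x, n, k) {
  theta <- beta[1] + beta[2] * x # linear predictor θ_i
  sum(k * theta - n * log(1 + exp(theta)) + lchoose(n, k)) # l(β) as written above
}
# e.g. evaluated at β = (0, 0) on the hypothetical grouped data from the previous section:
# binom_loglik(c(0, 0), grouped$x, grouped$n, grouped$y)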

Interpretation

  • The coefficients are interpreted via a normal approximation, because the PMF is approximately bell-shaped (normal-like) at moderate p values
  • The shape of the PMF governs how the results should be interpreted: at extreme p values it is skewed (closer to a geometric shape) and the normal approximation is less reliable
  • Under the normal approximation, z values are calculated and compared to standard normal values (reproduced in R below) \[ z= \frac{β_j-0}{\sqrt{var(β_j)}}\]
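
With the model m1 fitted above, these z values (and their p-values) can be reproduced by hand:

est <- coef(m1) # β_j estimates
se <- sqrt(diag(vcov(m1))) # standard errors, sqrt(var(β_j))
est / se # Wald z values, matching the summary(m1) output
2 * pnorm(abs(est / se), lower.tail = FALSE) # two-sided p-values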

Binomial vs Poisson

  • The Poisson distribution is best suited to counts with no fixed upper bound on the number of trials (effectively infinite n)
  • When using a Poisson distribution to approximate a binomial, n should be large and p small (see the numerical check below)
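
A quick numerical check of the approximation (illustrative values only): with n large and p small, the binomial and Poisson(np) probabilities nearly coincide.

n <- 1000; p <- 0.003; k <- 0:10
round(cbind(binomial = dbinom(k, n, p), poisson = dpois(k, n * p)), 5) # columns are nearly identical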

GLM Pitfalls

  • The fit cannot improve a coefficient that is already "maximized": with completely separated groups (all successes in one group, all failures in the adjacent group) the maximum-likelihood estimate diverges (see the toy example below).
  • This can arise in biological analyses such as methylation coverage data.
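
A toy illustration of this complete-separation problem (made-up data, not from draft_unreal): when every success sits above every failure on a predictor, glm() warns that fitted probabilities of 0 or 1 occurred and the slope estimate runs off toward infinity.

sep <- data.frame(y = c(0, 0, 0, 1, 1, 1), x = 1:6) # all successes have larger x than all failures
m_sep <- glm(y ~ x, family = binomial, data = sep) # expect non-convergence / fitted 0 or 1 warnings
coef(m_sep) # the slope is huge, i.e. already "maximized"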

Conclusion

  • This walk-through traces the evolution of a linear regression model into a binomial logistic regression model (the discriminative counterpart of Naive Bayes).
  • Primed Bayes is the next step in this evolution.