Pseudo R square, standard error and Z-value of logistic regression

3 minute read

I was using Weka for logistic regression of categorical dataset. However, weka does not provide statistics output such as R square, z-value of coefficients. So I implemented them and list some of them here. The logistic function can be written as:

$Y = p(1 \mid x_1,x_2, …,x_p) = \frac{1}{1+e^{-(\beta_0+\beta X)}}$

, where $p$ is the number of independent variables.

Pseudo R square

Log likelihood function

Let $p_i = p(1 \mid x_1^{(i)},x_2^{(i)}, …,x_p^{(i)})$ be the prediction of $i$th instance. The log likehood function of logistic regression is

$log(L) = \sum_{i}^{n} (1 - y^{(i)}) log(1 - p_i)+y^{(i)} log(p_i))$

, where $n$ is number of the instances.

McFadden’s Pseudo R square

The McFadden’s Pseudo R square is calculated as:

$R^2 = 1 - \frac{log(L)}{log(L_{intercept})}$

$L_{intercept}$ is the likelihood of the model with only its intercept. It is treated as a total sum of squares, and the log likelihood of the full model is treated as the sum of squared errors. The ratio of the likelihoods suggests the level of improvement over the intercept model offered by the full model.

A likelihood falls between 0 and 1, so the log of a likelihood is less than or equal to zero. If a model has a very low likelihood, then the log of the likelihood will have a larger magnitude than the log of a more likely model. Thus, a small ratio of log likelihoods indicates that the full model is a far better fit than the intercept model.

Z-value and $P>|z|$ of coefficients

Covariance matrix of coeffieicnts

The standard errors of the model coefficients are the square roots of the diagonal entries of the covariance matrix. Consider the following:

Design matrix:

$ X = \begin{bmatrix} 1 & x_{1,1} & … & x_{1,p} \
1 & x_{2,1} & … & x_{2,p} \
… & … & … & … \
1 & x_{n,1} & … & x_{n,p} \end{bmatrix} $

, where $x_{i,j}$, $x_{i,j}$ is the value of the $j$th predictor for the $i$th observations. (NOTE: This assumes a model with an intercept.)

$ V = \begin{bmatrix} \hat{\pi_1}(1-\hat{\pi_1}) & 0 & 0 & 0\
0 & \hat{\pi_2}(1-\hat{\pi_2}) & 0 & 0\
… & … & … & …\
0 & 0 & 0 & \hat{\pi_n}(1-\hat{\pi_n}) \end{bmatrix} $

, where $\hat{\pi_i}$ represents the predicted probability of class membership for observation $i$. The covariance matrix can be written as:

\[(X^TVX)^{−1}\]

Standard error of coefficients

The standard errors are the square root of the diagonal of that matrix. These are the standard errors associated with the coefficients. The standard error is used for testing whether the parameter is significantly different from 0. The standard errors can also be used to form a confidence interval for the efficient.

Z-value

By dividing the coefficient by the standard error you obtain a z-value.

In statistics, the letter “Z” is often used to refer to a random variable that has a standard normal distribution. A standard normal distribution is a normal distribution with expectation 0 and standard deviation 1. This is the normal distribution that is generally tabulated in the back of any basic statistics book.

Because of this, the term “z-value” is often used to refer to the value of a statistic that has a standard normal distribution. Sometimes it is also used to refer to percentile points from the standard normal distribution that are used to compare to the value of statistic. For example, one might refer to “the z-value corresponding to a 95% confidence interval” (which would be 1.96).

$P>|z|$

A good rule of thumb is to use a cut-off value of 2 which approximately corresponds to a two-sided hypothesis test with a significance level of $\alpha=0.05$. Coefficients having a p-value of 0.05 or less would be statistically significant (i.e., you can reject the null hypothesis and say that the coefficient is significantly different from 0). If the z-value is too big in magnitude (i.e., either too positive or too negative), it indicates that the corresponding true regression coefficient is not 0 and the corresponding X-variable matters.

References

http://www.ats.ucla.edu/stat/mult_pkg/faq/general/Psuedo_RSquareds.htm

http://www.ats.ucla.edu/stat/stata/output/stata_logistic.htm

http://logisticregressionanalysis.com/1577-what-are-z-values-in-logistic-regression/#Example

http://stats.stackexchange.com/questions/89484/how-to-compute-the-standard-errors-of-a-logistic-regressions-coefficients

Updated:

Leave a Comment