A Guide to Logistic Regression for Beginners

Overview of Logistic Regression:

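In my last article, we dove into the purpose and methodology of linear/multiple linear regression analysis. Linear regression analysis is used when you want to predict a continuous dependent (y, response) variable using a number of independent (x, explanatory) variables. Logistic regression, by contrast, is used when the dependent variable is dichotomous (binary), and it comes with its own set of assumptions: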

  • Factor level 1 of the dependent variable should represent the desired outcome (i.e., 1 = Success or True).
  • Logistic regression does not require a linear relationship between the dependent and independent variables.
  • Error terms do not need to be normally distributed.
  • Only the meaningful variables should be included.
  • Independent variables should be independent of each other; that is, there should be little or no multicollinearity.
  • Independent variables are linearly related to the log odds of the outcome (the logit transformation of its probability).
  • A large sample size is required. A general rule of thumb is: minimum sample size = (10 * number of independent variables) / (probability of the least frequent outcome). For example, with 4 independent variables and an expected probability of 0.10 for the least frequent outcome, you would want a minimum sample size of (10 * 4) / 0.10 = 400.
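
The sample-size rule of thumb in the last bullet can be written out as a small calculation. This is only a sketch: the function name min_sample_size and its parameter names are made up for illustration and do not come from any library, and the default of 10 is simply the multiplier used in the rule above.

```python
import math

def min_sample_size(n_predictors, p_least_frequent, per_variable=10):
    """Rule-of-thumb minimum sample size for logistic regression.

    n = per_variable * n_predictors / p_least_frequent
    """
    if n_predictors < 1:
        raise ValueError("n_predictors must be at least 1")
    if not 0 < p_least_frequent <= 1:
        raise ValueError("p_least_frequent must be in (0, 1]")
    raw = per_variable * n_predictors / p_least_frequent
    # Round first so floating-point noise cannot push the ceiling up by one
    return math.ceil(round(raw, 9))

# The example from the list: 4 independent variables, probability 0.10
print(min_sample_size(4, 0.10))  # 400
```

Note how quickly the requirement grows as the least frequent outcome gets rarer: halving the probability doubles the minimum sample size.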

Differences between Logistic and Linear Regression:

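What is the difference between classification and regression analysis? Simply put, the outcome of regression is continuous and the outcome of classification is categorical! Linear regression predicts a continuous value, while logistic regression, despite its name, is used for classification.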

Sigmoid Function:

With a little bit of transformation, we can create a model that estimates the probability of a given categorical outcome. Our outcome is binary, so we need to confine the model's output to a probability between 0 and 1 for any given values of the predictor(s). Here, we will use the sigmoid function, sigmoid(z) = 1 / (1 + e^(-z)), where z is the linear combination of the predictors. It limits our response variable's range to between 0 and 1.
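
Here is a minimal sketch of how the sigmoid function squeezes a linear predictor into a probability. It assumes a single predictor x; the coefficient values b0 and b1 and the 0.5 cutoff are hypothetical, picked only to show the behaviour.

```python
import math

def sigmoid(z):
    """Map any real number z to a value between 0 and 1: 1 / (1 + e^(-z))."""
    # Branch on the sign of z so math.exp is never called with a large
    # positive argument, which would overflow
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_probability(x, b0, b1):
    """P(y = 1 | x) for one predictor: sigmoid of the linear term b0 + b1*x."""
    return sigmoid(b0 + b1 * x)

# Hypothetical coefficients, chosen only for illustration
b0, b1 = -3.0, 1.5

for x in (-2, 0, 2, 4):
    p = predict_probability(x, b0, b1)
    label = 1 if p >= 0.5 else 0  # 0.5 is a common but arbitrary cutoff
    print(f"x={x:>2}  p={p:.3f}  class={label}")
```

However large or small the linear term gets, the output stays between 0 and 1, and a linear term of exactly 0 gives a probability of 0.5.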

Summary

We have covered the basics of logistic regression. We know that logistic regression is used to predict the dependent variable when that variable is dichotomous (binary). The coefficients in logistic regression are expressed in log odds: each coefficient is the change in the log odds of the outcome for a one-unit increase in the corresponding independent variable. We use Maximum Likelihood Estimation to find the best-fitting model for our data. Logistic regression is great for classification, but it has limitations and should only be used on large data sets!
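
To tie the summary together, here is a rough sketch of Maximum Likelihood Estimation for one predictor, using plain gradient ascent on the log-likelihood. Everything in it is an assumption made for illustration: the tiny data set is invented, the learning rate and step count are arbitrary, and real statistical packages use more sophisticated optimizers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(xs, ys, b0, b1):
    """Sum of log P(y_i | x_i) under the logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fit(xs, ys, lr=0.1, steps=5000):
    """Climb the log-likelihood by gradient ascent, starting from zero."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # observed minus predicted
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Invented toy data: larger x makes the outcome 1 more likely
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit(xs, ys)
print("intercept (log odds at x = 0):", round(b0, 3))
print("slope (change in log odds per unit of x):", round(b1, 3))
print("odds ratio per unit of x:", round(math.exp(b1), 3))
```

The slope is on the log-odds scale, so exponentiating it gives the factor by which the odds of the outcome are multiplied for each one-unit increase in x.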
