Overview of Logistic Regression:
In my last article, we dove into the purpose and methodology of linear/multiple linear regression analysis. Linear regression analysis is used when you want to predict a continuous dependent (y, response) variable using a number of independent (x, explanatory) variables.
But what happens when you want to predict a categorical outcome using a linear model? You might run into some trouble because a continuous variable outcome will be much harder to interpret (and most likely will not make sense). In fact, we will need to use a different model in order to predict our categorical response variable. How can we fix this problem? Luckily, our friend logistic regression allows us to make a binary prediction by taking the outcome through a transformation, such that the binary outcome can be represented as a probability of belonging to one class.
Assumptions of Logistic Regression:
- Requires the dependent variable to be a binary outcome (0 or 1, True or False, Success or Failure).
- The factor level 1 of the dependent variable should represent the desired outcome (i.e. 1 = Success or True).
- Logistic regression does not require a linear relationship between the dependent and independent variables.
- Error terms do not need to be normally distributed.
- Only the meaningful variables should be included.
- Independent variables should be independent of each other — should have little or no multicollinearity.
- Independent variables, through transformation, are linearly related to the log odds.
- Must have a large sample size. A general rule of thumb is a minimum of (10 × number of independent variables) / (probability of the least frequent outcome). For example, if you have 4 independent variables and the expected probability of the least frequent outcome is .10, then you would want a minimum sample size of 400.
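The sample-size rule of thumb is simple enough to express directly. A minimal sketch (the function name is ours, not a standard one):

```python
def min_sample_size(n_predictors, p_least_frequent):
    """Rule-of-thumb minimum sample size for logistic regression:
    10 observations per predictor, scaled up by the probability
    of the least frequent outcome."""
    return 10 * n_predictors / p_least_frequent

# 4 independent variables, least frequent outcome probability of 0.10
print(round(min_sample_size(4, 0.10)))  # 400
```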
Differences between Logistic and Linear Regression:
What is the difference between classification and regression analysis? Simply put, the outcome of regression is continuous and the outcome for classification is categorical!
Below, you see our standard linear regression model. Again, this is great for understanding continuous variables, but when trying to determine a categorical variable, we need to use a different function.
Logistic regression predicts binary (true or false, 0 or 1, no or yes) values; for example, whether a mouse is obese. Zero would be the outcome if the mouse is not obese, and one would be the outcome if the mouse is obese. In this sense, logistic regression can tell us the probability of whether a mouse is obese or not. Logistic regression is thus used for classification purposes. If the probability of a mouse being obese is greater than 50% (the standard cutoff for classification), then we will classify it as obese. If it is below 50%, then we will classify it as not obese. Instead of modeling y directly, we model the probability that y = 1 for a given x. We use linear regression to solve regression problems and logistic regression to solve classification problems.
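The 50% cutoff can be sketched with scikit-learn. The mouse weights and obesity labels below are invented toy data, purely for illustration; thresholding `predict_proba` at 0.5 is what `predict` does by default for a binary outcome:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: mouse weight (grams) and whether the mouse is obese (1) or not (0)
weights = np.array([[20], [22], [25], [28], [30], [33], [35], [38], [40], [43]])
obese   = np.array([ 0,    0,    0,    0,    0,    1,    1,    1,    1,    1])

model = LogisticRegression().fit(weights, obese)

# predict_proba returns P(obese); predict applies the standard 0.5 cutoff
p = model.predict_proba([[34]])[0, 1]
label = model.predict([[34]])[0]
print(p, label)
```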
Logistic regression can work with both continuous data (like weight) and discrete data (like gender or genotype). When we have multiple variables, we test to see if a variable’s effect on the outcome is significantly different from 0. If it is, it means the variable is helping the prediction.
Another key difference between linear regression and logistic regression is that linear regression tries to establish a linear relationship between the independent variable(s) and the dependent variable. This relationship between our variables is not necessary for logistic regression.
Finally, in linear regression, we find the optimal line as the one that minimizes the sum of the squares of its residuals. In other words, we fit the line using “least squares”. We also calculate r-squared using the residuals, which shows us how much of the variation in the data can be explained by the model.
On the other hand, logistic regression does not have the same concept of a residual, meaning we cannot use the least squares method or r-squared to evaluate our model. Instead, we use something called “maximum likelihood”. We will explain this topic more later on in this article.
Below are two graphs showing the differences in a linear regression graph and a logistic regression graph. Note that logistic regression predicts Y to be within a range of 0 to 1, while linear regression can predict Y to exceed that range (and usually does).
With a little bit of transformation, we can create a model that identifies the probability of a given categorical variable. Our outcome is binary, so we need to confine it to a probability from 0 to 1 for any given values of the predictor(s). Here, we will use the Sigmoid function, which will limit our response variables range from 0 to 1.
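As a quick illustration, the Sigmoid function can be written in a few lines of Python; whatever real number goes in, a value strictly between 0 and 1 comes out:

```python
import math

def sigmoid(z):
    """Map any real number z to a probability strictly between 0 and 1."""
    return 1 / (1 + math.exp(-z))

print(sigmoid(0))   # 0.5 -- right at the classification cutoff
print(sigmoid(5))   # close to 1
print(sigmoid(-5))  # close to 0
```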
Let P(y = 1|X) = p(X). Then:

p(X) = e^(B0 + B1X1) / (1 + e^(B0 + B1X1))

We can manipulate this equation to get the following:

p(X) / (1 - p(X)) = e^(B0 + B1X1)

We then take the log of both sides and get the following function:

log(p(X) / (1 - p(X))) = B0 + B1X1
Understanding our Coefficients:
Understanding logistic coefficients isn’t easy at first. Looking at the left-hand side of our derived equation, the probability of X divided by 1 minus the probability of X is the odds. We take the log of the odds in order to turn this bounded quantity into a continuous value that can range over all real numbers.
The coefficients within our model are the logged odds that the response value is a 1 (or success). For example, if Female = 0 and Male = 1, and the logistic regression coefficient is 0.014, then the log odds of the outcome increase by 0.014 for men, so the odds are multiplied by exp(0.014) ≈ 1.01. In order to find the probability, we would need to plug our coefficients back into our initial equation p(X) = e^(B0 + B1X1) / (1 + e^(B0 + B1X1)) in order to find the probability for a specific value.
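A small sketch of this interpretation, reusing the 0.014 coefficient from the example; the intercept of -1.5 is a hypothetical value added just so the probability calculation has something to work with:

```python
import math

b0, b1 = -1.5, 0.014  # hypothetical intercept; coefficient from the example above

# A one-unit increase in X multiplies the odds by exp(b1)
odds_ratio = math.exp(b1)
print(round(odds_ratio, 2))  # 1.01

# To recover a probability, plug the log odds back into the original equation
def prob(x):
    log_odds = b0 + b1 * x
    return math.exp(log_odds) / (1 + math.exp(log_odds))

print(prob(1))  # probability of success when X = 1 (Male)
```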
Below is a good graph to show us the relationship between probability, odds and log odds. The transformation from probability to odds is a monotonic transformation. This means that as probability increases, so do the odds! While probability ranges from 0 to 1, odds range from 0 to positive infinity.
The transformation from odds to log odds is also monotonic, as an increase in odds will increase log odds. There are multiple reasons why we transform probability into log odds. First off, it is usually harder to model a variable that has a restricted range (i.e. probability). By using the log odds, we map probability, which ranges from 0 to 1, to the logged odds of probability, a continuous value ranging from negative infinity to positive infinity. Also, the log of odds is much easier to understand and interpret than other transformations.
There are multiple ways to estimate the parameters for a given model. Thinking back to linear regression, we used least squares optimization to find the coefficients that gave us the lowest error within our model. For logistic regression, we use maximum likelihood estimation to find our best model. The parameter values are chosen so that they maximize the likelihood that the process described by the model produced the data that were actually observed. In other words, we want to find the coefficients that make the observed outcomes most probable!
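As a rough sketch of what maximum likelihood means, the (log) likelihood of the data under a candidate model can be computed by hand. The xs/ys below are toy data, and instead of running a full optimizer we simply compare two candidate coefficient pairs; the fitted model is the pair that scores highest:

```python
import math

def log_likelihood(b0, b1, xs, ys):
    """Log likelihood of observed 0/1 outcomes under a logistic model
    with intercept b0 and slope b1."""
    ll = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))       # predicted P(y = 1)
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

xs = [1.0, 2.0, 3.0, 4.0]
ys = [0, 0, 1, 1]

# A model whose probabilities track the data beats one that ignores it
print(log_likelihood(-2.5, 1.0, xs, ys) > log_likelihood(0.0, 0.0, xs, ys))  # True
```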
We have explained all of the basics of logistic regression. We know that logistic regression is used to predict the dependent value when that variable is dichotomous (binary). Our coefficients in logistic regression are expressed as the log odds of the probability of X. We use maximum likelihood estimation to find the best-fit model for our data. Logistic regression is great for classification, but it has limitations and works best on large data sets!