“Life is the sum of all your choices.”
Demystifying Logistic Regression in Machine Learning: There are varying interpretations of this quote by noted French absurdist, Albert Camus. Choices are considered to be inherent to our day-to-day lives. Some of them are taken in a whisker while others require much more consideration. In the realm of business and corporate affairs, these may often be the difference between amassing unprecedented profits or losing out on them under the pretext of opportunity costs.
In the realm of data science, decision-making is touted highly; one of its renditions is regression analysis, which is manifested in various forms with the most common one being the ubiquitous and good old linear regression. It is suitable for studying a linear relationship between cause(s) (also known as inputs) and the effect.
Popular examples which you might be familiar with include predicting continuous variables like house price (on the basis of its floor size, location in/outside the city, distance from amenities, etc.) and final average college grade at graduation (with the help of regressors like number of hours put in for studying, performance in entrances like the SAT, employment status, etc.).
While we can have independent variables which are categorical (for example, whether the house is situated within the city or whether the student in question is employed), it is notable that the dependent one is necessarily taken to be numeric. Therefore, while a linear model can predict the cost of the house or a person’s GPA, one cannot expect it to comment on whether the house was bought or the student got admitted to her desired institution in the first place!
This is where logistic regression comes to our rescue. It is especially suitable for answering fundamental questions which resemble the ones that were raised before. Its simplest form is also known as the binary regression model, which is utilized for answering yes/no questions. Think of the SAT v/s admission example, for instance, for a particular institution. Successful admissions would be signified by 1 while unsuccessful ones with a 0.
It is known that the SAT isn’t the sole factor that influences admission to college courses. However, its role is demonstrated aptly when we see relatively low scores (on the left-bottom corner) and relatively high scores (on the top right) being treated in a similar fashion every single time. The scores in the region between them, however, may/may not be selected, depending upon a host of other factors that are discussed later.
If one were to use a linear regression model for fitting the above dataset, the results would look like this:
The model seems to be in complete oblivion to the binary nature of admission. The Admitted variable is treated just like any other numeric quantity, specifically like CarPrice and GPA with implausible answers (beyond 0 and 1, the extremes) getting featured in the chart. Commenting on its accuracy isn’t feasible, which leaves us with no conclusions. Predicting the status of a specific SAT Score isn’t reliable as well.
A logistic regression model, on the other hand, takes the discreteness and binaryness of the Admitted variable into consideration. This is enunciated better with the graphical representation of the regression curve:
Instead of modeling the dependent variable as merely a number, it is treated as an odd. An odd is the ratio of the probability of the occurrence of an event with its nonoccurrence. For example, for a deck of 52 cards:
- The odds of drawing a king of diamonds are 1:51 (read as one-to-fifty-one)
- The odds of drawing a black card is 1:1 (obtained on simplifying 26:26)
The odds of admittance are plotted against the score obtained, which thus brings probability into the picture. To state simply, for every additional mark scored in the SAT, beyond a threshold (in this case, the score after which there is a non-zero possibility of admission), the odds of getting admitted increase by a fixed percentage, which varies with every model. In this case, it was 1.75%.
This simple model can now be extended to include other relevant factors as well, which thereby manifests multivariate logistic regression. For this example, we can incorporate aspects like the nature of essays submitted, performance in extracurriculars, recommendations, economic status, gender, nationality, etc. This will make our model more powerful, as it will be able to explain the dataset better.
Further, we can also have situations where the number of output classes is more than two. This is known as multinomial regression. We can, for instance, predict a customer’s satisfaction on a scale between one and five, based on the amount of time they spent on the company’s platform before buying a product. This is synonymous with classification. Notably, clustering is also a classification technique, which intends to segregate data points into clusters based on a common property.
We must understand that linear regression in itself isn’t fallacious; it’s the nature of data involved in these cases which doesn’t satisfy the underlying assumptions (known as OLS) of the linear model and thus render it irrelevant. The two paradigms are not very different if one studies them in detail. This explains why logistic regression is considered to be a misnomer for softmax regression by some machine learning practitioners.
In conclusion, a logistic regression model acts just like the one which is based on linear regression but for datasets with a different nature: we can check for its statistical significance, retrieve relevant parameters (the weights and the bias) and predict the dependent variable for unseen values. It enables efficient, data-backed, and effective decision-making, which is an alias for the right choices nobody ever made.