Explore the foundational mechanics of logistic regression, from the essential sigmoid transformation that converts linear outputs into probabilities to the logarithmic cost function that punishes confident errors. Understanding these core concepts reveals why logistic regression remains indispensable for classification tasks in machine learning.
Logistic regression serves as a cornerstone of binary classification in machine learning, transforming linear combinations into actionable probabilities. Unlike linear regression, which predicts continuous values, it answers yes/no questions (such as diagnosing a disease or detecting spam) by estimating the likelihood that an input belongs to class 1 rather than class 0.
The Sigmoid: Bridging Linear Outputs and Probabilities
The algorithm starts identically to linear regression: computing a weighted sum of the inputs plus a bias (z = w*x + b). Yet this raw score can range from negative to positive infinity, which is incompatible with the 0-to-1 range of a probability. Enter the sigmoid function:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))  # maps any real z to (0, 1)
This S-shaped curve acts as a "squasher," mapping any real number to (0,1). For example, z=0 yields 0.5; z=4.6 yields ~0.99. The sigmoid’s differentiability and intuitive probability interpretation make it ideal for gradient-based optimization.
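To make the mapping concrete, here is a minimal sketch of the full forward pass, using the sigmoid defined above with purely illustrative weights, bias, and input (these values are assumptions, not from the original post):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical weights, bias, and a single feature vector (illustrative values only)
w = np.array([0.8, -1.2])
b = 0.5
x = np.array([2.0, 0.5])

z = np.dot(w, x) + b   # linear score: 0.8*2.0 - 1.2*0.5 + 0.5 = 1.5
p = sigmoid(z)         # ~0.82: estimated probability that x belongs to class 1

print(sigmoid(0.0), sigmoid(4.6))  # 0.5 and ~0.99, matching the values quoted above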
Log Loss: Penalizing Confidence in Error
Measuring prediction accuracy requires a specialized cost function. Mean Squared Error (MSE)—common in linear regression—creates a non-convex "bumpy" loss landscape when applied to sigmoid outputs, trapping optimization in local minima. Log Loss (Binary Cross-Entropy) solves this:
J(θ) = -(1/m) * Σ_{i=1..m} [ y_i * log(h(x_i)) + (1 - y_i) * log(1 - h(x_i)) ]

where h(x_i) = sigmoid(w*x_i + b) and m is the number of training examples.
This elegant formula:
- Heavily penalizes confidently wrong predictions (e.g., predicting 0.99 for a true y=0 case)
- Uses logarithms to make costs approach infinity as incorrect certainty increases
- Functions as a "mathematical switch," activating only the relevant term based on the true label (y_i)
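As a minimal sketch (not code from the original post), the same cost can be computed with NumPy; predictions are clipped away from 0 and 1 so the logarithms stay finite:

import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    # Keep predictions strictly inside (0, 1) to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# A confidently wrong prediction is punished far more than a mild miss
y = np.array([0.0, 1.0])
print(log_loss(y, np.array([0.99, 0.99])))  # high cost: first prediction is badly wrong
print(log_loss(y, np.array([0.10, 0.90])))  # low cost: both predictions are close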
Minimizing this loss via gradient descent adjusts weights to align probabilities with reality. The result? A robust classifier balancing interpretability with efficacy—still widely deployed in fraud detection, medical diagnostics, and algorithmic decision systems decades after its inception.
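For illustration only, here is a compact sketch of that training loop, assuming full-batch gradient descent with a fixed learning rate and a tiny synthetic dataset (all values are hypothetical):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Tiny synthetic dataset: one feature, labels split around x = 0
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.1

for _ in range(1000):
    p = sigmoid(X @ w + b)           # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)  # gradient of log loss with respect to the weights
    grad_b = np.mean(p - y)          # gradient with respect to the bias
    w -= lr * grad_w
    b -= lr * grad_b

print(sigmoid(X @ w + b))  # probabilities pushed toward 0, 0, 1, 1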
Source: Mateo Lafalce's technical blog