The Significance of Normalization and Standardization in Feature Scaling

Shaily Mishra
9 min read · Mar 5, 2024

Table of Contents:

  • Why do we need feature scaling before applying learning algorithms?
  • What is Normalization and its properties? Explanation with an example
  • What is Standardization and its properties? Explanation with an example
  • Summary: Normalization vs. Standardization

In the realm of machine learning, feature scaling is a crucial preprocessing step that ensures the fair treatment of features and contributes to the effectiveness of predictive models. In this blog post, we’ll delve into the importance of two popular feature scaling techniques: normalization and standardization.

Consider a scenario where we have a dataset containing the scores of two exams: Exam 1 and Exam 2. Our objective is to predict student success based on these exam scores. Our dataset consists of 10 data points, each representing a student’s scores on Exam 1 and Exam 2, along with their pass/fail status. Here’s a summary of the dataset:

| Exam 1 Score | Exam 2 Score | Pass/Fail |
|--------------|--------------|-----------|
| 9 | 790 | Pass |
| 8 | 770 | Pass |
| 9 | 880 | Pass |
| 7 | 960 | Pass |
| 8 | 790 | Pass |
| 2 | 800 | Fail |
| 1 | 900 | Fail |
| 2 | 850 | Fail |
| 3 | 750 | Fail |
| 6 | 600 | Fail |

Let’s predict the outcome for a new student with an Exam 1 score of 1 and an Exam 2 score of 790, whose true category is Fail. We’ll use the K-Nearest Neighbors (KNN) algorithm with k=3 to make the prediction.

First, we need to calculate the distances between the new student (1, 790) and all other points in the dataset. Then, we’ll select the k=3 nearest neighbors based on these distances and determine the majority class among them to make the prediction. The nearest neighbors of (1, 790) are Student 5: (8, 790), Pass, at distance 7; Student 1: (9, 790), Pass, at distance 8; and Student 6: (2, 800), Fail, at distance of about 10.05. The majority class among these neighbors is “Pass”, so the predicted label for the new student (1, 790) is “Pass”, which contradicts the student’s true Fail category.
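The hand calculation above can be checked with a few lines of code. This is a minimal sketch in plain Python rather than a library implementation: it assumes Euclidean distance as the metric, and the names `students` and `knn_predict` are introduced here purely for illustration.

```python
import math
from collections import Counter

# The ten students from the table: (Exam 1, Exam 2, label)
students = [
    (9, 790, "Pass"), (8, 770, "Pass"), (9, 880, "Pass"), (7, 960, "Pass"),
    (8, 790, "Pass"), (2, 800, "Fail"), (1, 900, "Fail"), (2, 850, "Fail"),
    (3, 750, "Fail"), (6, 600, "Fail"),
]

def knn_predict(query, data, k=3):
    """Return the majority label among the k rows closest to query."""
    # Sort rows by Euclidean distance between the query and (Exam 1, Exam 2)
    ranked = sorted(data, key=lambda row: math.dist(query, row[:2]))
    neighbors = ranked[:k]
    votes = Counter(label for _, _, label in neighbors)
    return votes.most_common(1)[0][0], neighbors

prediction, neighbors = knn_predict((1, 790), students)
print(prediction)  # Pass
for row in neighbors:
    print(row, round(math.dist((1, 790), row[:2]), 2))
```

On the raw scores, two of the three neighbors are passing students who differ from the query by 7 or 8 points on Exam 1, which is nearly that exam’s whole range, yet they count as close because their Exam 2 scores are identical.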

The challenge we encounter in this dataset lies in the discrepancy between the scales of Exam 1 scores and Exam 2 scores. Examining the data, we notice that the range of scores for Exam 1 is significantly different from that of Exam 2. For instance, Exam 1 scores range from 1 to 9, while Exam 2 scores vary from 600 to 960. This discrepancy in scales can pose significant challenges when using these features directly for prediction tasks.

When applying machine learning algorithms, particularly distance-based algorithms like K-Nearest Neighbors (KNN), the scale of features plays a critical role in determining the similarity between data points. In our case, Exam 2 differences of tens of points swamp Exam 1 differences of a few points in the distance calculation, so the model effectively prioritizes Exam 2 purely because of its larger scale. Consequently, predictions become biased: one feature dominates the outcome regardless of how informative it actually is.

Furthermore, different scales between features can also affect the convergence speed and performance of optimization algorithms used in various machine learning models. Algorithms like gradient descent, which rely on feature scaling for efficient convergence, may experience challenges in reaching the optimal solution when faced with features of vastly different scales.
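To make the convergence point concrete, here is a simplified illustration rather than a full training loop. It assumes a quadratic loss in two weights whose curvature along each weight grows with the square of the corresponding feature’s scale, as happens for least-squares losses when features are uncorrelated. The function name `gd_steps` and the chosen curvatures are hypothetical.

```python
def gd_steps(curvatures, tol=1e-6, max_steps=1_000_000):
    """Count gradient-descent steps needed to shrink every weight below tol.

    The loss is 0.5 * sum(c_i * w_i**2), so the gradient for w_i is c_i * w_i.
    The step size is 1 / max(curvatures), the largest step that does not
    overshoot along the steepest direction.
    """
    lr = 1.0 / max(curvatures)
    weights = [1.0] * len(curvatures)
    for step in range(1, max_steps + 1):
        weights = [w - lr * c * w for w, c in zip(weights, curvatures)]
        if max(abs(w) for w in weights) < tol:
            return step
    return max_steps

# Feature scales of 1 and 100 give curvatures of 1 and 100**2.
unscaled_steps = gd_steps([1.0, 100.0 ** 2])
# After scaling, both directions have the same curvature.
scaled_steps = gd_steps([1.0, 1.0])

print(unscaled_steps, scaled_steps)
```

In this toy setting the step size is capped by the steep direction, so the shallow direction needs well over a hundred thousand steps, while the equal-curvature case converges in a single step.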

Therefore, addressing the scale discrepancy between features is crucial to ensure fair treatment and unbiased predictions in machine learning models. Techniques like normalization and standardization offer effective solutions to mitigate these challenges by scaling features to a common range or distribution, thereby enabling the model to make more reliable and accurate predictions.
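Returning to the exam example, the sketch below rescales both features to [0, 1] before running the same 3-nearest-neighbor vote. It is plain Python assuming Euclidean distance and k=3; the helper names `min_max_scale`, `scale_point` and `knn_predict` are illustrative, not taken from any library.

```python
import math
from collections import Counter

# The ten students from the table: (Exam 1, Exam 2, label)
students = [
    (9, 790, "Pass"), (8, 770, "Pass"), (9, 880, "Pass"), (7, 960, "Pass"),
    (8, 790, "Pass"), (2, 800, "Fail"), (1, 900, "Fail"), (2, 850, "Fail"),
    (3, 750, "Fail"), (6, 600, "Fail"),
]

def min_max_scale(value, lo, hi):
    """Map value from the interval [lo, hi] onto [0, 1]."""
    return (value - lo) / (hi - lo)

# Per-feature minimum and maximum, taken from the ten known students only
lo1, hi1 = min(s[0] for s in students), max(s[0] for s in students)
lo2, hi2 = min(s[1] for s in students), max(s[1] for s in students)

def scale_point(exam1, exam2):
    return (min_max_scale(exam1, lo1, hi1), min_max_scale(exam2, lo2, hi2))

def knn_predict(query, data, k=3):
    """Majority vote among the k nearest rows, measured in scaled space."""
    q = scale_point(*query)
    ranked = sorted(data, key=lambda row: math.dist(q, scale_point(row[0], row[1])))
    neighbors = ranked[:k]
    votes = Counter(row[2] for row in neighbors)
    return votes.most_common(1)[0][0], neighbors

prediction, neighbors = knn_predict((1, 790), students)
print(prediction)  # Fail
for row in neighbors:
    print(row)
```

With both exams on the same scale, the three nearest neighbors are (2, 800), (2, 850) and (3, 750), all labeled Fail, so the vote now matches the student’s true category.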

What is Normalization?

Normalization, i.e., Min-Max scaling, is a preprocessing technique used in machine learning to scale numerical features to a common range. Each value x of a feature is mapped to x' = (x - min) / (max - min), where min and max are that feature’s smallest and largest values. The goal is to transform the data so that all features have similar scales, typically within the range [0, 1]. This ensures that no single feature dominates the analysis or influences the outcome disproportionately due to its scale.
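As a small worked example, the sketch below applies the usual min-max formula, x' = (x - min) / (max - min), to the Exam 2 scores from the table above. The helper `fit_min_max` is a hypothetical name; it learns the minimum and maximum once and reuses them, which also shows what happens to a value outside the range it was fitted on.

```python
def fit_min_max(values):
    """Return a function that rescales using the min and max of values."""
    lo, hi = min(values), max(values)
    return lambda v: (v - lo) / (hi - lo)

exam2 = [790, 770, 880, 960, 790, 800, 900, 850, 750, 600]
scale = fit_min_max(exam2)

scaled = [scale(v) for v in exam2]
print([round(s, 3) for s in scaled])
# [0.528, 0.472, 0.778, 1.0, 0.528, 0.556, 0.833, 0.694, 0.417, 0.0]

# A new score above the fitted maximum lands outside [0, 1]
print(round(scale(1000), 3))  # 1.111
```

The [0, 1] guarantee therefore holds only for values inside the range seen when the minimum and maximum were computed.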

Key Characteristics of Normalization

  1. Uniform Scaling: Normalization brings all features to a uniform scale of [0, 1], which is useful when features span very different ranges.
  2. Use Cases: It is especially beneficial in scenarios where:
    • The data must fit within a specific bounded range.
    • The algorithms in use are sensitive to the magnitude of values, such as neural networks, distance-based algorithms like k-Nearest Neighbors (KNN), and clustering algorithms like k-Means.
  3. Distance Implications: Because each feature is divided by its own range, the absolute distances between data points change. The relative distances between data points can also change whenever features have different ranges, which alters the data’s geometry and therefore the behavior of distance-based algorithms.
  4. Distribution Preservation: Normalization is a linear rescaling of each feature, so it does not alter the shape of that feature’s distribution; it only shifts and scales the values to fit within the chosen range.
  5. Sensitivity to Outliers: This technique is sensitive to outliers because the scaling is based entirely on the minimum and maximum values of the dataset. A single extreme value stretches the range and compresses all remaining values into a narrow band.
  6. Algorithm Preference: Normalization is particularly advantageous for models that work best on uniformly scaled data. Neural networks, which can suffer from slow or unstable convergence when features have varying scales, often train better with normalized inputs. Algorithms that rely on distances between data points, such as k-Nearest Neighbors (KNN) and clustering methods like k-Means, also benefit, because comparable feature scales keep any one feature from dominating the distance.
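The outlier sensitivity described in the list above is easy to see on a toy example. The sketch below is plain Python with made-up scores and an illustrative helper named `min_max`; it rescales the same five values with and without one extreme value.

```python
def min_max(values):
    """Rescale values to [0, 1] using their own minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scores = [50, 55, 60, 65, 70]      # made-up values for illustration
with_outlier = scores + [1000]     # the same values plus one extreme point

print([round(v, 3) for v in min_max(scores)])
# [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(v, 3) for v in min_max(with_outlier)])
# [0.0, 0.005, 0.011, 0.016, 0.021, 1.0]
```

Once the outlier sets the maximum, the five original values occupy only about 2% of the [0, 1] interval, so differences among them become almost invisible to a distance-based model.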

Scenario: A dataset represents the scores of two different video games (Game A and Game B) played by several users. Game A scores range from 0 to 100, while Game B scores range from 0 to 10000. We want to predict a user’s satisfaction level based on these scores using a simple neural network. Here, normalization is expected to perform better because it scales all features (game scores) to the same range, facilitating the neural network’s learning process.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic data
np.random.seed(42)
scores_game_a = np.random.randint(0, 101, 1000)    # Scores from 0 to 100
scores_game_b = np.random.randint(0, 10001, 1000)  # Scores from 0 to 10000
satisfaction = scores_game_a * 0.5 + scores_game_b * 0.0005  # Simplified satisfaction formula

X = np.column_stack((scores_game_a, scores_game_b))
y = satisfaction

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalization: fit the scaler on the training split only, then reuse it on the test split
scaler_norm = MinMaxScaler()
X_train_norm = scaler_norm.fit_transform(X_train)
X_test_norm = scaler_norm.transform(X_test)

# Standardization: same pattern, fit on train and transform both splits
scaler_stand = StandardScaler()
X_train_stand = scaler_stand.fit_transform(X_train)
X_test_stand = scaler_stand.transform(X_test)

# Neural network models: one per version of the inputs
model_org = MLPRegressor(random_state=42, max_iter=1000).fit(X_train, y_train)
model_norm = MLPRegressor(random_state=42, max_iter=1000).fit(X_train_norm, y_train)
model_stand = MLPRegressor(random_state=42, max_iter=1000).fit(X_train_stand, y_train)

# Performance evaluation: score each model on inputs scaled the same way as its training data
mse_org = mean_squared_error(y_test, model_org.predict(X_test))
mse_norm = mean_squared_error(y_test, model_norm.predict(X_test_norm))
mse_stand = mean_squared_error(y_test, model_stand.predict(X_test_stand))

mse_org, mse_norm, mse_stand
# Reported output
# mse_norm : 0.006917
# mse_stand : 0.056782
# mse_org is not listed: the published figure (850405300.728131) was obtained by
# scoring model_norm on the unscaled X_test, so it does not describe model_org.

What is Standardization?

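Standardization, i.e., Z-score normalization, transforms the features of a dataset so that they each have a mean of 0 and a standard deviation of 1. Each value x is mapped to z = (x - mean) / standard deviation, using the mean and standard deviation of its own feature. Centering and rescaling every feature this way puts them on a comparable footing, so that no feature outweighs the others simply because of its units.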
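As a worked example, the sketch below standardizes the Exam 1 scores from the table at the start of the post. It uses only the standard library; `standardize` is an illustrative helper, and the population standard deviation is assumed because that is what scikit-learn’s StandardScaler uses.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std (divides by n, not n - 1)
    return [(v - mean) / std for v in values]

exam1 = [9, 8, 9, 7, 8, 2, 1, 2, 3, 6]
z = standardize(exam1)

print([round(v, 2) for v in z])
# After the transform the feature has mean 0 and standard deviation 1,
# up to floating-point error.
print(abs(statistics.fmean(z)) < 1e-9, abs(statistics.pstdev(z) - 1) < 1e-9)
```

Unlike min-max scaling, the result is not confined to a fixed interval: scores above the mean of 5.5 become positive, scores below it become negative, and the magnitude says how many standard deviations away each score lies.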

Key Characteristics of Standardization

  1. Uniformity in Scale: Standardization ensures all features are centered around zero and scaled to have unit variance, facilitating models that rely on standardized data input.
  2. Use Cases: It is particularly useful in scenarios such as:
    • When the data follows a Gaussian distribution or when the model assumes normally distributed input data.
    • When the data contains moderate outliers, or when a feature’s influence should not be tied to its raw variance.
  3. Distance Metrics: The absolute distances between data points change due to the adjustment in scale and centering. As with normalization, relative distances between data points can also change when features have different standard deviations, because each feature is divided by its own scale factor. Within a single feature, the ordering and relative spacing of values are preserved.
  4. Distribution Preservation: Standardization maintains the shape of the original data distribution, merely shifting and scaling it to standardize around zero with unit variance.
  5. Outlier Sensitivity: Compared to normalization, standardization is usually less affected by outliers, since its scale depends on the mean and standard deviation, which are computed from all values, rather than on the minimum and maximum alone. Extreme values still shift both statistics, so it is less sensitive rather than fully robust.
  6. Algorithm Preference: Standardization is favored by a variety of algorithms, especially those where the assumption of normally distributed data enhances performance or is a prerequisite. Support Vector Machines (SVM) and Principal Component Analysis (PCA) greatly benefit from standardized data, as these methods are sensitive to the scale of the input features. Linear regression models, which often assume normally distributed errors, also tend to behave better with standardized inputs, especially when they are regularized or fitted by gradient descent. Putting features on a comparable scale keeps any single feature from dominating the margin, the principal components, or the regularization penalty.
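The difference in outlier sensitivity can be quantified on a toy dataset. The sketch below, with made-up values and the hypothetical helper `value_range`, measures how much one extreme point inflates the two scale factors involved: the range used by min-max scaling and the standard deviation used by standardization.

```python
import statistics

inliers = list(range(1, 101))        # made-up data: 1, 2, ..., 100
with_outlier = inliers + [10_000]    # the same data plus one extreme value

def value_range(values):
    """Scale factor used by min-max scaling."""
    return max(values) - min(values)

# How many times larger each scale factor becomes once the outlier is added
range_growth = value_range(with_outlier) / value_range(inliers)
std_growth = statistics.pstdev(with_outlier) / statistics.pstdev(inliers)

print(f"range grows {range_growth:.0f}x, standard deviation grows {std_growth:.0f}x")
# range grows 101x, standard deviation grows 34x
```

Both scale factors are distorted by the outlier, but the range far more so, which is why standardized inliers keep more of their spread than min-max-scaled ones in this example.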

For example, we create a regression problem using synthetic data, where the goal is to predict a target variable y from two features with different means and variances. This setup is suited to comparing standardization with normalization under Ridge Regression: its regularization term penalizes all coefficients equally, which treats features evenly only when they are centered and have comparable variances.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import mean_squared_error

# Generate synthetic data
X, y = make_regression(n_samples=1000, n_features=2, noise=10)
X[:, 1] = X[:, 1] * 10 + 50  # Create features with different means and variances

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalize the data (fit on the training split only)
scaler_norm = MinMaxScaler()
X_train_norm = scaler_norm.fit_transform(X_train)
X_test_norm = scaler_norm.transform(X_test)

# Train a Ridge Regression model on the original data
model_org = Ridge(random_state=42)
model_org.fit(X_train, y_train)
y_pred_org = model_org.predict(X_test)
mse_org = mean_squared_error(y_test, y_pred_org)

# Train a Ridge Regression model on normalized data
model_norm = Ridge(random_state=42)
model_norm.fit(X_train_norm, y_train)
y_pred_norm = model_norm.predict(X_test_norm)
mse_norm = mean_squared_error(y_test, y_pred_norm)

# Now, using standardization
scaler_stand = StandardScaler()
X_train_stand = scaler_stand.fit_transform(X_train)
X_test_stand = scaler_stand.transform(X_test)

model_stand = Ridge(random_state=42)
model_stand.fit(X_train_stand, y_train)
y_pred_stand = model_stand.predict(X_test_stand)
mse_stand = mean_squared_error(y_test, y_pred_stand)

mse_org, mse_norm, mse_stand
# Reported output: mse_norm = 139.23793019646445, mse_stand = 108.22143505709198
# The published mse_org (60624.48835883284) was computed with model_norm.predict(X_test)
# rather than model_org.predict(X_test), so it does not describe model_org.
# make_regression is unseeded here, so exact values vary between runs.

Choosing between normalization and standardization depends on the specific requirements of your machine learning algorithm and the characteristics of your data. Understanding the distinctions between these techniques allows for more informed decisions in the data preprocessing phase, ultimately leading to better model performance.

As an illustrative case, consider K-means clustering on a synthetic dataset whose second feature is transformed so that it sits on a different scale from the first. In the example below, standardized inputs yield a higher silhouette score than normalized inputs.

# Imports
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

# Seed for reproducibility
np.random.seed(42)

# Generate synthetic data with two features
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
X[:, 1] = X[:, 1] * X[:, 0] + 100  # Multiply the second feature by the first and shift it by 100

# Normalize the data
scaler_norm = MinMaxScaler()
X_norm = scaler_norm.fit_transform(X)

# Standardize the data
scaler_stand = StandardScaler()
X_stand = scaler_stand.fit_transform(X)

# Apply K-means clustering on normalized data
kmeans_norm = KMeans(n_clusters=4, random_state=42).fit(X_norm)
labels_norm = kmeans_norm.labels_
score_normalized = silhouette_score(X_norm, labels_norm)

# Apply K-means clustering on standardized data
kmeans_stand = KMeans(n_clusters=4, random_state=42).fit(X_stand)
labels_stand = kmeans_stand.labels_
score_standardized = silhouette_score(X_stand, labels_stand)

print(f"Silhouette Score with Standardization: {score_standardized}")
print(f"Silhouette Score with Normalization: {score_normalized}")

# Silhouette Score with Standardization: 0.5179846669454243
# Silhouette Score with Normalization: 0.4548279288084633

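The key differences between normalization and standardization, as described in the sections above, can be summarized as follows:

| Aspect | Normalization (Min-Max scaling) | Standardization (Z-score) |
|--------|---------------------------------|---------------------------|
| Transformation | (x - min) / (max - min) | (x - mean) / standard deviation |
| Resulting scale | Bounded, typically [0, 1] | Mean 0, standard deviation 1; no fixed bounds |
| Distribution shape | Preserved | Preserved |
| Sensitivity to outliers | High, since the min and max set the scale | Lower, since the mean and standard deviation set the scale |
| Typical use cases | Data that must fit a bounded range; neural networks, KNN, k-Means | Roughly Gaussian data; SVM, PCA, linear and Ridge regression |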

This comparison highlights the importance of selecting the appropriate preprocessing technique based on the specific needs of the dataset and the chosen machine learning model. By understanding the benefits and limitations of normalization and standardization, practitioners can make more informed decisions that ultimately improve model performance.

Written by Shaily Mishra

Data Scientist @Microsoft | MS @ Machine Learning Lab, IIIT Hyderabad
