December 11, 2025

Data Preprocessing in Machine Learning 🧠

Machine learning models only look smart if you feed them good data. And most data in the real world is, frankly, a mess: missing values everywhere, weird formats, numbers and text all jumbled up. That’s why data preprocessing is a big deal.

Think of preprocessing like prepping your ingredients before you cook. Even chefs can’t save a meal if the veggies are dirty or the chicken isn’t cut right.

Here’s what we’ll cover in this guide (GitHub repo at the very bottom, btw!):

  • Why preprocessing actually matters
  • The steps you should follow
  • How to deal with missing data
  • Turning text into numbers (the easy way)
  • Why train/test splits are crucial
  • When to use fit() vs. transform()
  • Should you use Standardization or Normalization?
  • Full, copy-ready code examples

What Is Data Preprocessing?

Data preprocessing is the set of steps you use to turn ugly, messy datasets into clean, tidy numbers so your machine learning model doesn’t freak out. If you skip this stuff, your model will either fail outright or just learn nonsense.

Messy data usually has stuff like:

  • missing values
  • text columns that need converting to numbers or binary
  • wildly different ranges of numbers (like “Age” 1–90 and “Salary” 20,000–200,000)
  • different types mixed together (strings, integers, floats)
  • outliers
  • columns you don’t need

Preprocessing makes sure your data is actually usable and gives your algorithm a refined dataset to learn from.


The Standard Machine Learning Preprocessing Pipeline

Here is the standard preprocessing pipeline:

  1. Import libraries
  2. Import the dataset
  3. Fix missing data
  4. Encode categorical features
  5. Split into training & test sets
  6. Scale your features

This might look simple, but each step has some real logic behind it. Let’s break them down! Also, some of the best libraries for this kind of work are scikit-learn, pandas, NumPy, and Matplotlib; we’ll be using them throughout, so it’s good to know they exist.
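
To make steps 1 and 2 concrete, here’s a minimal sketch. The file name data.csv and the column layout are just placeholders for whatever dataset you’re working with:

Python
# Step 1: import the usual libraries (NumPy isn't used yet, but you'll want it around)
import numpy as np
import pandas as pd

# Step 2: load the dataset into a DataFrame (file name is just an example)
df = pd.read_csv("data.csv")

print(df.head())        # peek at the first few rows
print(df.isna().sum())  # count missing values per column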


Types of Variables

In any dataset, you’ll find two main kinds of columns:

  1. Feature variables (independent variables): These are the inputs or properties, the things you use to help predict something else.
  2. Target variable (dependent variable): This is what you’re trying to predict or explain. The model looks at the features to learn patterns and figure out this target.
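
For example, here’s a minimal sketch of splitting a DataFrame df into features and target, assuming a hypothetical Purchased column is what we want to predict:

Python
# Feature variables: everything except the target column
X = df.drop(columns=["Purchased"])

# Target variable: the column the model should learn to predict
y = df["Purchased"]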

Handling Missing Data

Missing values are normal in real data, but ML algorithms hate them.

Three main ways to deal with them:

1. Just Remove Them

You can drop the rows with missing values, but only do this when a small fraction of rows are affected; if you’d be removing a big chunk of the dataset, you’re throwing away more information than you’re cleaning up.
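
A minimal sketch with pandas, assuming your data lives in a DataFrame df (the Salary column is just an example):

Python
# Drop every row that has at least one missing value
df_clean = df.dropna()

# Or only drop rows where a specific column is missing
df_clean = df.dropna(subset=["Salary"])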

2. Fill Them In (“Impute”)

This is what people usually do.

  • numerical columns (age, salary): fill with the median or mean
  • categorical columns (gender, caste!!): fill with the most frequent value

Python
from sklearn.impute import SimpleImputer

# num_cols / cat_cols are lists of your numeric and categorical column names
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

df[num_cols] = num_imputer.fit_transform(df[num_cols])
df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

Encoding Categorical Data

Algorithms can’t handle words or text—they only work with numbers. So, we need to “encode” categories.

Label Encoding

Simply gives each category its own number (in scikit-learn, LabelEncoder is only meant for the target/dependent variable).

Color    Encoded
Red      0
Blue     1
Green    2

Only use label encoding if the categories have a natural order (like Low < Medium < High).
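
Here’s a minimal sketch of label-encoding the target with scikit-learn, assuming y holds string labels:

Python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_encoded = le.fit_transform(y)  # e.g. ["No", "Yes"] -> [0, 1]
print(le.classes_)               # the original categories, in encoded order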

One-Hot Encoding (Most Common, Used for Feature Variables)

Turns each category into a brand-new column that’s either 1 or 0.

Sex       Sex_Male    Sex_Female
male      1           0
female    0           1

This is almost always better for things like city, gender, country, etc.
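
For quick exploration, pandas can do the same thing in one line with get_dummies (a sketch, assuming a DataFrame df with Gender and City columns):

Python
import pandas as pd

# Each category becomes its own 0/1 column
df_encoded = pd.get_dummies(df, columns=["Gender", "City"])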

A ColumnTransformer in scikit-learn helps automate this (just pass it a list of categorical columns):

Python
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(
    transformers=[("encode", OneHotEncoder(), ['Gender','City'])],
    remainder='passthrough'   # keep the non-categorical columns untouched
)

X = ct.fit_transform(X)

Why Is It Important to Split the Dataset Into Training and Test Sets?

Honestly, this step is misunderstood by almost everyone starting out, but it’s super important.

💡 1. It’s How the Real World Works

The “test” set acts like totally new data—it helps you see if your model would actually work out in the wild.

💡 2. Stop the Model From Just Memorizing

If you train AND test on the same data, your model will just memorize stuff, look “perfect”—but it’ll fail when it sees new data.

💡 3. Honest, Fair Evaluation

By keeping the test data separate, you’re not “cheating.” Your accuracy is real, not just wishful thinking.

Typical train/test splits

  • 80% train / 20% test
  • 70% train / 30% test
  • For classification tasks, always use stratify=y so that the class ratios stay similar in both sets.
Python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
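
To sanity-check that stratify did its job, compare the class ratios in both sets (assuming y_train and y_test are pandas Series):

Python
# The proportions should look roughly identical in train and test
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))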

Feature Scaling — Normalization vs Standardization

Keep in mind this step is optional depending on the model (tree-based models like decision trees and random forests don’t need it), and you should only do it after splitting the dataset. Some models get messed up if your features are on totally different scales.

Example:

  • Age ranges from 0 to 100
  • Salary ranges from 20,000 to 200,000

If you don’t scale, salary ends up “dominating” the model just by being bigger numerically.

Standardization (Z-score scaling)

Text
x_standard = (x - mean) / standard_deviation

What happens?

  • The mean becomes 0
  • The standard deviation is 1

Use standardization as your default: it’s reliable for pretty much any model and doesn’t assume your data has fixed bounds.

Python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Normalization (Min-Max scaling)

Text
x_norm = (x - min) / (max - min)

Everything gets squeezed into the 0–1 range.

Best for:

  • Neural networks / deep learning
  • When features need to be bounded
Python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Why Use fit() on Training Data, But Not Test Data?

This is one of the golden rules of ML.

fit() = learn from training data

  • Learns mean & std for scaling
  • Remembers which categories exist
  • Stores imputation values

transform() = apply what it learned

Don’t “fit” on test data! Why?

  • You’d be “peeking” at answers you shouldn’t see
  • It lets info leak from test to train, messing up real-world performance
  • Your accuracy scores will be unrealistically high

So do it like this:

Python
scaler.fit(X_train)                          # learn mean/std from train data only
X_train_scaled = scaler.transform(X_train)   # scale train
X_test_scaled  = scaler.transform(X_test)    # scale test (no fitting!)

Full Example Pipeline (Best Practice)

Check out how to tie it all together using scikit-learn pipelines:

Python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ['Age', 'Salary']
cat_cols = ['Country', 'Gender']

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore'))  # don't crash on categories unseen during training
])

preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('cat', cat_pipeline, cat_cols)
])

X_train_prep = preprocessor.fit_transform(X_train)
X_test_prep  = preprocessor.transform(X_test)

This is basically how things are done in the real world—not just in tutorials!
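
From here, a natural next step is to drop the preprocessor into a full modeling Pipeline so fitting and transforming always happen in the right order. A minimal sketch, using LogisticRegression purely as a placeholder model:

Python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

model = Pipeline([
    ('preprocess', preprocessor),        # the ColumnTransformer defined above
    ('classifier', LogisticRegression())
])

model.fit(X_train, y_train)          # preprocessing is fit on the training data only
print(model.score(X_test, y_test))   # test data is only transformed, never fitted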


Final Thoughts: Preprocessing Isn’t Optional

You don’t need the fanciest model to get great results. What you do need is clean, well-prepped data.

When you:

  • handle missing values
  • encode categorical features
  • scale your numbers
  • split your data the right way
  • and avoid any data leaks

…your model becomes more accurate, stable, and realistic.


Here are all the resources and the full codebase:

👉 https://github.com/Proxyy587/ml-ops

If this was helpful, feel free to star the repo or subscribe to the RSS feed for more updates.