Feature Engineering & Feature Selection: Complete Beginner Guide

Feature Engineering and Feature Selection are two of the most important steps in Machine Learning.
They often improve model accuracy more than switching to a different algorithm would.

What Are Features in Data Science?

Features = Columns in your dataset.

Examples:

  • In a student dataset → Age, Marks, Attendance

  • In a house price dataset → Size, Location, Rooms, Age of house

  • In a customer dataset → Income, Gender, Spending score

Better features = Better models.
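
All code snippets in this guide assume pandas and numpy are imported and that df is a DataFrame containing the columns being discussed. A minimal, hypothetical setup:

import pandas as pd
import numpy as np

# Toy DataFrame standing in for whatever dataset you are working with
df = pd.DataFrame({
    "Income": [40000, 85000, 62000],
    "Family_Size": [4, 2, 3],
    "Age": [25, 42, 67],
})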

PART 1: Feature Engineering

Feature Engineering means creating new features or modifying existing ones to help the model learn better.

In simple words:

 Feature Engineering = Making data more useful and meaningful.

Why Is Feature Engineering Important?

Because machine learning models cannot make sense of raw, messy data on their own.

Feature Engineering helps:

  • Reveal hidden patterns

  • Reduce noise

  • Improve accuracy

  • Handle categorical/text/time data

  • Make models learn faster

It is often called the secret weapon of Data Scientists.

Common Feature Engineering Techniques

Creating New Features (Feature Creation)

Examples:

Creating Ratios

df["Income_per_family_member"] = df["Income"] / df["Family_Size"]

Combining Columns

df["Total_Score"] = df["Math"] + df["Science"] + df["English"]

Extracting Text Features

From a review column you can derive (see the sketch after this list):

  • Length of review

  • Number of words

  • Sentiment score
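
A minimal sketch, assuming the column is named "Review" (hypothetical); pandas string methods cover the first two, while a sentiment score needs an NLP library such as TextBlob or VADER (not shown):

df["Review_Length"] = df["Review"].str.len()           # characters per review
df["Word_Count"] = df["Review"].str.split().str.len()  # words per review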

Handling Categorical Variables (Encoding)

Models cannot understand text → convert to numbers.

One-Hot Encoding

df = pd.get_dummies(df, columns=["City"])  # one binary column per City value

Label Encoding

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["Gender"] = le.fit_transform(df["Gender"])  # e.g. Female → 0, Male → 1

Note: label encoding gives the categories a numeric order, so it usually suits tree-based models better than linear ones.

Binning / Bucketing

Convert continuous values into categories.

Example: Age Groups

bins = [0,18,60,100]
labels = ["Child","Adult","Senior"]
df["Age_Group"] = pd.cut(df["Age"], bins=bins, labels=labels)

Date & Time Feature Extraction

Extracting useful components from a date column:

df["Date"] = pd.to_datetime(df["Date"])
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day
df["Weekday"] = df["Date"].dt.weekday

Useful for sales analysis, time series, and forecasting.

Log Transformation

Used when data is skewed (e.g., income).

df["Income_log"] = np.log(df["Income"] + 1)

Interaction Features

Multiply or combine features to create relationship-based features.

Example: Area = Length × Width

df["Area"] = df["Length"] * df["Width"]

Polynomial Features

Useful when a linear model needs to capture curved (non-linear) relationships.

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
new_features = poly.fit_transform(df[["Price"]])  # columns: 1, Price, Price²

PART 2: Feature Selection

Feature Selection means choosing only the most important features and removing the useless ones.

In simple words:

 Feature Selection = Keep only the useful features, remove the junk.

Why?

  • Reduces overfitting

  • Improves accuracy

  • Speeds up training and prediction

  • Removes noise

Types of Feature Selection

Filter Methods

Based on statistical measures; fast, simple, and independent of any model.

Correlation

Remove highly correlated features.

df.corr()
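
A sketch of correlation-based pruning, using a common (but adjustable) cutoff of 0.9: compute absolute pairwise correlations, scan only the upper triangle so each pair is checked once, and drop one feature from every highly correlated pair.

corr = df.corr(numeric_only=True).abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep upper triangle
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)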

Chi-Square Test (for categorical data)

from sklearn.feature_selection import chi2
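
chi2 is usually applied through SelectKBest. A minimal sketch, assuming X holds non-negative features (counts or one-hot columns) and y is a categorical target:

from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=3)  # keep the 3 best-scoring features
X_new = selector.fit_transform(X, y)          # chi2 requires non-negative values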

Variance Threshold

Remove features with no variation.

from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=0)  # threshold=0 drops constant columns
X_reduced = sel.fit_transform(X)      # X = feature DataFrame

Wrapper Methods

Train a model on different feature subsets and keep the subset that performs best. More accurate than filter methods, but slower.

Forward Selection

Start with 1 feature → add more.

Backward Selection

Start with all features → remove weak ones.
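
scikit-learn implements both directions through SequentialFeatureSelector. A sketch, assuming X, y and a logistic regression as the scoring model:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",  # use "backward" to start from all features instead
)
sfs.fit(X, y)
print(X.columns[sfs.get_support()])  # names of the selected features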

RFE (Recursive Feature Elimination)

Very popular; it repeatedly fits a model and removes the weakest feature until the desired number of features remains.

from sklearn.feature_selection import RFE
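
A sketch, assuming a classification problem with X, y and a logistic regression as the base model:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)
print(X.columns[rfe.support_])  # features RFE decided to keep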

Embedded Methods

Feature selection happens inside the model.

Lasso Regression (L1 regularization)

Shrinks the coefficients of weak features all the way to zero, effectively removing them from the model.

from sklearn.linear_model import Lasso
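
A sketch, assuming a regression problem with features X and a continuous target y; scaling the features first makes the penalty treat them equally:

from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1)  # larger alpha → more coefficients pushed to zero
lasso.fit(X_scaled, y)
print(X.columns[lasso.coef_ != 0])  # features that survived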

Decision Trees / Random Forest

Report how much each feature contributed to the model's splits.

model.feature_importances_
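
A sketch with a random forest, assuming a classification problem with X, y; the importances sum to 1, and low-scoring features are candidates for removal:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))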

Python Example of Feature Selection

from sklearn.feature_selection import SelectKBest, f_classif

X = df.drop("Target", axis=1)  # all feature columns
y = df["Target"]               # classification target

# Score every feature against the target and keep the 5 best
selector = SelectKBest(score_func=f_classif, k=5)
selector.fit(X, y)

# Pair each feature with its score for easy inspection
df_scores = pd.DataFrame({"Feature": X.columns, "Score": selector.scores_})
print(df_scores.sort_values("Score", ascending=False))

Real-World Example

Problem: Predict House Prices

Dataset columns:

Size, Rooms, Age, Location, Distance, Owner_Name

Feature Engineering

  • Create “Price_per_sqft”

  • Convert “Location” into one-hot encoding

  • Convert “Age” → “New/Old” category

Feature Selection

Remove:

  • Owner_Name (not useful)

  • Distance (weak correlation)

  • Highly correlated features

After engineering & selection, the model becomes:

  • More accurate

  • Easier to train

  • More stable
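
A sketch of those steps on the hypothetical house-price columns above (Price is assumed to be available during training as the target):

df["Price_per_sqft"] = df["Price"] / df["Size"]  # new ratio feature
df["Age_Category"] = pd.cut(df["Age"], bins=[0, 10, 200], labels=["New", "Old"])
df = pd.get_dummies(df, columns=["Location"])     # one-hot encode Location
df = df.drop(columns=["Owner_Name", "Distance"])  # remove unhelpful columns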