Feature Engineering & Feature Selection: Complete Beginner Guide
Feature Engineering and Feature Selection are two of the most important steps in Machine Learning.
They can dramatically improve model accuracy, often more than changing algorithms.
What Are Features in Data Science?
Features = Columns in your dataset.
Examples:
In a student dataset → Age, Marks, Attendance
In a house price dataset → Size, Location, Rooms, Age of house
In a customer dataset → Income, Gender, Spending score
Better features = Better models.
PART 1: Feature Engineering
Feature Engineering means creating new features or modifying existing ones to help the model learn better.
In simple words:
Feature Engineering = Making data more useful and meaningful.
Why Is Feature Engineering Important?
Because machine learning models cannot make sense of raw, messy data on their own.
Feature Engineering helps:
Reveal hidden patterns
Reduce noise
Improve accuracy
Handle categorical/text/time data
Make models learn faster
It is often called the secret weapon of Data Scientists.
Common Feature Engineering Techniques
Creating New Features (Feature Creation)
Examples:
Creating Ratios
df["Income_per_family_member"] = df["Income"] / df["Family_Size"]
Combining Columns
df["Total_Score"] = df["Math"] + df["Science"] + df["English"]
Extracting Text Features
From a review column, you can derive (see the sketch after this list):
Length of review
Number of words
Sentiment score
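A minimal sketch of the first two, assuming the text lives in a column named "Review" (the column name is an assumption); a sentiment score would need an extra library such as TextBlob or VADER.
# Assumed column name "Review"; derive simple numeric features from the text
df["Review_length"] = df["Review"].str.len()              # number of characters
df["Review_words"] = df["Review"].str.split().str.len()   # number of words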
Handling Categorical Variables (Encoding)
Models cannot understand text → convert to numbers.
One-Hot Encoding
pd.get_dummies(df["City"])
Label Encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["Gender"] = le.fit_transform(df["Gender"])
Binning / Bucketing
Convert continuous values into categories.
Example: Age Groups
bins = [0,18,60,100]
labels = ["Child","Adult","Senior"]
df["Age_Group"] = pd.cut(df["Age"], bins=bins, labels=labels)
Date & Time Feature Extraction
Extracting useful components from date:
df["Date"] = pd.to_datetime(df["Date"])
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day
df["Weekday"] = df["Date"].dt.weekday
Useful for sales data, time series, and forecasting.
Log Transformation
Used when data is skewed (e.g., income).
df["Income_log"] = np.log(df["Income"] + 1)
Interaction Features
Multiply or combine features to create relationship-based features.
Example: Area = Length × Width
df["Area"] = df["Length"] * df["Width"]
Polynomial Features
Useful when a linear model needs to capture curved (non-linear) relationships.
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
new_features = poly.fit_transform(df[["Price"]])
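To see what the generated columns mean, you can ask the transformer for their names (a sketch that assumes a recent scikit-learn version, 1.0 or later):
print(poly.get_feature_names_out(["Price"]))   # e.g. ['1' 'Price' 'Price^2']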
PART 2: Feature Selection
Feature Selection means choosing only the most important features and removing the useless ones.
In simple words:
Feature Selection = Keep only the useful features, remove the junk.
Why?
Reduces overfitting
Improves accuracy
Makes models faster
Reduces training time
Removes noise
Types of Feature Selection
Filter Methods
Based on statistics — fast and simple.
Correlation
Remove highly correlated features.
df.corr()
Chi-Square Test (for categorical data)
from sklearn.feature_selection import chi2
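A hedged sketch of using chi2 together with SelectKBest; it assumes X holds only non-negative features (e.g., counts or one-hot columns) and y is a categorical target, since the chi-square test requires non-negative values.
from sklearn.feature_selection import SelectKBest, chi2

# Sketch: X non-negative feature matrix, y categorical target (assumptions)
selector = SelectKBest(score_func=chi2, k=3)   # keep the 3 highest-scoring features
X_new = selector.fit_transform(X, y)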
Variance Threshold
Remove features with no variation.
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=0)
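For example, fitting the selector drops every column that holds the same value in all rows (a sketch assuming X is a numeric feature DataFrame):
from sklearn.feature_selection import VarianceThreshold

# Sketch: X is assumed to be a numeric feature DataFrame
sel = VarianceThreshold(threshold=0)   # threshold=0 drops only constant columns
X_reduced = sel.fit_transform(X)
kept = X.columns[sel.get_support()]    # names of the surviving features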
Wrapper Methods
Try multiple combinations and select the best.
Forward Selection
Start with 1 feature → add more.
Backward Selection
Start with all features → remove weak ones.
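scikit-learn offers SequentialFeatureSelector, which can run in either direction; a minimal sketch, assuming X and y are already defined and a plain linear model is a reasonable base estimator:
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# direction="forward" starts empty and adds features one by one;
# direction="backward" starts with all features and removes them
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                                direction="forward")
sfs.fit(X, y)
selected = X.columns[sfs.get_support()]   # assumes X is a DataFrame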
RFE (Recursive Feature Elimination)
Very popular.
from sklearn.feature_selection import RFE
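A hedged usage sketch; X, y and the choice of LogisticRegression as the base estimator are assumptions:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Repeatedly fit the model and drop the weakest feature until 5 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)
selected = X.columns[rfe.support_]   # assumes X is a DataFrame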
Embedded Methods
Feature selection happens inside the model.
Lasso Regression (L1 regularization)
Shrinks the coefficients of weak features to exactly zero, effectively removing them.
from sklearn.linear_model import Lasso
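A minimal sketch: after fitting, any feature whose coefficient was shrunk to zero can be treated as removed (the alpha value is an assumption and usually needs tuning):
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)              # larger alpha zeroes out more coefficients
lasso.fit(X, y)
weak = X.columns[lasso.coef_ == 0]    # features Lasso effectively removed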
Decision Trees / Random Forest
Tell you feature importance.
model.feature_importances_
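For example, with a random forest (a sketch that assumes a classification target):
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))   # higher = more useful feature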
Python Example of Feature Selection
from sklearn.feature_selection import SelectKBest, f_classif

X = df.drop("Target", axis=1)   # all candidate features
y = df["Target"]                # the value we want to predict

# Score every feature with the ANOVA F-test and keep the 5 best
best_features = SelectKBest(score_func=f_classif, k=5)
fit = best_features.fit(X, y)

df_scores = pd.DataFrame({"Feature": X.columns, "Score": fit.scores_})
print(df_scores.sort_values("Score", ascending=False))
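To keep only the chosen columns in the data itself, a sketch continuing the example above:
X_selected = X.loc[:, fit.get_support()]   # DataFrame with only the 5 selected features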
Real-World Example
Problem: Predict House Prices
Dataset columns:
| Size | Rooms | Age | Location | Distance | Owner_Name |
|---|---|---|---|---|---|
Feature Engineering
Create “Price_per_sqft”
Convert “Location” into one-hot encoding
Convert “Age” → “New/Old” category
Feature Selection
Remove:
Owner_Name (not useful)
Distance (weak correlation)
Highly correlated features
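A rough pandas sketch of these steps; the column names follow the table above, while the Price column and the 10-year cut-off for "New" vs "Old" are assumptions:
import pandas as pd

# Feature engineering (assumes a Price column exists in the training data)
df["Price_per_sqft"] = df["Price"] / df["Size"]
df["Age_Category"] = pd.cut(df["Age"], bins=[0, 10, 100], labels=["New", "Old"])
df = pd.get_dummies(df, columns=["Location"])

# Feature selection: drop columns that add little signal
df = df.drop(columns=["Owner_Name", "Distance"])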
After engineering & selection, the model becomes:
More accurate
Easier to train
More stable