From A to Z: An Exhaustive List of Feature Engineering Techniques for Machine Learning

udit · 4 min read · Jan 11, 2023

Source: https://www.javatpoint.com/feature-engineering-for-machine-learning

When working with machine learning algorithms, feature engineering is the process of transforming raw data into useful features that models can learn from. This step is critical: the quality of the features often has a greater impact on model performance than the choice of algorithm itself. In this article, we will present an exhaustive list of feature engineering techniques that are commonly used in the field of machine learning.

A — Aggregation: This technique involves creating new features by aggregating existing features, such as calculating the mean, median, or mode of a numerical feature over groups of rows (for example, per customer or per day).
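
As a rough sketch of how this might look with pandas (the customer_id and amount columns are made-up example data):

```python
import pandas as pd

# Hypothetical transaction data: one row per purchase
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 30.0, 5.0, 7.0, 9.0],
})

# Per-customer aggregates merged back onto the original rows as new features
agg = (
    df.groupby("customer_id")["amount"]
      .agg(["mean", "median", "count"])
      .add_prefix("amount_")
      .reset_index()
)
df = df.merge(agg, on="customer_id", how="left")
print(df)
```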

B — Binning: This technique involves converting continuous numerical features into categorical features by dividing the data into bins.
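
A minimal example using pandas, with bin edges and labels chosen arbitrarily for illustration:

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 42, 68, 90])

# Fixed-width bins with human-readable labels
age_group = pd.cut(ages, bins=[0, 18, 35, 60, 120],
                   labels=["child", "young_adult", "adult", "senior"])
print(age_group)
```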

C — Combination: This technique involves creating new features by combining existing features, such as multiplying two numerical features together.

D — Discretization: This technique involves converting continuous numerical features into ordinal categorical features by dividing the data into intervals.
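
One possible way to do this is equal-frequency (quantile) discretization, sketched below with pandas on made-up income values; the integer labels preserve the ordering of the intervals:

```python
import pandas as pd

income = pd.Series([12_000, 25_000, 40_000, 58_000, 75_000, 120_000])

# Three equal-frequency intervals, encoded as ordered integer codes
income_band = pd.qcut(income, q=3, labels=[0, 1, 2])
print(income_band.astype(int))
```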

E — Expansions: This technique is mainly used for polynomial expansion: raising numerical features to powers such as squares or cubes, or generating degree-2 interaction terms between encoded categorical features.
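
A quick sketch with scikit-learn's PolynomialFeatures (the toy matrix is just for illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Degree-2 expansion: adds x1^2, x2^2 and the cross-term x1*x2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(poly.get_feature_names_out())
print(X_poly)
```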

F — Feature Hashing: This technique is useful when working with categorical features with a large number of levels. The feature hashing trick allows you to create a fixed number of new features, with the goal of preserving the information content of the original feature while reducing the dimensionality of the dataset.
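
A minimal sketch with scikit-learn's FeatureHasher, hashing a high-cardinality city feature into a fixed set of eight columns (the strings and the n_features value are arbitrary choices):

```python
from sklearn.feature_extraction import FeatureHasher

# Each sample is a list of "column=value" strings; hash collisions are possible
hasher = FeatureHasher(n_features=8, input_type="string")
X = hasher.transform([["city=London"], ["city=Paris"], ["city=Tokyo"]])
print(X.toarray())
```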

G — Grouping: This technique involves grouping observations based on certain criteria, like time, location, or even demographics, and then extracting new features from the grouped data.

H — Handling missing values: This technique involves filling missing values in the dataset, either by using imputation techniques like mean or median imputation or using more sophisticated methods like multiple imputation.
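
A minimal sketch with scikit-learn's SimpleImputer, using median imputation on a toy array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 6.0]])

# Replace each missing entry with the median of its column
imputer = SimpleImputer(strategy="median")
print(imputer.fit_transform(X))
```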

I — Interaction: This technique involves creating new features by interacting existing features, for example, by multiplying two numerical features or concatenating two categorical features.
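
A small pandas sketch of both kinds of interaction (all column names are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "width": [2.0, 3.0],
    "height": [4.0, 5.0],
    "color": ["red", "blue"],
    "size": ["S", "L"],
})

# Numerical interaction: the product of two features
df["area"] = df["width"] * df["height"]

# Categorical interaction: concatenate two features into one combined level
df["color_size"] = df["color"] + "_" + df["size"]
print(df)
```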

J — Joining: This technique is used to join features from different datasets together based on a common key or identifier.
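
A minimal pandas sketch, assuming two toy tables that share a user_id key:

```python
import pandas as pd

orders = pd.DataFrame({"user_id": [1, 2, 2], "order_value": [20.0, 15.0, 30.0]})
users = pd.DataFrame({"user_id": [1, 2], "signup_year": [2019, 2021]})

# Bring user-level attributes into the order-level table via the shared key
orders = orders.merge(users, on="user_id", how="left")
print(orders)
```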

K — K-means Clustering: This unsupervised technique can be used to group similar observations together and create new features based on the cluster assignments.
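
A short sketch with scikit-learn's KMeans; both the cluster label and the distance to the nearest centroid can be used as new features (the data and the number of clusters are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Cluster assignment and distance to the closest centroid as candidate features
cluster_id = kmeans.labels_
dist_to_centroid = np.min(kmeans.transform(X), axis=1)
print(cluster_id, dist_to_centroid)
```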

L — Log Transformations: This technique is used to normalize skewed data by applying logarithmic transformations to the features.
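
A short NumPy sketch; log1p is often preferred over a plain logarithm because it also handles zero values:

```python
import numpy as np

prices = np.array([0.0, 10.0, 100.0, 1_000.0, 10_000.0])

# Compress the long right tail of a skewed feature
log_prices = np.log1p(prices)
print(log_prices)
```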

M — Missing value ratios: This technique involves calculating the ratio of missing values for each feature (or each row), and then using these ratios as new features.
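
A small pandas sketch showing both the per-column ratio and a per-row version that can be added as a feature (the toy data is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

# Per-column missing ratio (often used to decide which columns to drop)
print(df.isna().mean())

# Per-row missing ratio, usable directly as a new feature
df["missing_ratio"] = df.isna().mean(axis=1)
print(df)
```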

N — Normalization: This technique is used to scale numerical features to a common range, such as 0 to 1.
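
A minimal sketch with scikit-learn's MinMaxScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [40.0]])

# Rescale the column to the [0, 1] range
print(MinMaxScaler().fit_transform(X))
```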

O — One-hot encoding: This technique is used to represent categorical variables in a machine learning model by creating binary features for each level of the categorical variable.
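
A one-liner with pandas (the color column is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})

# One binary column per level of the categorical variable
print(pd.get_dummies(df, columns=["color"]))
```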

P — Principal component analysis: This technique is used to reduce the dimensionality of the dataset by identifying the principal components of the data, which can then be used as new features for the model.
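
A short sketch with scikit-learn's PCA on random toy data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Keep the two directions that explain the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_)
```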

Q — Quantile transformation: This technique is used to change the distribution of numerical features to a target distribution, such as normal or uniform.
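
A sketch with scikit-learn's QuantileTransformer, mapping a skewed toy feature onto an approximately normal distribution:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X = rng.exponential(size=(300, 1))  # heavily skewed input

# Map the feature onto an approximately normal distribution
qt = QuantileTransformer(output_distribution="normal", n_quantiles=100)
X_gauss = qt.fit_transform(X)
print(X_gauss.mean(), X_gauss.std())
```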

R — Random Projection: This technique involves projecting the data into a lower-dimensional space while approximately preserving the distances between the data points.
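
A minimal sketch with scikit-learn's GaussianRandomProjection (the input and output dimensions are arbitrary):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1000))

# Project 1000 dimensions down to 20 while roughly preserving distances
proj = GaussianRandomProjection(n_components=20, random_state=0)
X_small = proj.fit_transform(X)
print(X_small.shape)
```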

S — Scaling: This technique is used to change the scale of numerical features so that features with large magnitudes do not dominate distance-based or gradient-based models.

T — Text features: This technique involves extracting features from text data, like word counts, sentence length, or even part-of-speech tags.
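
A small pandas sketch of a few hand-crafted text features (the review strings are made up; libraries such as scikit-learn or spaCy can extract richer features like n-grams or part-of-speech tags):

```python
import pandas as pd

reviews = pd.Series(["Great product, would buy again!", "Terrible."])

# Simple hand-crafted text statistics as numerical features
text_features = pd.DataFrame({
    "char_count": reviews.str.len(),
    "word_count": reviews.str.split().str.len(),
    "exclamation_count": reviews.str.count("!"),
})
print(text_features)
```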

U — Union: This technique involves appending the rows of two or more datasets that share the same columns (as opposed to joining on a key), so that models can be trained on the combined data.
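
A minimal pandas sketch, assuming two toy tables with the same columns:

```python
import pandas as pd

jan = pd.DataFrame({"user_id": [1, 2], "spend": [10.0, 20.0]})
feb = pd.DataFrame({"user_id": [3, 4], "spend": [30.0, 40.0]})

# Stack datasets with the same schema into one table
combined = pd.concat([jan, feb], ignore_index=True)
print(combined)
```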

V — Variable transformation: This technique involves applying mathematical transformations to the features, such as square roots, logarithms, or exponents.

W — Window statistics: This technique involves calculating statistics within a rolling window of observations, such as the mean, median, or standard deviation.
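
A short pandas sketch using a rolling window of three observations (the series is made up):

```python
import pandas as pd

sales = pd.Series([3, 5, 4, 8, 7, 9], name="sales")

# Rolling 3-observation statistics as new features
windowed = pd.DataFrame({
    "rolling_mean": sales.rolling(window=3).mean(),
    "rolling_std": sales.rolling(window=3).std(),
})
print(windowed)
```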

X — XGBoost feature importances: This technique is used to determine the importance of each feature in an XGBoost model by measuring how much each feature contributes to the model's splits, for example by gain or by split frequency.
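
A minimal sketch, assuming the xgboost package is installed and using synthetic data where the first feature is deliberately the most informative:

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = XGBRegressor(n_estimators=50, max_depth=3)
model.fit(X, y)

# Importance of each input column; useful for pruning weak features
print(model.feature_importances_)
```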

Y — Year-over-year change: This technique involves calculating the change in a feature over time, such as the year-over-year change in the value of a stock.
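
A small pandas sketch on made-up monthly data, where the year-over-year change is simply the percentage change twelve periods back:

```python
import pandas as pd

# Toy monthly revenue indexed by month start
revenue = pd.Series(
    range(100, 124),
    index=pd.date_range("2021-01-01", periods=24, freq="MS"),
    dtype="float",
)

# Change relative to the same month one year earlier (12 periods back)
revenue_yoy = revenue.pct_change(periods=12)
print(revenue_yoy.tail())
```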

Z — Zero-based normalization: This technique, also known as standardization or z-scoring, involves normalizing numerical features so that they have a zero mean and a standard deviation of one.
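
A minimal sketch with scikit-learn's StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0], [20.0], [40.0]])

# After scaling, the column has zero mean and unit standard deviation
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(), X_std.std())
```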

In conclusion, feature engineering is a vital step in the machine learning process, as the quality of the features can have a greater impact on model performance than the choice of algorithm itself. The techniques listed above are just a small sample of the many feature engineering techniques available, and the best approach will depend on the specific characteristics of your dataset. The key is to experiment with different techniques and find the ones that work best for your data. And as always, keep in mind the potential bias and ethical considerations of your feature engineering choices.
