
Thirumala Reddy

December 16, 2022


**STEPS IN DATA PREPROCESSING**

- Validate your data
- Handle Nulls in the dataset
- Handling categorical columns
- Handle Outliers
- Handle Imbalanced data
- Feature Selection
- Scale your data
- Split your data into training & testing sets


**1. Validate your data**

- Info, Nulls, Describe
- Check value_counts for any categorical columns whose values you are unsure about
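The validation checks above can be sketched in pandas like this (the DataFrame here is a small hypothetical example, chosen only for illustration):

```python
import pandas as pd

# Hypothetical dataset with one null and one inconsistent category ("ny")
df = pd.DataFrame({
    "age": [25, 32, None, 45, 29],
    "city": ["NY", "LA", "NY", "ny", "LA"],
})

df.info()                         # column dtypes and non-null counts
print(df.isnull().sum())          # null count per column
print(df.describe())              # summary statistics for numeric columns
print(df["city"].value_counts())  # spots suspicious categories like "ny"
```

Here value_counts immediately reveals that "ny" and "NY" are being counted as two different cities.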

**2. Handle Nulls in the dataset**

- Remove rows having Nulls (loses data)

In most cases, removing null values is not preferred. However, if we have millions of records and only a small fraction of them contain nulls, we can safely drop those rows.

- Fill Values - mean, median, mode, ffill, bfill

The preferred way of dealing with null values is to fill them in. Depending on the type of data, we can fill null values using the mean, median, mode, forward fill (ffill), or backward fill (bfill).

**--> Let's see some use cases for filling null values:**

- When an age column has null values, fill them with the mean or median.
- When a column is categorical, fill the null values with the mode.
- Suppose we have the stock value of a product for 1/05/2022, 2/05/2022 & 4/05/2022, but the value for 3/05/2022 is missing. Here it is advisable to use forward filling (ffill): the stock value of 2/05/2022 is carried forward to 3/05/2022.
- Similarly, in backward filling (bfill), the stock value of 3/05/2022 is filled with the value of 4/05/2022.
- The major drawback of forward filling is that if the first row has null values, those nulls propagate through the data; the same applies in reverse for backward filling.
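All of these fill strategies can be sketched in a few lines of pandas (the stock series and the age/colour columns below are hypothetical, mirroring the use cases above):

```python
import pandas as pd

# Hypothetical stock series with the 3/05/2022 value missing
stock = pd.Series(
    [100.0, 102.0, None, 108.0],
    index=pd.to_datetime(["2022-05-01", "2022-05-02",
                          "2022-05-03", "2022-05-04"]),
)
forward = stock.ffill()   # 3/05 takes 2/05's value (102.0)
backward = stock.bfill()  # 3/05 takes 4/05's value (108.0)

# Numeric column: mean/median imputation
age = pd.Series([25.0, 30.0, None, 45.0])
age_mean = age.fillna(age.mean())
age_median = age.fillna(age.median())

# Categorical column: mode imputation
colour = pd.Series(["red", None, "red", "blue"])
colour_mode = colour.fillna(colour.mode()[0])
```

Note how ffill and bfill give different answers for the same gap; for time series like stock prices, the choice should match how the quantity actually behaves between observations.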

**--> After filling in the null values with any of these methods, plot the distribution of the column before & after imputation. This helps us see how much the imputation has changed the distribution of the data.**

**Advantages And Disadvantages of Mean/Median Imputation:-**

Advantages:-

- Mean/Median Imputation is easy to implement.
- Median Imputation is not affected by outliers.

Disadvantages:-

- Mean/Median Imputation distorts the original variance
- Mean/Median Imputation distorts the correlation with other variables

**Machine learning ways**

We can also fill null values using machine learning models such as the KNN Imputer & the Iterative Imputer.
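A minimal sketch of KNN-based imputation with scikit-learn (the tiny matrix is hypothetical; in practice you would fit on your full feature matrix):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical feature matrix with one missing value
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0],
              [5.0, 8.0]])

# Each NaN is replaced by the mean of that feature over the
# n_neighbors nearest rows (nearest by nan-aware Euclidean distance)
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

The Iterative Imputer (sklearn.impute.IterativeImputer, which requires importing sklearn.experimental.enable_iterative_imputer first) instead models each feature with missing values as a function of the other features.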

**Click here to see how to handle null values**

**3. Handling categorical columns**

Let's discuss the types of encoding. There are two types of encoding:

1) Nominal Encoding

2) Ordinal Encoding

**1) Nominal Encoding**

Nominal encoding applies to features whose categories have no inherent order or rank (for example, colour or city).

-->The different types of Nominal Encoding are

- One Hot Encoding
- One Hot Encoding with many categories
- Mean Encoding

Among all these nominal encoding techniques, One Hot Encoding is the most commonly preferred.

**One Hot Encoder**

If a feature has more than 2 and fewer than 7 categories, use a One Hot Encoder. Using a Label Encoder on a feature with more than 2 categories risks introducing a spurious numeric order (a bias toward larger numbers); One Hot Encoding removes that bias.


**2) Ordinal Encoding**

Ordinal Encoding applies to features whose categories have a meaningful order or rank (for example, low < medium < high).

- **Label Encoding**
- **Target guided ordinal Encoding**

**Label Encoding**

Label Encoder assigns a unique number (starting from 0) to each category.

--> If NaN values are present in the data, Label Encoding will treat NaN as a separate category of its own.
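Both encoders are a few lines in scikit-learn (the colour column is a hypothetical example; note that LabelEncoder assigns integers in alphabetical order of the categories):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colours = pd.DataFrame({"colour": ["red", "blue", "green", "blue"]})

# Label Encoder: one integer per category (alphabetical: blue=0, green=1, red=2)
le = LabelEncoder()
labels = le.fit_transform(colours["colour"])

# One Hot Encoder: one binary column per category
ohe = OneHotEncoder()
onehot = ohe.fit_transform(colours[["colour"]]).toarray()
```

Strictly speaking, scikit-learn intends LabelEncoder for target labels; for input features, OneHotEncoder or OrdinalEncoder is the usual choice.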

Â Â Â Â Â Â Â

**Advantages: -**

1) Straightforward to implement

2) Does not require hours of variable exploration

3) Does not massively expand the feature space (the number of columns in the dataset)

**Disadvantages: -**

1) Does not add any information that may make the variables more predictive

2) Does not keep the information of the ignored labels

**Click here to see how to apply One Hot Encoder & Label Encoder**

**-->** We can also handle categorical columns using the get_dummies function in pandas.
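As a quick sketch, get_dummies one-hot encodes the chosen columns in place while leaving the numeric ones untouched (the DataFrame below is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "NY"], "price": [10, 20, 15]})

# 'city' becomes indicator columns city_LA and city_NY; 'price' is unchanged
dummies = pd.get_dummies(df, columns=["city"])
```

For modelling, get_dummies also accepts drop_first=True to drop one indicator column per feature and avoid redundant (perfectly collinear) columns.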

**Click here to know how to apply get_dummies to handle categorical features**

**4. Handle Outliers**

a. Remove outliers (not recommended)

b. Clip the outliers

c. Convert outliers to Nulls, then fill them as missing values

**Which Machine LearningÂ Models Are Sensitive To Outliers?**

- Naïve Bayes Classifier --> Not Sensitive to Outliers
- SVM --> Not Sensitive to Outliers
- Linear Regression --> Sensitive to Outliers
- Logistic Regression --> Sensitive to Outliers
- Decision Tree Regressor or Classifier --> Not Sensitive
- Ensemble (RF, XGBoost, GB) --> Not Sensitive
- KNN --> Not Sensitive
- K-Means --> Sensitive
- Hierarchical Clustering --> Sensitive
- PCA --> Sensitive
- Neural Networks --> Sensitive

**How to find out the outliers:-**

**1)By Using Box plot:-**

We can find outliers using a box plot: values below the plot's minimum (lower whisker) or above its maximum (upper whisker) can be treated as outliers.

**2) By Using Z-Score: -**

Values less than mean - 3*std or greater than mean + 3*std are treated as outliers.

**3) By Using IQR: -**

Values greater than the upper bound (Q3 + 1.5*IQR) or less than the lower bound (Q1 - 1.5*IQR) are treated as outliers.

-->By all the above methods, we can find out outliers.
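The Z-score and IQR rules above can be sketched directly in pandas (the series is a hypothetical one with a single obvious outlier, 500):

```python
import pandas as pd

s = pd.Series(list(range(1, 51)) + [500])  # 500 is the outlier

# Z-score rule: outside mean ± 3*std
mean, std = s.mean(), s.std()
z_outliers = s[(s < mean - 3 * std) | (s > mean + 3 * std)]

# IQR rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
```

One caveat worth knowing: a large outlier inflates the mean and std themselves, so on very small samples the Z-score rule can fail to flag it; the IQR rule is more robust in that situation.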

--> We can handle the outliers by removing them, but that results in data loss, so removing outliers is not preferred.

--> Another way to handle outliers is with the KNN IMPUTER.

--> In this method, we first convert the outliers to NaN values & then fill those NaN values using the KNN Imputer.

--> The KNN Imputer works on the concept of distance: it fills each NaN with a value derived from the nearest rows in the dataset, where distance is calculated using the Euclidean or Manhattan distance.
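The two-step recipe (mark outliers as NaN, then impute) might look like this; the DataFrame and the IQR rule on column `x` are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 2.0, 100.0],
                   "y": [10.0, 20.0, 30.0, 20.0, 25.0]})

# Step 1: mark IQR outliers in 'x' as NaN
q1, q3 = df["x"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["x"] < q1 - 1.5 * iqr) | (df["x"] > q3 + 1.5 * iqr)
df.loc[mask, "x"] = np.nan

# Step 2: fill the NaN using the nearest rows (by the other features)
filled = KNNImputer(n_neighbors=2).fit_transform(df)
```

The imputed value for the former outlier now comes from its nearest neighbours' `x` values (all between 2 and 3) rather than the extreme 100.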

**Click here to see how to handle Outliers**

**5. Feature Selection**

**a. Manual Analysis**

**b. Univariate Selection**

**c. Feature Importance**

**d. Correlation Matrix with Heatmap**

**e. PCA (Principal Component Analysis)**

--> For selecting features manually, we take the help of domain experts.

Ex: - While solving banking-domain problem statements, we consult people from the banking domain to select the features.

**Univariate selection:-**

In univariate selection, we use the SelectKBest class from scikit-learn's feature_selection module. SelectKBest internally applies a statistical test such as chi-square and returns a score for each feature; based on these scores, we select the top features.

**Click here to see how to select features by using the univariate selection**

**Correlation Matrix with Heatmap**

In this method, we build the correlation matrix and visualise it as a heatmap; from the heatmap, we can see which features matter most for predicting the output.

--> For example, in a mobile-price dataset where price_range is the output column, such a heatmap shows that ram has the highest correlation with it (0.92), followed by battery power, etc.
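The ranking step can be sketched as follows; the synthetic phone data below is a stand-in (constructed so that `ram` drives `price_range`, echoing the example above), not the original dataset:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: price_range is built mostly from ram
rng = np.random.default_rng(42)
ram = rng.uniform(500, 4000, 200)
battery_power = rng.uniform(500, 2000, 200)
price_range = 0.9 * (ram - ram.mean()) / ram.std() + 0.1 * rng.normal(size=200)

df = pd.DataFrame({"ram": ram, "battery_power": battery_power,
                   "price_range": price_range})
corr = df.corr()

# Features most correlated with the target, strongest first
ranked = corr["price_range"].drop("price_range").abs().sort_values(ascending=False)

# To visualise: import seaborn as sns; sns.heatmap(corr, annot=True, cmap="coolwarm")
```

By construction, `ram` comes out on top of the ranking, just as it did in the blog's heatmap.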

**Click here to see how to select features by using a correlation matrix with a heatmap**

**6. Scale your data** (normalize data to a certain range)

MinMax Scaler, Standard Scaler, Robust Scaler

Scaling helps bring all columns into a particular range.

**1) MinMax Scaler: -**

The MinMax Scaler converts the data to the range between 0 & 1 using the min-max formula:

X_scaled = (X - X.min) / (X.max - X.min)

**2) Standard Scaler: -**

The Standard Scaler converts the data values such that mean = 0 & standard deviation = 1.

**3) Robust Scaler: -**

The Robust Scaler scales a feature using the median and quantiles: subtract the median from every observation, then divide by the interquartile range (IQR).

IQR = 75th quantile - 25th quantile

X_scaled = (X - X.median) / IQR
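All three scalers are one-liners in scikit-learn; the single-column array below is a hypothetical example that includes an outlier (100) to show why the Robust Scaler exists:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

mm = MinMaxScaler().fit_transform(X)    # squeezed into [0, 1]
ss = StandardScaler().fit_transform(X)  # mean 0, std 1
rs = RobustScaler().fit_transform(X)    # centred on median, scaled by IQR
```

Notice the effect of the outlier: under MinMax scaling, the four normal values are crushed near 0, whereas the Robust Scaler (using the median and IQR) keeps them nicely spread out.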

**Click here to see how to apply scaling**

Some machine learning algorithms, such as linear and logistic regression, assume that the features are normally distributed.

If the data is not normally distributed, try one of the transformations below:

- logarithmic transformation

- reciprocal transformation

- square root transformation

- exponential transformation (more general, you can use any exponent)

- Box-Cox transformation
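Here is a quick sketch on right-skewed synthetic data (lognormal, so it must be positive, which log, reciprocal, and Box-Cox all require anyway):

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed positive data
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

log_t = np.log(x)                 # logarithmic transformation
recip_t = 1.0 / x                 # reciprocal transformation
sqrt_t = np.sqrt(x)               # square root transformation
boxcox_t, lam = stats.boxcox(x)   # Box-Cox estimates the best exponent itself
```

For lognormal data, the log transform removes the skew almost entirely, and Box-Cox confirms this by estimating a lambda near 0 (lambda = 0 corresponds to the log transform).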

**Refer to this for all the above transformation technique implementation**

**Which Models Require Scaling of the Data?**

1) Linear Regression --> Required

2) Logistic Regression --> Required

3) Decision Tree --> Not Required

4) Random Forest --> Not Required

5) XGBoost --> Not Required

6) KNN --> Required

7) K-Means --> Required

8) ANN --> Required

**i.e. distance-based models & models which use the concept of Gradient Descent require scaling.**

**--> fit_transform is applied only on the training dataset; on the testing dataset only transform is used. This is done to avoid data leakage.**
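The fit-on-train-only pattern looks like this with scikit-learn (the data is a trivial hypothetical array; any scaler follows the same pattern):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
```

Calling fit (or fit_transform) on the test set would let test statistics leak into the preprocessing, which is exactly the data leakage the rule above guards against.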

**Click here for complete end-to-end processing**

**Let's Discuss Some Automated EDA Libraries**

There are several automated EDA libraries.

-->Some of them are

1) DTale

2) Pandas Profiling

3) Sweetviz

4) AutoViz

5) DataPrep

6) Pandas Visual Analysis

**Click here to see how to apply the Automated EDA Library**

