Tag: AI

Keras 3.0 Tutorial: End-to-End Deep Learning Project Guide

Peng Qian

Implement an encoder-decoder recurrent network from scratch

Continue reading on Towards Data Science »

May 18, 2024
The Physics Behind Data

Tim Lou, PhD

How physics principles give us deeper insights into our data

Continue reading on Towards Data Science »

May 18, 2024
Decoding Time: Unraveling the Power of LSTM vs. N-BEATS for Accurate Time Series Forecasting

Frida Karvouni

Comparing how two deep learning models fit seasonality

Continue reading on Towards Data Science »

May 18, 2024
Are GPTs Good Embedding Models
Yu-Cheng Tsai
A surprising experiment to show that the devil is in the details

Image by the author using DALL-E

Embeddings Leaderboard

With the growing number of embedding models available, choosing the right one for your machine learning applications can be challenging. Fortunately, the MTEB leaderboard provides a comprehensive range of ranking metrics for various natural language processing tasks.

Top 5 embedding models from the MTEB leaderboard as of May 17th, 2024

When you visit the site, you’ll notice that the top five embedding models are Generative Pre-trained Transformers (GPTs). This might lead you to think that GPT models are the best for embeddings. But is this really true? Let’s conduct an experiment to find out.

GPT Embeddings

Embeddings are tensor representations of text: a model takes the text's token IDs and projects them into a tensor space.

By inputting text into a neural network model and performing a forward pass, you can obtain embedding vectors. However, the actual process is a bit more complex. Let’s break it down step by step:
1. Convert the text into token IDs
2. Pass the token IDs into a neural network
3. Return the outputs of the neural network
In the first step, I use a tokenizer to convert the text into token IDs. model_inputs is the tensor representation of the text content, “some questions”.
```
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
"mistralai/Mistral-7B-Instruct-v0.1"
)

messages = [
        {
            "role": "user",
            "content": "some questions",
        },
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
model_inputs = encodeds.to("cuda")
```
The second step is straightforward: pass model_inputs into the neural network. The logits at each token position can be accessed via .logits.
```
import torch

# `model` is the causal LM loaded with AutoModelForCausalLM.from_pretrained.
with torch.no_grad():
    logits = model(model_inputs).logits
```
The third step is a bit tricky. GPT models are decoder-only, and their token generation is autoregressive. In simple terms, the last token of a completed sentence has seen all the preceding tokens in the sentence. Therefore, the output at the last token position reflects the attention paid to every preceding token.

Bingo! Because of the attention mechanism in transformers, the last token is the one you are most interested in.

The output of the GPTs implemented in Hugging Face has the shape (batch size, sequence length, vocabulary size). To get the last-token output for every item in the batch, I can slice the tensor.
```
import torch

with torch.no_grad():
    # Keep only the last position: shape (batch size, vocabulary size)
    last_token_logits = model(model_inputs).logits[:, -1, :]
```
Quality of these GPT Embeddings

To measure the quality of these GPT embeddings, you can use cosine similarity. The higher the cosine similarity, the closer the semantic meaning of the sentences.
```
import torch
def compute_cosine_similarity(input1, input2):
    cos = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
    return cos(input1, input2)
```
Let’s create some utility functions that allow us to loop through a list of question and answer pairs and see the results. Mistral 7b v0.1 instruct, one of the great open-source models, is used for this experiment.
```
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from termcolor import colored

# Load the model onto the GPU so it sits on the same device as the inputs.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1"
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1"
)


def generate_last_token_embeddings(text):
    """Return the last-token logits for `text`, used here as its embedding."""
    messages = [
        {
            "role": "user",
            "content": text,
        },
    ]
    encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
    model_inputs = encodeds.to("cuda")
    with torch.no_grad():
        return model(model_inputs).logits[:, -1, :]


def get_similarities(q_batch, a_batch):
    """Print the cosine similarity for every question/answer combination."""
    # compute_cosine_similarity is defined in the previous snippet.
    for q in q_batch:
        for a in a_batch:
            q_emb = generate_last_token_embeddings(q)
            a_emb = generate_last_token_embeddings(a)
            similarity = compute_cosine_similarity(q_emb, a_emb)
            print(colored(f"question: {q} and ans: {a}", "green"))
            print(colored(f"result: {similarity}", "blue"))


q = ["Where is the headquarter of OpenAI?",
     "What is GPU?"]
a = ["OpenAI is based at San Francisco.",
     "A graphics processing unit (GPU) is an electronic circuit that can perform mathematical calculations quickly",
     ]

get_similarities(q, a)
```
Cosine similarities for mistral 7b v0.1 instruct (Image by the author)

Results and Observations

For the first question and answer pair:
- Question: “Where is the headquarter of OpenAI?”
- Answer: “OpenAI is based at San Francisco.”
- Cosine Similarity: 0.96
For the second question and answer pair:
- Question: “What is GPU?”
- Answer: “A graphics processing unit (GPU) is an electronic circuit that can perform mathematical calculations quickly.”
- Cosine Similarity: 0.94
For an irrelevant pair:
- Question: “Where is the headquarter of OpenAI?”
- Answer: “A graphics processing unit (GPU) is an electronic circuit that can perform mathematical calculations quickly.”
- Cosine Similarity: 0.90
For the worst pair:
- Question: “What is GPU?”
- Answer: “OpenAI is based at San Francisco.”
- Cosine Similarity: 0.93
All four similarities fall between 0.90 and 0.96, so the relevant pairs are barely separated from the irrelevant ones. These results suggest that using GPT models, in this case Mistral 7b instruct v0.1, as embedding models may not yield great results when it comes to distinguishing between relevant and irrelevant pairs. But why are GPT models still among the top 5 embedding models?

Contrastive Loss Comes to the Rescue
```
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-mistral-7b-instruct")
model = AutoModelForCausalLM.from_pretrained(
  "intfloat/e5-mistral-7b-instruct"
)
```
Cosine similarities for e5-mistral-7b-instruct (Image by the author)

Repeating the same evaluation procedure with a different model, e5-mistral-7b-instruct, which is one of the top open-source models from the MTEB leaderboard and is fine-tuned from mistral 7b instruct, I find that the cosine similarities for the relevant question and answer pairs are 0.88 and 0.84 for the OpenAI and GPU questions, respectively. For the irrelevant question and answer pairs, the similarity drops to 0.56 and 0.67. These findings suggest that e5-mistral-7b-instruct is a much-improved model for embeddings. What drives such an improvement?

The contrastive loss function

Delving into the paper behind e5-mistral-7b-instruct, the key is the use of contrastive loss to further fine-tune the Mistral model.

GPTs are usually trained or further fine-tuned using a cross-entropy loss between predicted tokens and labeled tokens. Contrastive loss instead aims to maximize the distance between negative pairs and minimize the distance between positive pairs.

This blog post covers this concept in greater detail. The sim function calculates the cosine similarity between two vectors. For contrastive loss, the denominator combines the similarity of the positive pair with the similarities of the negative pairs. The rationale behind contrastive loss is that we want the ratio for similar vectors to be as close to 1 as possible, since log(1) = 0 represents the optimal loss.
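
To make the idea concrete, here is a minimal sketch of one common form of contrastive loss (InfoNCE) for a single query. It is an illustration rather than the training code behind e5-mistral-7b-instruct: the function names, the temperature of 0.1, and the toy two-dimensional vectors are all assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(query, positive, negatives, temperature=0.1):
    """-log( e^(sim(q,p)/t) / (e^(sim(q,p)/t) + sum_i e^(sim(q,n_i)/t)) )."""
    pos = math.exp(cosine_similarity(query, positive) / temperature)
    neg = sum(math.exp(cosine_similarity(query, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Hypothetical toy embeddings (assumed values, not model outputs).
query = [1.0, 0.0]
good_positive = [0.9, 0.1]   # points almost the same way as the query
bad_positive = [0.1, 0.9]    # points almost orthogonally to the query
negatives = [[0.0, 1.0], [-1.0, 0.0]]

loss_aligned = info_nce_loss(query, good_positive, negatives)
loss_misaligned = info_nce_loss(query, bad_positive, negatives)
```

When the positive pair is well aligned, the ratio inside the logarithm approaches 1 and the loss approaches 0; when the positive is no closer to the query than the negatives, the loss grows.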

Conclusion

In this post, I have highlighted a common pitfall of using GPTs as embedding models without fine-tuning. My evaluation suggests that fine-tuning GPTs with contrastive loss, the embeddings can be more meaningful and discriminative. By understanding the strengths and limitations of GPT models, and leveraging customized loss like contrastive loss, you can make more informed decisions when selecting and utilizing embedding models for your machine learning projects. I hope this post helps you choose GPTs models wisely for your applications and look forward to hearing your feedback! 🙂

Are GPTs Good Embedding Models was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
May 18, 2024
Please Make this AI Less Accurate

Kate Minogue

Demystifying the term “accuracy” in Data Science and Artificial Intelligence

Continue reading on Towards Data Science »

May 18, 2024
Common Causes of Data Leakage and how to Spot Them

Ivo Bernardo

Let’s learn how to identify and deal with the common causes of data leakage in ML models

Continue reading on Towards Data Science »

May 18, 2024
Mixtral 8x22B is now available in Amazon SageMaker JumpStart

Marco Punio

Today, we are excited to announce the Mixtral-8x22B large language model (LLM), developed by Mistral AI, is available for customers through Amazon SageMaker JumpStart to deploy with one click for running inference. You can try out this model with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms and models so you […]

May 17, 2024
Building Generative AI prompt chaining workflows with human in the loop

Veda Raman

While Generative AI can create highly realistic content, including text, images, and videos, it can also generate outputs that appear plausible but are verifiably incorrect. Incorporating human judgment is crucial, especially in complex and high-risk decision-making scenarios. This involves building a human-in-the-loop process where humans play an active role in decision making alongside the AI system. In this blog post, you will learn about prompt chaining, how to break a complex task into multiple tasks to use prompt chaining with an LLM in a specific order, and how to involve a human to review the response generated by the LLM.

May 17, 2024
Feature Engineering for Machine Learning
Sumit Makashir
Enabling the algorithm to work its magic

Photo by Mourizal Zativa on Unsplash

You must have heard the saying “garbage in, garbage out.” This saying is indeed applicable when training machine learning models. If we train machine learning models using irrelevant data, even the best machine learning algorithms won’t help much. Conversely, using well-engineered meaningful features can achieve superior performance even with a simple machine learning algorithm. So, then, how can we create these meaningful features that will maximize our model’s performance? The answer is feature engineering. Working on feature engineering is especially important when working with traditional machine learning algorithms, such as regressions, decision trees, support vector machines, and others that require numeric inputs. However, creating these numeric inputs is not just about data skills. It’s a process that demands creativity and domain knowledge and has as much art as science.

Broadly speaking, we can divide feature engineering into two components: 1) creating new features and 2) processing these features to make them work optimally with the machine learning algorithm under consideration. In this article, we will discuss these two components of feature engineering for cross-sectional, structured, non-NLP datasets.

New Feature Creation

Raw data gathering can be exhausting, and by the end of this task, we might be too tired to invest more time and energy in creating additional features. But this is where we must resist the temptation of diving straight into model training. I promise you that it will be well worth it! At this juncture, we should pause and ask ourselves, “If I were to make the predictions manually based on my domain knowledge, what features would have helped me do a good job?” Asking this question may open up possibilities for crafting new meaningful features that our model might have missed otherwise. Once we have considered what additional features we could benefit from, we can leverage the techniques below to create new features from the raw data.

1. Aggregation

As the name suggests, this technique helps us combine multiple data points to create a more holistic view. We typically apply aggregations on continuous numeric data using standard functions like count, sum, average, minimum, maximum, percentile, standard deviation, and coefficient of variation. Each function can capture different elements of information, and the best function to use depends on the specific use case. Often, we can apply aggregation over a particular time or event window that is meaningful in the context of that problem.

Let’s take an example where we want to predict whether a given credit card transaction is fraudulent. For this use case, we can undoubtedly use transaction-specific features, but alongside those features, we can also benefit from creating aggregated customer-level features like:
1. Count of times the customer has been a fraud victim in the last five years: A customer who has been a fraud victim several times previously may be more likely to be a fraud victim again. Hence, using this aggregated customer-level view can provide proper prediction signals.
2. Median of last five transaction amounts: Often, when a credit card is compromised, fraudsters may attempt multiple low-value transactions to test the card. Now, a single low-value transaction is very common and may not be a sign of fraud, but if we see many such transactions in short succession, it may indicate a compromised credit card. For a case like this, we can consider creating an aggregated feature that takes into account the last few transaction amounts.
The top chart shows individual transaction amounts. We can see that isolated low-value transactions are not uncommon and do not indicate fraud; however, multiple successive low-value transactions are a sign of fraud. The bottom chart shows a rolling median of the last five transaction amounts, which only returns a low value if there is a pattern of multiple successive low-value transactions. In this case, the bottom aggregated view makes it possible to distinguish between legitimate low-value transactions and fraudulent low-value transactions using transaction amount as a feature.
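
The rolling median described above can be sketched in a few lines. This is a minimal illustration using only the standard library; the function name `rolling_median` and the transaction amounts are made up for the example.

```python
from statistics import median


def rolling_median(amounts, window=5):
    """Median of the current and preceding amounts, up to `window` values."""
    result = []
    for i in range(len(amounts)):
        start = max(0, i - window + 1)
        result.append(median(amounts[start:i + 1]))
    return result


# Hypothetical transaction amounts in time order: one isolated low-value
# transaction (index 2), then a run of low-value ones (index 5 onwards).
amounts = [120.0, 85.0, 2.0, 140.0, 95.0, 1.5, 2.0, 1.0, 2.5, 1.0]
medians = rolling_median(amounts)
```

The isolated low-value transaction barely moves the rolling median, while the run of low-value transactions at the end pulls it down sharply, which is the signal we want the model to see.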

2. Differences and Ratios

In many types of problems, change in a set pattern is a valuable signal for prediction or anomaly detection. Differences and ratios are effective techniques for representing changes in numeric features. Just like aggregation, we can also apply these techniques over a meaningful time window in the context of that problem.

Examples:
1. Difference between the percent of new merchant transactions in the last 1 hour and the percent of new merchant transactions in the last 30 days: A high percentage of new merchant transactions in quick succession might indicate fraud risk by itself, but when we see that this behavior has changed as compared to the historical behavior of the customer, it becomes an even more apparent signal.
2. Ratio of current-day transaction count to last 30-day median daily transaction count: When a credit card is compromised, it will likely have many transactions in a short time window, which may not conform to past credit card usage. A significantly high ratio of the current-day transaction count to the last 30-day median daily transaction count may indicate fraudulent usage patterns.
From the table above, we can see that a high transaction count on a given day may not by itself indicate anomalous transaction behavior. In contrast, a ratio-based feature can facilitate the comparison between the customer’s current transaction behavior and their past transaction behavior, and thus can capture anomalies more effectively.
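
As a rough sketch of the ratio feature, the snippet below compares today's transaction count with the median of past daily counts. The helper `count_ratio`, the seven-day histories (shortened from 30 days for readability), and the handling of a zero baseline are assumptions made for illustration.

```python
from statistics import median


def count_ratio(current_day_count, past_daily_counts):
    """Ratio of today's transaction count to the median of past daily counts."""
    baseline = median(past_daily_counts)
    if baseline == 0:
        return None  # ratio is undefined when the usual daily count is zero
    return current_day_count / baseline


# Hypothetical customers: both made 10 transactions today.
frequent_user_history = [9, 11, 10, 12, 8, 10, 9]   # past daily counts
occasional_user_history = [1, 0, 2, 1, 1, 0, 1]

frequent_ratio = count_ratio(10, frequent_user_history)
occasional_ratio = count_ratio(10, occasional_user_history)
```

The same raw count of 10 yields a ratio of 1 for the frequent user and 10 for the occasional user, so only the second one stands out as a change in behavior.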

3. Age Encoding

We can use the age calculation technique to convert the date or timestamp features to numeric features by taking the difference between two timestamps or dates. We can also use this technique to convert certain non-numeric features into meaningful numeric features if the tenure associated with the feature values can be a valuable signal for prediction.

Examples:
1. Days since the credit card was last used: A sudden transaction on a credit card that has been dormant for a long time may be associated with a high risk of fraud. We can calculate this feature by taking the time difference between the date the credit card was last used and the current transaction date.
2. Days since the customer’s device was first used: If we see a transaction coming from a new device, it is likely to be riskier than a transaction made from a device the customer has used for a longer time. We can create a feature that indicates the age of the device as the difference between the date the customer first used this device and the current transaction date.
The tables above show an example of age encoding. Here, we have created a new numeric feature “Days since transaction device first used” as the difference in days between the customer’s device first use date and the current transaction date.
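
A minimal sketch of age encoding with the standard `datetime` module follows. The helper name `days_between` and the two records are hypothetical.

```python
from datetime import date


def days_between(earlier, later):
    """Whole days elapsed from `earlier` to `later`."""
    return (later - earlier).days


# Hypothetical records: (device first-use date, current transaction date).
transactions = [
    (date(2023, 1, 15), date(2024, 5, 1)),   # long-used device
    (date(2024, 4, 30), date(2024, 5, 1)),   # device first seen yesterday
]
device_age_days = [days_between(first, current) for first, current in transactions]
```

The two dates in each record collapse into a single numeric feature that a model can use directly.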

4. Indicator Encoding

Indicator or Boolean features have binary values {1, 0} or {True, False}. Indicator features are very common and are used to represent various types of binary information. In some cases, we may already have such binary features in numeric form, while in other instances, they may have non-numeric values. To use the non-numeric binary features for model training, all we have to do is map them to numeric values.

Looking beyond these common occurrences and uses of indicator features, we can leverage indicator encoding as a tool to represent a comparison between non-numeric data points. This attribute makes it particularly powerful as it creates a way for us to measure the changes in non-numeric features.

Examples:
1. Failed verification during recent login event: A recent failed login event may be associated with a higher risk of fraudulent transactions. In this case, the raw data may have Yes or No values for this feature; all we have to do here is map these values to 1 or 0.
2. Change in the country location from the last transaction: A change in country location may indicate a compromised credit card. Here, creating an indicator feature representing a change in the non-numeric feature ‘country location’ will capture this country change information.
The tables above show an example of indicator encoding. Here we have created a new numeric feature “Country change from previous transaction” by comparing a customer’s current transaction country location to their previous transaction country location.
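
Both uses of indicator encoding can be sketched as below. The function names are made up for this example, and assigning 0 to the first transaction (which has no predecessor) is an assumption; treating it as missing would be another reasonable choice.

```python
def yes_no_to_indicator(value):
    """Map a 'Yes'/'No' string to 1/0."""
    return 1 if value.strip().lower() == "yes" else 0


def country_change_indicator(countries):
    """1 where the country differs from the previous transaction, else 0.

    The first transaction has no predecessor, so it is assigned 0 here
    (an assumption made for this sketch).
    """
    flags = [0]
    for previous, current in zip(countries, countries[1:]):
        flags.append(1 if current != previous else 0)
    return flags[:len(countries)]


failed_verification = [yes_no_to_indicator(v) for v in ["No", "Yes", "No"]]
country_change = country_change_indicator(["US", "US", "FR", "FR", "US"])
```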

5. One-Hot Encoding

This technique can be applied if our feature data is in categorical form, either numeric or non-numeric. The numeric-categorical form refers to numeric data containing non-continuous or non-measurement data, such as geographical region codes, store IDs, and other such types of data. The one-hot encoding technique can convert such features into a set of indicator features that we can use in training machine learning models. Applying one-hot encoding to a categorical feature will create one new binary feature for every category in that categorical variable. Since the number of new features increases as the number of categories increases, this technique is suitable for features with a low number of categories, especially if we have a smaller dataset. One of the standard rules of thumb suggests applying this technique if we have at least ten records per category.

Examples:
1. Transaction purchase category: Certain types of purchase categories may be associated with a higher risk of fraud. Since the purchase category names are text data, we can apply the one-hot encoding technique to convert this feature into a set of numeric indicator features. If there are ten different purchase category names, one-hot encoding will create ten new indicator features, one for each purchase category name.
2. Device type: An online transaction could be made through several different types of devices, such as an iPhone, Android phone, Windows PC, and Mac. Some of these devices are more susceptible to malware or easily accessible to fraudsters and, therefore, may be associated with a higher risk of fraud. To include device type information in numeric form, we may apply one-hot encoding to the device type, which will create a new indicator feature for each device type.
The tables above show an example of one-hot encoding. Here we have created a set of new numeric indicator features by applying the one-hot encoding technique to the non-numeric categorical feature “Device Type”.
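
A small standard-library sketch of one-hot encoding is shown below. The function `one_hot_encode` is a hypothetical helper, and the column order (sorted category names) is a choice made for this example.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


device_types = ["iPhone", "Android", "Windows PC", "iPhone", "Mac"]
categories, encoded = one_hot_encode(device_types)
```

Each row contains exactly one 1, in the column that matches its device type, and the number of columns equals the number of distinct categories.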

6. Target Encoding

This technique is applied to the same type of features that we would apply the one-hot encoding to but has some advantages and disadvantages over one-hot encoding. When the number of categories is high (high cardinality), using one-hot encoding will undesirably increase the number of features, which may lead to model overfitting. Target encoding can be an effective technique in such cases, provided we are working on a supervised learning problem. It is a technique that maps each category value to the expected value of the target for that category. If working with a regression problem with a continuous target, this calculation maps the category to the mean target value for that category. In the case of a classification problem with a binary target, target encoding will map the category to the positive event probability of that category. Unlike one-hot encoding, this technique has the advantage of not increasing the number of features. A downside of this technique is that it can only be applied to supervised learning problems. Applying this technique may also make the model susceptible to overfitting, particularly if the number of observations in some categories is low.

Examples:
1. Merchant name: Transactions placed against certain merchants could indicate fraudulent activity. There could be thousands of such merchants, each with a different risk of fraudulent transactions. Applying one-hot encoding to a feature containing merchant names may introduce thousands of new features, which is undesirable. In such cases, target encoding can help capture the merchant’s fraud risk information without increasing the number of features.
2. Transaction zip code: Just like merchants, transactions made in different zip codes may represent different fraud risk levels. Although zip codes have numeric values, they are not continuous measurement variables and should not be used in the model as is. Instead, we can incorporate the fraud risk information associated with each zip code by applying a technique like target encoding.
The tables above show an example of target encoding. Here we have created a single new numeric feature “Merchant Name target encoding” by applying the target encoding technique to a non-numeric categorical feature “Merchant Name”. As the name suggests, this technique relies on target values to compute the new feature values.

Once we have created the new features from the raw data, the next step is to process them for optimal model performance. We accomplish this through feature processing, as discussed in the next section.
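
For a binary fraud target, target encoding reduces to the observed fraud rate per category. The sketch below uses a hypothetical helper `target_encode` and made-up data; in practice the category means should be computed on the training data only, often with some smoothing, to limit the overfitting risk mentioned above.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Map each category to the mean target value observed for it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        totals[category] += target
        counts[category] += 1
    means = {category: totals[category] / counts[category] for category in counts}
    return [means[category] for category in categories], means


# Hypothetical merchants and fraud labels (1 = fraud, 0 = legitimate).
merchants = ["A", "B", "A", "C", "B", "A", "C", "A"]
is_fraud = [1, 0, 0, 0, 0, 1, 1, 0]
encoded, merchant_fraud_rate = target_encode(merchants, is_fraud)
```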

Feature Processing

Feature processing refers to a series of data processing steps that ensure that the machine learning models fit the data as intended. While some of these processing steps are required when using certain machine learning algorithms, others ensure that we strike a good working chemistry between the features and the machine learning algorithm under consideration. In this section, let’s discuss some common feature processing steps and why we need them.

1. Outlier Treatment

Several machine learning algorithms, especially parametric ones such as regression models, are severely impacted by outliers. These algorithms attempt to accommodate outliers, which distorts the model parameters and compromises overall performance. To treat the outliers, we must first identify them. We can detect outliers for a specific feature by applying certain rules of thumb, such as flagging values that lie more than three standard deviations from the mean, or values that fall outside the nearest whisker (the nearest quartile value plus or minus 1.5 times the interquartile range). Once we have identified the outliers in a specific feature, we can use some of the techniques below to treat them:
1. Deletion: We can delete the observations with at least one outlier value. However, if our data has too many outlier values across different features, we may lose many observations.
2. Substituting: We can substitute outlier values with averages, such as the mean, median, or mode, of a given feature.
3. Feature transformation or standardization: We can use log transformation or feature standardization (as described in scaling) to reduce the magnitude of the outliers.
4. Capping and flooring: We can replace the outliers beyond a certain value with that value, for example, replacing all values above the 99th percentile with the 99th percentile value and replacing all values below the 1st percentile with the 1st percentile value.
The image above shows the two commonly used techniques for detecting univariate outliers. We can see that the two techniques can yield different sets of outliers. The mean+3 SD technique should be used if the data follows a normal distribution. The boxplot whisker based technique is more generic and can be applied to data with any distribution.

Note that there are techniques to detect observations that are multivariate outliers (outliers with respect to multiple features), but they are more complex and generally do not add much value in terms of machine learning model training. Also note that outliers are generally less of a concern when working with most non-parametric machine learning models like support vector machines and tree-based algorithms like decision trees, random forests, and XGBoost.
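
The two detection rules and the capping and flooring treatment can be sketched with the standard `statistics` module. The function names and the data are assumptions for illustration; note that `statistics.quantiles` uses its default "exclusive" method here, and other quartile conventions give slightly different bounds.

```python
from statistics import mean, stdev, quantiles


def three_sd_bounds(values):
    """Bounds at the mean plus or minus three sample standard deviations."""
    m, s = mean(values), stdev(values)
    return m - 3 * s, m + 3 * s


def whisker_bounds(values):
    """Boxplot whisker bounds: quartiles extended by 1.5 times the IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def cap_and_floor(values, lower, upper):
    """Replace values outside [lower, upper] with the nearest bound."""
    return [min(max(v, lower), upper) for v in values]


amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 250]   # hypothetical data
sd_low, sd_high = three_sd_bounds(amounts)
iqr_low, iqr_high = whisker_bounds(amounts)
capped = cap_and_floor(amounts, iqr_low, iqr_high)
```

In this small sample the value 250 inflates the standard deviation so much that the three-SD rule does not flag it, while the whisker rule does. This is one way the two techniques can disagree.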

2. Missing Values Treatment

Missing data is very common in real-world datasets. Most traditional machine learning algorithms, except a few like XGBoost, don’t allow missing values in training datasets. Thus, fixing missing values is one of the routine tasks in machine learning modeling. There are several techniques to treat missing values; however, before implementing any technique, it is important to understand the cause of the missing data or, at the very least, know if the data is missing at random. If the data is not missing at random, meaning certain subgroups are more likely to have missing data, imputing values for those might be difficult, especially if there is little to no data available. If the data is missing at random, we can use some of the common treatment techniques described below. They all have pros and cons, and it’s up to us to decide what method best suits our use case.
1. Deletion: We can delete the observations with at least one missing feature value. However, if our data has too many missing values across different features, we may end up losing many observations.
2. Dropping: If a feature has a large number of missing values, we can choose to drop it.
3. Substituting with averages: We can use averages like the mean, median, or mode of a given feature to substitute for the missing values. This method is simple to implement, but it may not provide good estimates for all types of observations. For example, a high fraud risk transaction may have a different average transaction amount than a low fraud risk transaction, and using an overall average for a missing high fraud risk transaction amount may not be a good substitution.
4. Maximum likelihood, multiple imputations, K nearest neighbors: These are more complex methods that consider the relationship with other features in the dataset and could provide more accurate estimates than overall averages. However, implementing these methods will require additional modeling or algorithm implementation.
The tables above show the application of commonly used techniques for missing values treatment.
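
The simpler treatments above (deletion and substituting with averages) might look like the following sketch, where missing values are represented by `None`. The helper names and data are hypothetical.

```python
from statistics import mean, median


def impute(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a feature with no observed values")
    fill = median(observed) if strategy == "median" else mean(observed)
    return [fill if v is None else v for v in values]


def drop_incomplete_rows(rows):
    """Deletion: keep only the rows that have no missing value."""
    return [row for row in rows if all(v is not None for v in row)]


amounts = [10.0, None, 30.0, 200.0, None]   # hypothetical feature with gaps
median_filled = impute(amounts, "median")
mean_filled = impute(amounts, "mean")
complete = drop_incomplete_rows([(1, 2.0), (None, 3.0), (4, None), (5, 6.0)])
```

Here the single large amount pulls the mean well above the median, so the two strategies fill the gaps with quite different values.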

3. Scaling

Often, features that we use in machine learning models have different ranges. If we use them without scaling, the features with large absolute values will dominate the prediction outcome. Instead, to give each feature a fair opportunity to contribute to the prediction outcome, we must bring all features on the same scale. The two most common scaling techniques are:
1. Normalization: This scaling technique restricts the feature values between 0 and 1. To apply normalization, we subtract the minimum feature value and divide the result by the range (difference between min and max) of that feature. Normalization may not be a good technique if some of our features have a sharp skew or have a few extreme outliers.
2. Standardization: This technique rescales the feature so that it has a mean of 0 and a standard deviation of 1. We can implement this technique by subtracting the mean and dividing the result by the standard deviation. This technique is generally preferred if the feature has a sharp skew or a few extreme outliers.
Note that tree-based algorithms like decision trees, random forest, XGBoost, and others can work with unscaled data, so scaling is not needed when using these algorithms.

The tables above show the application of the two commonly used feature scaling techniques.

The image above shows the scale difference between the original, normalized and standardized feature values. As we can see, scaling does not affect the shape of the data distribution.
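
Both scaling techniques are one-liners once the summary statistics are known. The sketch below uses made-up feature values; it assumes the feature is not constant, since a zero range or zero standard deviation would cause a division by zero.

```python
from statistics import mean, pstdev


def normalize(values):
    """Min-max scaling: (x - min) / (max - min), giving values in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Z-score scaling: (x - mean) / standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


incomes = [20.0, 40.0, 60.0, 80.0, 100.0]   # hypothetical feature values
normalized = normalize(incomes)
standardized = standardize(incomes)
```

The relative spacing between the values is the same in all three versions; only the scale changes.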

4. Dimensionality Reduction

Today, we have enormous amounts of data, and we can build a vast collection of features to train our models. For most algorithms, having more features is good since it provides more options to improve the model performance. However, this is not true for all algorithms. Algorithms based on distance metrics suffer from the curse of dimensionality: as the number of features increases substantially, the distance between any two observations becomes less meaningful. Thus, to use algorithms that rely on distance metrics, we should ensure that we are not using a large number of features. If our dataset has a large number of features and we don’t know which ones to keep and which to discard, we can use techniques like principal component analysis (PCA). PCA transforms the set of old features into a set of new features. It creates the new features such that the ones with the highest eigenvalues capture most of the information from the old features. We can then keep only the top few new features and discard the remaining ones.

Other statistical techniques, such as association analysis and feature selection algorithms, can be used in supervised learning problems to reduce the number of features. However, they generally do not capture the same level of information that PCA does with the same number of features.

The tables above show the application of PCA for feature reduction. As we can see, the first three features capture over 87% of the information contained in the original dataset. In this case, we can choose to leave out the remaining two features (f4 and f5) for a loss of less than 13% of the information. The number of features to keep and the number to eliminate will vary from problem to problem, depending on various factors.
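To make the PCA steps concrete, here is a minimal NumPy sketch: center the data, take the eigenvalues and eigenvectors of the covariance matrix, sort them in descending order, and keep the top components. The data is synthetic and the function name `pca` is illustrative; this is not the dataset or the code behind the tables above. In practice, features on different scales are usually standardized before this step.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its top principal components.

    Returns the projected data and the share of total variance
    explained by every component (sorted from largest to smallest).
    """
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)

    # Covariance matrix of the features (columns).
    cov = np.cov(X_centered, rowvar=False)

    # eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    explained_ratio = eigenvalues / eigenvalues.sum()
    projected = X_centered @ eigenvectors[:, :n_components]
    return projected, explained_ratio


# Synthetic example: 5 observed features driven by 2 hidden factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

projected, explained_ratio = pca(X, n_components=2)
print("cumulative explained variance:", np.cumsum(explained_ratio))
```

The cumulative explained variance plays the role of the "information captured" figure quoted above: we keep adding components until the cumulative share is high enough for the problem at hand.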

5. Transforming to Normal Distribution

This step is an exception because it applies only to the target and not to the features. Also, most machine learning algorithms don’t place any restrictions on the target’s distribution, but certain ones, like linear regression, work best when the target is distributed normally. Linear regression assumes that the error values are symmetric and concentrated around zero for all the data points (just like the shape of the normal distribution), and a normally distributed target variable helps this assumption hold. We can understand our target’s distribution by plotting a histogram. Statistical tests like the Shapiro-Wilk test check normality formally: the null hypothesis is that the data come from a normal distribution, so a small p-value suggests that the target is not normal. If our target is not normally distributed, we can try out various transformations, such as the log, square, and square root transforms, to check which one brings the target distribution closest to normal. There is also the Box-Cox transformation, a family of power transforms for strictly positive data that is controlled by a single parameter; we can try multiple parameter values and choose the one that best transforms our target’s distribution to normal.

The image above shows three transformations of the original target data. In this specific case, we can see that the log transformation works the best to transform the original data distribution to a normal distribution.
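The comparison of transformations can be sketched as follows. As a simplifying assumption, this example ranks the candidates by sample skewness, which only measures symmetry and is not a normality test; a formal check would use a test such as Shapiro-Wilk (available as `scipy.stats.shapiro`). The target here is synthetic, right-skewed data, not the data shown in the image above.

```python
import numpy as np


def skewness(x):
    """Sample skewness: 0 for a perfectly symmetric distribution."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


# Synthetic, strictly positive, right-skewed target (log-normal by construction).
rng = np.random.default_rng(0)
target = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

candidates = {
    "original": target,
    "log": np.log(target),
    "square root": np.sqrt(target),
    "square": target ** 2,
}

skew_by_transform = {name: skewness(values) for name, values in candidates.items()}

# Pick the transform whose result is closest to symmetric.
best = min(skew_by_transform, key=lambda name: abs(skew_by_transform[name]))

for name, value in skew_by_transform.items():
    print(f"{name:12s} skewness = {value:6.2f}")
print("closest to symmetric:", best)
```

Because the synthetic target is log-normal, the log transform comes out on top here; with real data, the histogram and a normality test should confirm the choice.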

Note: While we can implement the feature processing steps in any order, we must carefully consider the sequence in which we apply them. For example, missing value treatment using mean substitution can be implemented before or after outlier treatment. However, the mean value used for substitution may differ depending on whether we treat the missing values before or after the outliers. The feature processing sequence outlined in this article treats the issues in the order of the impact they can have on the successive processing steps. Thus, following this sequence should generally be effective for addressing most problems.
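A small worked example shows how much the ordering can matter. The values below are made up, and capping outliers with the 1.5 × IQR rule is just one possible outlier treatment chosen for illustration.

```python
import numpy as np

# Hypothetical feature with one missing value and one extreme outlier.
values = np.array([10.0, 12.0, 11.0, np.nan, 13.0, 100.0])

# Order A: fill the missing value first, using the mean of the raw data.
mean_before = np.nanmean(values)
imputed_first = np.where(np.isnan(values), mean_before, values)

# Order B: cap outliers first (1.5 * IQR rule), then fill the missing value.
q1, q3 = np.nanpercentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
capped = np.clip(values, lower, upper)  # NaN stays NaN
mean_after = np.nanmean(capped)
capped_first = np.where(np.isnan(capped), mean_after, capped)

print("mean used when imputing before outlier treatment:", mean_before)
print("mean used when imputing after outlier treatment: ", mean_after)
```

With these values, the substituted mean is 29.2 when the missing value is filled first and 12.4 when the outlier is capped first, since the single extreme value no longer inflates the mean.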

Conclusion

As mentioned in the introduction, feature engineering is a dimension of machine learning that allows us to control the model’s performance to an exceptional degree. To exploit feature engineering to its potential, we learned various techniques in this article to create new features and process them to work optimally with machine learning models. No matter what feature engineering principles and techniques from this article you choose to use, the important message here is to understand that machine learning is not just about asking the algorithm to figure out the patterns. It is about us enabling the algorithm to do its job effectively by providing the kind of data it needs.

Unless otherwise noted, all images are by the author.

Feature Engineering for Machine Learning was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Originally appeared here:
Feature Engineering for Machine Learning

May 17, 2024
Backpropagation Through Time — How RNNs Learn

Egor Howell

An explanation of the backpropagation through time algorithm

Continue reading on Towards Data Science »

Originally appeared here:
Backpropagation Through Time — How RNNs Learn


May 17, 2024

