Integrated Gradients
Integrated Gradients is a model-agnostic method for attributing feature importance to each input feature of a deep neural network. The goal of Integrated Gradients is to compute the contribution of each input feature to the final prediction of the network, by measuring how much the prediction changes when that feature is varied.
​
​
​
​
​
​
​
​
​
Source: Explaining Machine Learning Models
​
Integrated Gradients works by computing the integral of the gradients of the output with respect to the input along a straight-line path from a baseline input to the actual input. The baseline input is a reference input that is used to determine the starting point of the integration path. The idea is to vary each input feature from its baseline value to its actual value, while measuring the change in the output prediction. This change is then multiplied by the local gradient of the output with respect to the input at each point along the integration path, yielding a set of feature attribution scores that indicate the contribution of each input feature to the output prediction.
​
​
​
​
​
​
​
​
​
Source: Understanding Deep Learning Models with Integrated Gradients
​
The key advantage of Integrated Gradients is that it provides a principled way to attribute feature importance to each input feature of a deep neural network, based on the magnitude and direction of the feature's effect on the output prediction. Integrated Gradients has been used in a variety of applications, including image classification, natural language processing, and bioinformatics, and has been shown to be effective in identifying important input features and detecting potential sources of bias or error in deep neural networks.