2017 A Unified Approach to Interpreting Model Predictions

From GM-RKB

Subject Headings: Predictive Model Interpretation System, SHAP (SHapley Additive exPlanations), Shapley Value.

Notes

Cited By

2018

Quotes

Abstract

Understanding why a model makes a certain prediction can be as crucial as the prediction's accuracy in many applications. However, the highest accuracy for large modern datasets is often achieved by complex models that even experts struggle to interpret, such as ensemble or deep learning models, creating a tension between accuracy and interpretability. In response, various methods have recently been proposed to help users interpret the predictions of complex models, but it is often unclear how these methods are related and when one method is preferable over another. To address this problem, we present a unified framework for interpreting predictions, SHAP (SHapley Additive exPlanations). SHAP assigns each feature an importance value for a particular prediction. Its novel components include: (1) the identification of a new class of additive feature importance measures, and (2) theoretical results showing there is a unique solution in this class with a set of desirable properties. The new class unifies six existing methods, notable because several recent methods in the class lack the proposed desirable properties. Based on insights from this unification, we present new methods that show improved computational performance and/or better consistency with human intuition than previous approaches.

1 Introduction

The ability to correctly interpret a prediction model’s output is extremely important. It engenders appropriate user trust, provides insight into how a model may be improved, and supports understanding of the process being modeled. In some applications, simple models (e.g., linear models) are often preferred for their ease of interpretation, even if they may be less accurate than complex ones. However, the growing availability of big data has increased the benefits of using complex models, thus bringing to the forefront the trade-off between accuracy and interpretability of a model’s output. A wide variety of methods have recently been proposed to address this issue [5, 8, 9, 3, 4, 1], but an understanding of how these methods relate and when one method is preferable to another is still lacking.

Here, we present a novel unified approach to interpreting model predictions.[1] Our approach leads to three potentially surprising results that bring clarity to the growing space of methods:

  1. We introduce the perspective of viewing any explanation of a model’s prediction as a model itself, which we term the explanation model. This lets us define the class of additive feature attribution methods (Section 2), which unifies six current methods.
  2. We then show that game theory results guaranteeing a unique solution apply to the entire class of additive feature attribution methods (Section 3) and propose SHAP values as a unified measure of feature importance that various methods approximate (Section 4).
  3. We propose new SHAP value estimation methods and demonstrate that they are better aligned with human intuition as measured by user studies and more effectually discriminate among model output classes than several existing methods (Section 5).

2 Additive Feature Attribution Methods

The best explanation of a simple model is the model itself; it perfectly represents itself and is easy to understand. For complex models, such as ensemble methods or deep networks, we cannot use the original model as its own best explanation because it is not easy to understand. Instead, we must use a simpler explanation model, which we define as any interpretable approximation of the original model. We show below that six current explanation methods from the literature all use the same explanation model. This previously unappreciated unity has interesting implications, which we describe in later sections.

Let $f$ be the original prediction model to be explained and $g$ the explanation model. Here, we focus on local methods designed to explain a prediction $f(x)$ based on a single input $x$, as proposed in LIME [5]. Explanation models often use simplified inputs $x'$ that map to the original inputs through a mapping function $x = h_x(x')$. Local methods try to ensure $g(z') \approx f(h_x(z'))$ whenever $z' \approx x'$. (Note that $h_x(x') = x$ even though $x'$ may contain less information than $x$, because $h_x$ is specific to the current input $x$.)

Definition 1 Additive feature attribution methods have an explanation model that is a linear function of binary variables:

$g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i$ (1)

where $z' \in \{0, 1\}^M$, $M$ is the number of simplified input features, and $\phi_i \in \mathbb{R}$. Methods with explanation models matching Definition 1 attribute an effect $\phi_i$ to each feature, and summing the effects of all feature attributions approximates the output $f(x)$ of the original model. Many current methods match Definition 1, several of which are discussed below.
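
As a minimal illustration of Equation 1, the Python sketch below evaluates an additive explanation model with made-up attribution values; phi_0, phi, and the three-feature setup are purely hypothetical.

```python
import numpy as np

# Hypothetical attribution values for a 3-feature explanation (Equation 1).
phi_0 = 0.5                        # base value when all simplified features are absent
phi = np.array([0.3, -0.1, 0.2])   # per-feature effects phi_i

def g(z_prime):
    """Additive explanation model: g(z') = phi_0 + sum_i phi_i * z'_i."""
    return phi_0 + np.dot(phi, np.asarray(z_prime))

# With every simplified feature present, the summed attributions approximate
# the original model's output f(x).
print(g([1, 1, 1]))  # approximately 0.9
```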

2.1 LIME

The LIME method interprets individual model predictions based on locally approximating the model around a given prediction [5]. The local linear explanation model that LIME uses adheres to Equation 1 exactly and is thus an additive feature attribution method. LIME refers to simplified inputs $x'$ as “interpretable inputs,” and the mapping $x = h_x(x')$ converts a binary vector of interpretable inputs into the original input space. Different types of $h_x$ mappings are used for different input spaces. For bag-of-words text features, $h_x$ converts a vector of 1’s or 0’s (present or not) into the original word count if the simplified input is one, or zero if the simplified input is zero. For images, $h_x$ treats the image as a set of super pixels; it then maps 1 to leaving the super pixel at its original value and 0 to replacing the super pixel with an average of neighboring pixels (this is meant to represent it being missing).
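
For instance, a bag-of-words mapping $h_x$ of the kind described above could be sketched as follows; the vocabulary size and word counts are hypothetical.

```python
import numpy as np

# Hypothetical bag-of-words input: word counts over a 4-word vocabulary.
x = np.array([2, 0, 1, 5])   # original word counts for the input being explained

def h_x(z_prime):
    """Map a binary vector of 'interpretable inputs' back to the original input
    space: 1 keeps the original word count, 0 zeroes the word out (missing)."""
    return x * np.asarray(z_prime)

print(h_x([1, 1, 0, 1]))  # [2 0 0 5] -- the third word is treated as missing
```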

To find $\phi$, LIME minimizes the following objective function:

$\xi = \arg\min_{g \in G} L(f, g, \pi_{x'}) + \Omega(g)$ (2)

Faithfulness of the explanation model $g(z')$ to the original model $f(h_x(z'))$ is enforced through the loss $L$ over a set of samples in the simplified input space, weighted by the local kernel $\pi_{x'}$. $\Omega$ penalizes the complexity of $g$. Since in LIME $g$ follows Equation 1 and $L$ is a squared loss, Equation 2 can be solved using penalized linear regression.
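
As a rough sketch of how Equation 2 can be solved, the snippet below fits a weighted, L2-penalized linear model to random perturbations of the simplified input. The uniform sampling scheme, exponential kernel, and Ridge penalty are placeholder choices for $L$, $\pi_{x'}$, and $\Omega$, not the reference LIME implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_explain(f, h_x, M, n_samples=1000, kernel_width=0.75, alpha=1.0, seed=0):
    """Sketch of Equation 2: fit a weighted, penalized linear explanation model
    over perturbed simplified inputs z' in {0, 1}^M."""
    rng = np.random.default_rng(seed)
    Z = rng.integers(0, 2, size=(n_samples, M))      # sampled simplified inputs z'
    y = np.array([f(h_x(z)) for z in Z])             # targets f(h_x(z'))
    # Local kernel pi_x': weight samples by closeness to x' (the all-ones vector).
    dist = 1.0 - Z.mean(axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    g = Ridge(alpha=alpha)                           # Omega(g): an L2 complexity penalty
    g.fit(Z, y, sample_weight=weights)
    return g.intercept_, g.coef_                     # phi_0 and phi_1, ..., phi_M
```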

2.2 DeepLIFT

DeepLIFT was recently proposed as a recursive prediction explanation method for deep learning [8, 7]. It attributes to each input $x_i$ a value $C_{\Delta x_i \Delta y}$ that represents the effect of that input being set to a reference value as opposed to its original value. This means that for DeepLIFT, the mapping $x = h_x(x')$ converts binary values into the original inputs, where 1 indicates that an input takes its original value and 0 indicates that it takes the reference value. The reference value, though chosen by the user, represents a typical uninformative background value for the feature. DeepLIFT uses a "summation-to-delta" property that states:

$\sum_{i=1}^{n} C_{\Delta x_i \Delta o} = \Delta o$ (3)

where $o = f(x)$ is the model output, $\Delta o = f(x) - f(r)$, $\Delta x_i = x_i - r_i$, and $r$ is the reference input.
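
For a toy linear model, the DeepLIFT contributions reduce to $w_i (x_i - r_i)$, and the sketch below simply checks that they satisfy the summation-to-delta property of Equation 3; the weights, input, and reference values are made up.

```python
import numpy as np

# Toy linear model f(x) = w . x + b.  For a linear model the DeepLIFT
# contribution of input i reduces to C_i = w_i * (x_i - r_i).
w, b = np.array([0.5, -2.0, 1.0]), 0.1
f = lambda x: w @ x + b

x = np.array([1.0, 0.5, 2.0])   # input being explained
r = np.array([0.0, 0.0, 0.0])   # user-chosen reference (background) values

C = w * (x - r)                 # contributions C_{delta x_i, delta o}
delta_o = f(x) - f(r)           # change in output relative to the reference
assert np.isclose(C.sum(), delta_o)   # summation-to-delta (Equation 3)
```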

4 SHAP (SHapley Additive exPlanation) Values

We propose SHAP values as a unified measure of feature importance. These are the Shapley values of a conditional expectation function of the original model; thus, they are the solution to Equation 8, where $f_x(z') = f(h_x(z')) = E[f(z) \mid z_S]$, and $S$ is the set of non-zero indexes in $z'$ (Figure 1). Based on Sections 2 and 3, SHAP values provide the unique additive feature importance measure that adheres to Properties 1-3 and uses conditional expectations to define simplified inputs. Implicit in this definition of SHAP values is a simplified input mapping, $h_x(z') = z_S$, where $z_S$ has missing values for features not in the set $S$. Since most models cannot handle arbitrary patterns of missing input values, we approximate $f(z_S)$ with $E[f(z) \mid z_S]$. This definition of SHAP values is designed to closely align with the Shapley regression, Shapley sampling, and quantitative input influence feature attributions, while also allowing for connections with LIME, DeepLIFT, and layer-wise relevance propagation.

Figure 1: SHAP (SHapley Additive exPlanation) values attribute to each feature the change in the expected model prediction when conditioning on that feature. They explain how to get from the base value $E[f(z)]$ that would be predicted if we did not know any features to the current output $f(x)$. This diagram shows a single ordering. When the model is non-linear or the input features are not independent, however, the order in which features are added to the expectation matters, and the SHAP values arise from averaging the $\phi_i$ values across all possible orderings.

The exact computation of SHAP values is challenging. However, by combining insights from current additive feature attribution methods, we can approximate them. We describe two model-agnostic approximation methods, one that is already known (Shapley sampling values) and another that is novel (Kernel SHAP). We also describe four model-type-specific approximation methods, two of which are novel (Max SHAP, Deep SHAP). When using these methods, feature independence and model linearity are two optional assumptions simplifying the computation of the expected values (note that $\bar{S}$ is the set of features not in $S$):

$f(h_x(z')) = E[f(z) \mid z_S]$   SHAP explanation model simplified input mapping (9)
$\qquad\quad = E_{z_{\bar{S}} \mid z_S}[f(z)]$   expectation over $z_{\bar{S}} \mid z_S$ (10)
$\qquad\quad \approx E_{z_{\bar{S}}}[f(z)]$   assume feature independence (as in [9, 5, 7, 3]) (11)
$\qquad\quad \approx f([z_S, E[z_{\bar{S}}]])$   assume model linearity (12)
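
The sketch below combines the ordering-average view of Figure 1 with the independence and linearity approximations of Equations 11-12: features outside $S$ are replaced by their background means, and each feature's marginal contribution is averaged over all orderings. It is a brute-force illustration for a handful of features, not one of the estimation methods proposed in the paper; f, x, and background are placeholders supplied by the caller.

```python
import numpy as np
from itertools import permutations

def shap_brute_force(f, x, background):
    """Average each feature's marginal contribution over all orderings (Figure 1),
    approximating E[f(z) | z_S] by plugging in background means for the features
    outside S (Equations 11-12).  Cost grows factorially with the feature count."""
    x = np.asarray(x, dtype=float)
    background = np.asarray(background, dtype=float)
    M = len(x)
    z_mean = background.mean(axis=0)           # E[z], used for "missing" features

    def f_S(S):
        z = z_mean.copy()
        z[list(S)] = x[list(S)]                # features in S take their input values
        return f(z)

    phi = np.zeros(M)
    orderings = list(permutations(range(M)))
    for order in orderings:
        S = []
        for i in order:
            before = f_S(S)
            S.append(i)
            phi[i] += f_S(S) - before          # marginal contribution of feature i
    return phi / len(orderings)                # SHAP values phi_1, ..., phi_M
```

By construction, the returned values sum to f(x) minus the base value obtained by evaluating f at the background means.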

6 Conclusion

The growing tension between the accuracy and interpretability of model predictions has motivated the development of methods that help users interpret predictions. The SHAP framework identifies the class of additive feature importance methods (which includes six previous methods) and shows there is a unique solution in this class that adheres to desirable properties. The thread of unity that SHAP weaves through the literature is an encouraging sign that common principles about model interpretation can inform the development of future methods.

We presented several different estimation methods for SHAP values, along with proofs and experiments showing that these values are desirable. Promising next steps involve developing faster model-type-specific estimation methods that make fewer assumptions, integrating work on estimating interaction effects from game theory, and defining new explanation model classes.

References


Scott M. Lundberg, and Su-In Lee (2017). "A Unified Approach to Interpreting Model Predictions."