Counterfactual in program evaluation
Using rigorous empirical studies, we also show that TraCE outperforms existing baseline methods in terms of several widely adopted evaluation metrics in counterfactual reasoning.
Furthermore, we find that TraCE can effectively detect shortcuts or unintended biases in trained models and infer relationships between different attributes (for example, age and diagnosis state), thus enabling a holistic understanding of deep clinical models. The growing interest in employing machine learning (ML) based solutions to design diagnostic tools and to gain new insights into a host of medical conditions strongly emphasizes the need for a rigorous characterization of ML algorithms.
In conventional statistics, uncertainty quantification (UQ) provides this characterization by studying the impact of different error sources on the prediction [35, 36]. Consequently, several recent efforts have proposed to utilize prediction uncertainties in deep models to shed light onto when and how much to trust the predictions [38, 39]. Some of the most popular uncertainty estimation methods today include: (1) Bayesian neural networks [37, 41]; (2) methods that use the discrepancy between different models as a proxy for uncertainty, such as deep ensembles [42] and Monte Carlo dropout, which approximates Bayesian posteriors on the weight-space of a model [38]; and (3) approaches that use a single model to estimate uncertainties, such as orthonormal certificates [43], deterministic uncertainty quantification [44], distance awareness [45], depth uncertainty [46], direct epistemic uncertainty prediction [47] and accuracy versus uncertainty calibration. It has been reported in several studies that deep predictive models need not be inherently well-calibrated [27].
While uncertainties can be directly leveraged for a variety of downstream tasks including out-of-distribution detection and sequential sample selection, they have also been utilized for guiding models to produce well-calibrated predictions. In practice, these requirements are incorporated as regularization strategies to systematically adjust the predictions during training, most often leading to better performing models.
For example, uncertainties from Monte Carlo dropout [49] and direct error prediction [50] have been used to perform confidence calibration in deep classifiers. Similarly, the recently proposed Learn-by-Calibrating (LbC) approach [28] introduced an interval calibration objective based on uncertainty estimates for training deep regression models. Counterfactual (CF) explanations [20] synthesize small, interpretable changes to a given image while producing desired changes in model predictions to support user-specified hypotheses.
An important requirement for meaningful counterfactuals is that they contain discernible local perturbations, for easy interpretability, while remaining realistic (close to the underlying data manifold). Consequently, existing approaches rely extensively on pre-trained generative models to synthesize plausible counterfactuals [20, 21, 51, 52]. While the proposed TraCE framework also utilizes a pre-trained generative model, it fundamentally differs from existing approaches by employing uncertainty-based calibration for counterfactual optimization.
Our analysis uses chest X-ray (CXR) images available as public benchmark data for the tasks of predicting the diagnostic state and other patient attributes. In particular, our study uses the RSNA pneumonia detection challenge database, which is a collection of 30, CXR exams belonging to the NIH CXR14 benchmark dataset [54], of which 15, exams show evidence for lung opacities related to pneumonia, consolidation and infiltration, and exams contain no findings (referred to as normal).
The CXR images in the dataset were annotated by six board-certified radiologists and additional information on the data curation process can be found in Ref. In addition to the diagnostic labels, this dataset contains age and gender information of the subjects. Note that, for this analysis, we used healthy control subjects from the RSNA pneumonia dataset to define the normal group and designed predictive models to discriminate them from patients presenting pneumonia-related anomalies in their CXR scans.
We refer to the latter as the abnormal group. We used the following metrics for a holistic evaluation of the counterfactual explanations obtained using TraCE and other baseline methods.

Validity: For categorical attributes (as in classification problems), this metric measures the ratio of the counterfactuals that actually have the desired target attribute to the total number of counterfactuals generated (higher is better). In the case of continuous-valued attributes, we measure the mean absolute percentage error (MAPE) between the desired and achieved target values (lower is better).

Sparsity: Since we perform optimization directly in the latent space, measuring the amount of change in the images is a popular metric in the literature. We compute the sparsity metric as the ratio of the number of pixels altered to the total number of pixels. In general, sparser changes to an image are more likely to preserve the inherent characteristics of the query image.

Proximity: Recent works have considered the actionability of modified features by grounding them in the training data distribution.

Realism score: We also employ this metric from the generative modeling literature [57] to evaluate the quality of images obtained using TraCE. Hence, we utilize the realism score introduced in Ref.
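The validity and sparsity definitions above translate directly into a few lines of code. The following is only a minimal sketch, not the authors' evaluation script: the function names, the use of nested lists for images, the `tol` threshold for deciding that a pixel was "altered", and the choice to report MAPE in percent are all assumptions made for illustration.

```python
from typing import Sequence


def validity_ratio(achieved_labels: Sequence[int], target_labels: Sequence[int]) -> float:
    """Fraction of counterfactuals whose achieved label equals the desired target."""
    if len(achieved_labels) != len(target_labels) or len(target_labels) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(a == t for a, t in zip(achieved_labels, target_labels))
    return hits / len(target_labels)


def mape(achieved: Sequence[float], desired: Sequence[float]) -> float:
    """Mean absolute percentage error (in percent) between desired and achieved values."""
    if len(achieved) != len(desired) or len(desired) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return 100.0 * sum(abs((d - a) / d) for a, d in zip(achieved, desired)) / len(desired)


def sparsity(query: Sequence[Sequence[float]],
             counterfactual: Sequence[Sequence[float]],
             tol: float = 0.0) -> float:
    """Ratio of altered pixels to total pixels; `tol` (an assumption) absorbs tiny changes."""
    total = 0
    changed = 0
    for row_q, row_c in zip(query, counterfactual):
        for q, c in zip(row_q, row_c):
            total += 1
            if abs(q - c) > tol:
                changed += 1
    return changed / total
```

In practice a non-zero `tol` is usually needed, because a decoder rarely reproduces unchanged pixels exactly.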
Given the rapid adoption of AI solutions in diagnosis and prognosis, it is critical to gain insights into black-box predictive models. In this study, we analyzed a predictive model that classifies CXR images into normal and abnormal groups, and used TraCE to synthesize counterfactuals for a given query image from the normal class to visualize the progression of disease severity.
Such an analysis can reveal what image signatures are introduced by a predictive model to provide evidence for the abnormal class, and can be used by practitioners to verify whether the model relies on meaningful decision rules or shortcuts.
In our implementation of TraCE, we first constructed a low-dimensional latent space for the dataset of CXR images using a Wasserstein auto-encoder. Figure 2 illustrates the counterfactuals obtained using TraCE for several examples from our benchmark dataset. The hyper-parameter values were obtained using a standard hyper-parameter search based on randomly chosen images. The results show that the counterfactuals exhibit increased opacity in the lung regions (appearing as denser white clouds) as we progress towards the abnormal class, which agrees with existing studies on CXR-based image analysis.
Furthermore, TraCE does not arbitrarily introduce irrelevant features into the image or make anatomical changes, thereby reliably preserving the inherent characteristics of the subject.
By producing physically plausible evidence for crucial hypotheses, TraCE enables practitioners to effectively explore complex decision boundaries learned by deep predictive models.

The first striking observation is that, despite using the same pre-trained latent space for counterfactual optimization, all methods that incorporate explicit calibration strategies or uncertainty estimation consistently outperform the Vanilla model.
More specifically, for similar levels of discrepancy in the latent space, TraCE achieves a significantly higher validity score of 0. Furthermore, our approach outperforms the results obtained with state-of-the-art uncertainty estimators and calibration strategies in all the metrics, thus demonstrating its efficacy in generating counterfactual explanations.
As discussed earlier, TraCE is applicable for predictive models outputting both categorical- and continuous-valued target variables. To demonstrate this, we considered only healthy control subjects from the RSNA dataset and designed a regressor to estimate their age attribute using their CXR images.
Though the age prediction task is not necessarily relevant on its own in clinical diagnosis, as we will show next, such attribute estimators can be utilized for inferring relationships to the diagnosis state.
From Table 2, we notice that the proposed approach achieves lower validity (MAPE) scores, without compromising on the proximity metric, when compared to the other baselines. Interestingly, we find that changing the age attribute required the manipulation of far fewer pixels (low sparsity values) than changing the diagnosis state.

An important challenge with purely data-driven methods is that they risk inferring decision rules based on shortcuts, thereby limiting their utility in practice.
Detecting such shortcuts is essential both to validate model behavior and to detect unintended biases (hospital-specific or device-specific information) in the training data. After training the Wasserstein autoencoder and the LbC model using the altered images, we selected query images from the normal group and generated the corresponding counterfactual evidence for the abnormal group.
This experiment clearly emphasizes the utility of TraCE in detecting model and data biases.

Figure caption: Using TraCE to detect shortcuts in deep predictive models. In this experiment, we synthetically introduced a nuisance feature (the text PNEUMONIA overlaid in the top-left corner) into all images from the abnormal group, and used this data to train the predictive model. Given the entirely data-driven nature of machine-learned solutions, there is a risk of inferring a decision rule based on this irrelevant feature in order to discriminate between normal and abnormal groups. In each case, we show the query image, the counterfactual explanation from TraCE and the absolute difference image between the two; (e, f) here, we introduced the nuisance feature into CXR images from the abnormal group and synthesized counterfactuals for the normal class. We observe that TraCE can effectively detect such shortcuts: counterfactuals for changing the diagnosis state are predominantly based on manipulating the text in the top-left corner of the query images.

Motivated by the effectiveness of TraCE in producing counterfactuals for different types of target attributes, we next explored how counterfactual optimization can be used to study relationships between patient attributes, such as age and gender, and the diagnosis state.
Note, this analysis is based on the assumption that the patient attribute can be directly estimated from the CXR images, and the inferred relationship does not necessarily imply causality. First, we study if the image signatures pertinent to the patient age attribute provide additional evidence for diagnosis state prediction. Note, both predictors were constructed based on the same low-dimensional latent representations.
We then estimated the age-specific and diagnosis-specific signatures introduced by TraCE. To check whether there is an apparent relationship between age and diagnosis state, we generated a hybrid counterfactual.
An overview of this strategy is illustrated in Fig.

Figure caption: Using TraCE to infer relationships between a patient attribute and the diagnosis state. For this analysis, we construct two independent predictive models.

On the other hand, if the attribute is a confounding variable, it becomes critical to retrain the model so that this sensitivity is explicitly discouraged. Interestingly, when we repeated this analysis with the gender attribute, such a relationship was not apparent; see results in Fig.
In this section, we discuss in detail the methodology for performing calibration-driven counterfactual generation. All the methods presented were performed in accordance with the relevant guidelines and regulations. Given a set of samples from an unknown data distribution, our goal is to build a low-dimensional, continuous latent space that respects the true distribution, so that one can generate counterfactual representations in that space. A large class of generative modeling methods exist to construct such a latent space.
In this work, we focus on Wasserstein autoencoders [59] since they have been found to outperform other variational autoencoder formulations, particularly in image datasets with low heterogeneity. This helps us to sample from the prior as well as generate new unseen samples from the original data manifold M_X after training such auto-encoding models. As shown in Fig.

Figure caption: Framework design for TraCE.
Note, we used a combination of maximum mean discrepancy (MMD), mean squared error (MSE) and structural similarity (SSIM) losses to train the network parameters; (b) next, we adapt the Learn-by-Calibrating [62] approach to train a classifier that takes as input the latent representation from the encoder and outputs a patient-specific attribute along with prediction intervals.
With the latent space dimensionality fixed, the encoder model comprised 4 convolutional layers, with the number of filters set to [16, 32, 64, 32], followed by two fully connected layers. All convolutional layers used kernel size (3, 3) and stride 2. The decoder consisted of two fully connected layers followed by 4 transposed convolutional layers with channels [64, 32, 16, 1], respectively.
ReLU non-linear activation was applied after every layer except for the last layer. The three loss functions were assigned the weights [1, 0.

While conventional metrics such as cross entropy for categorical-valued outputs and mean squared error for continuous-valued outputs are commonly used, it has been recently found that interval calibration is effective for obtaining accurate and well-calibrated predictive models. Hence, in TraCE, we adapt the Learn-by-Calibrating approach to train classifier or regression models that map from the CXR latent space to a desired target variable.
Since interval calibration is defined for continuous-valued targets, we adapt the loss function for training on the logits directly. To this end, we first transform the ground truth labels into logits. Note, for each sample, we allow a small non-zero probability say 0. We repeat the two steps Eqs. As shown in Fig. CE modifies the counterfactual generation process in Eq. Our goal is to generate explanations to support a given hypothesis on the target variable, for example, emulating high-confidence disease states given the CXR of a healthy subject.
We propose the following optimization to generate the counterfactual explanations:

The second term ensures that the expected target value is contained in the prediction interval (calibration), while the final term penalizes arbitrarily large intervals to avoid trivial solutions.

We considered a suite of baseline approaches for our empirical study; they differ in the strategies used for training the classifier and for counterfactual optimization. In particular, we investigate approaches that produce explicit uncertainty estimators as well as those that directly build well-calibrated predictors.
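Since every method compared here optimizes a counterfactual objective in the latent space, it may help to see what a calibration-driven objective of the kind just described could look like. The sketch below is an illustration under stated assumptions, not the TraCE equation itself: the first (proximity) term, the hinge form of the containment penalty, the weights, and all names (`counterfactual_objective`, `predict`, `weights`) are hypothetical. Only the roles of the second and final terms come from the text.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical predictor interface: latent vector -> (mean estimate, interval half-width).
Predictor = Callable[[Sequence[float]], Tuple[float, float]]


def counterfactual_objective(z_query: Sequence[float],
                             z_cf: Sequence[float],
                             target: float,
                             predict: Predictor,
                             weights: Tuple[float, float, float] = (1.0, 1.0, 0.1)) -> float:
    """Three-term loss evaluated at a candidate counterfactual latent code z_cf."""
    w_prox, w_cal, w_width = weights

    # Term 1 (assumed): stay close to the query in the latent space.
    proximity = sum((a - b) ** 2 for a, b in zip(z_query, z_cf))

    # Term 2: penalize the target falling outside the prediction interval.
    mean, half_width = predict(z_cf)
    half_width = abs(half_width)
    lower, upper = mean - half_width, mean + half_width
    containment = max(0.0, lower - target) + max(0.0, target - upper)

    # Term 3: penalize wide intervals so "predict everything" is not a trivial solution.
    width = upper - lower

    return w_prox * proximity + w_cal * containment + w_width * width
```

A gradient-based or gradient-free optimizer would then search over `z_cf` to minimize this value, and decode the result back to image space.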
However, note that all methods perform their optimization in the same latent space.

Vanilla: In this approach, we train the classifier with no explicit calibration or uncertainty estimation, and use the following formulation to generate the counterfactuals:

Mixup: This is a popular augmentation strategy [64] that convexly combines random pairs of images and their labels, in order to temper overconfidence in predictions.
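The convex combination at the heart of mixup can be sketched in a few lines. This is a generic illustration rather than the training code used in the study: inputs are flat lists of floats, labels are one-hot lists, and the default `alpha` is a hypothetical value. Drawing the mixing coefficient from a Beta(alpha, alpha) distribution follows the original mixup formulation.

```python
import random
from typing import List, Optional, Sequence, Tuple


def sample_lambda(alpha: float = 0.4, rng: Optional[random.Random] = None) -> float:
    """Draw a mixing coefficient in [0, 1] from Beta(alpha, alpha); alpha is an assumed value."""
    rng = rng or random.Random()
    return rng.betavariate(alpha, alpha)


def mixup_pair(x_a: Sequence[float], y_a: Sequence[float],
               x_b: Sequence[float], y_b: Sequence[float],
               lam: float) -> Tuple[List[float], List[float]]:
    """Convexly combine two (input, one-hot label) pairs with weight lam on the first pair."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix
```

Because the labels are mixed with the same coefficient as the inputs, the classifier is trained towards soft targets between classes, which is what tempers overconfident predictions.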
In mixup, the model is trained not only on the training data, but also using samples in the vicinity of each training sample. Since this approach does not produce any uncertainty estimates, the counterfactual optimization is the same as that of the Vanilla approach in Eq.

MC dropout: In this baseline, we train the classifier with dropout regularization and estimate the epistemic prediction uncertainty for any test sample by running multiple forward passes.
Finally, we use the following heteroscedastic regression objective to implement uncertainty-based calibration during counterfactual optimization:

Deep ensembles: Deep ensembles form an important class of uncertainty estimation methods, wherein the model variance is used as a proxy for uncertainties.
In this approach, we independently train M different models with bootstrapping and different model initializations with the same architecture. Finally, we employ the calibration objective in Eq. While highly accurate and currently one of the best uncertainty estimation techniques, deep ensembles require training multiple models, which can become a computational bottleneck when training deep networks.
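The idea of using model variance as an uncertainty proxy can be made concrete with a toy sketch. It assumes each trained model is available as a plain callable returning a scalar prediction; the helper names and the bootstrap routine are illustrative and are not taken from the paper's code.

```python
import random
from statistics import fmean, pvariance
from typing import Callable, List, Sequence, Tuple

Model = Callable[[float], float]  # hypothetical interface: input -> scalar prediction


def bootstrap_sample(data: Sequence, rng: random.Random) -> List:
    """Draw len(data) items with replacement, i.e. one bootstrap replicate of the training set."""
    return [rng.choice(data) for _ in range(len(data))]


def ensemble_predict(models: Sequence[Model], x: float) -> Tuple[float, float]:
    """Return the ensemble mean and the variance across members (the uncertainty proxy)."""
    predictions = [float(model(x)) for model in models]
    return fmean(predictions), pvariance(predictions)
```

Each of the M members would be trained on its own `bootstrap_sample` with a different initialization; the cost of training M networks is the bottleneck noted above.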
More specifically, we perform T forward passes with dropout in the network and promote the softmax probabilities to be closer to a uniform distribution. Since the model is inherently calibrated during training, we do not measure the uncertainties at test time and hence use the optimization in Eq.

All datasets used in this study were obtained from publicly released databases and pre-processed using open-source tool chains.
We have added appropriate links to obtain the data as well as access the scripts for pre-processing, wherever applicable. The software associated with this paper will be hosted through a public code repository github.

Develop a hypothetical prediction of what would have happened in the absence of the intervention.
For a discussion about counterfactual approaches to causal inference, see The Stanford Encyclopedia of Philosophy entry.

Options

There are three clusters of options for this task:

Experimental options (or research designs): Develop a counterfactual using a control group.

Control Group: a group created through random assignment who do not receive a program, or receive the usual program when a new version is being evaluated. An essential element of the Randomized Controlled Trial approach to impact evaluation.
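To illustrate how a randomly assigned control group stands in for the counterfactual, here is a minimal sketch. The function names, the even 50/50 split, and the simple difference in mean outcomes as the impact estimate are assumptions chosen for clarity, not a prescribed evaluation procedure.

```python
import random
from statistics import fmean
from typing import Dict, Hashable, Iterable, List, Tuple


def random_assignment(unit_ids: Iterable[Hashable],
                      rng: random.Random) -> Tuple[List[Hashable], List[Hashable]]:
    """Shuffle the units and split them into (treatment, control) groups of near-equal size."""
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def estimated_impact(outcomes: Dict[Hashable, float],
                     treatment: List[Hashable],
                     control: List[Hashable]) -> float:
    """Difference in mean outcomes; the control mean plays the role of the counterfactual."""
    treated_mean = fmean([outcomes[u] for u in treatment])
    control_mean = fmean([outcomes[u] for u in control])
    return treated_mean - control_mean
```

Random assignment is what justifies reading the control-group mean as "what would have happened without the program": on average, the two groups differ only in whether they received it.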