Optim-wip: Improve docs for Sphinx #983

Open

wants to merge 531 commits into base: optim-wip

Conversation

@ProGamerGov (Contributor) commented Jun 27, 2022

This PR applies the same doc improvements for Sphinx from the other PRs, to the remaining files.

aobo-y and others added 28 commits December 4, 2023 00:48
Summary:
as title

Pull Request resolved: pytorch#1216

Reviewed By: vivekmig

Differential Revision: D51795314

Pulled By: aobo-y

fbshipit-source-id: c7eca5e9c32f4e582052773cf72a87967a30a4eb
Summary:
as title

Pull Request resolved: pytorch#1217

Reviewed By: vivekmig

Differential Revision: D51803181

Pulled By: aobo-y

fbshipit-source-id: 81d9bb4bb5d839ac12b98e2344c87ec59b5f0718
…lize_t… (pytorch#1152)

Summary:
The default argument for `method` in `captum.attr.visualization.visualize_timeseries_attr` is currently `"individual_channels"`, which is not a valid option, resulting in an exception if used. This PR changes the default method to `"overlay_individual"`, which is what the docs indicate the default should be.

Pull Request resolved: pytorch#1152

Reviewed By: aobo-y

Differential Revision: D47197267

Pulled By: vivekmig

fbshipit-source-id: 4d87f792b742fafbb9c30e84247c830e93df1187
Summary:
Adds tests for missing versions of PyTorch to make sure tests cover all supported PyTorch versions.

Pull Request resolved: pytorch#1218

Reviewed By: aobo-y

Differential Revision: D51823341

Pulled By: vivekmig

fbshipit-source-id: 395836ca7683046c99ec2aeaf90c3dd65b1da37b
Summary:
to be merged after everything is ready

Pull Request resolved: pytorch#1219

Reviewed By: vivekmig

Differential Revision: D51808995

Pulled By: aobo-y

fbshipit-source-id: cd4a57a76f1666673352fd669c81ed25fb53571f
Summary:
Pull Request resolved: pytorch#1214

Pull Request resolved: pytorch#1186

# Overview
This diff, along with D42006733, provides two implementations that both calculate the "infinitesimal" influence score as defined in the paper ["Understanding Black-box Predictions via Influence Functions"](https://arxiv.org/pdf/1703.04730.pdf).
- `NaiveInfluenceFunction`: a computationally slow but exact implementation that is useful for obtaining "ground truth" (though note that influence scores themselves are an approximation of the effect of removing an example and then retraining). Several papers actually use this approach, e.g. ["Learning Augmentation Network via Influence Functions"](https://openaccess.thecvf.com/content_CVPR_2020/papers/Lee_Learning_Augmentation_Network_via_Influence_Functions_CVPR_2020_paper.pdf), ["Quantifying and Mitigating the Impact of Label Errors on Model Disparity Metrics"](https://openreview.net/forum?id=RUzSobdYy0V), and ["Achieving Fairness at No Utility Cost via Data Reweighting with Influence"](https://proceedings.mlr.press/v162/li22p/li22p.pdf)
- `ArnoldiInfluenceFunction`: This is a computationally efficient implementation described in the paper ["Scaling Up Influence Functions"](https://arxiv.org/pdf/2112.03052.pdf) by Schioppa et al.  These [slides](https://docs.google.com/presentation/d/1yJ86FkJO1IZn7YzFYpkJUJUBqaLynDJCbCWlKKglv-w/edit#slide=id.p) give a brief summary of it.

This diff is rebased on top of D41324297, which implements the new API.

Again, note that the two implementations above are split across two diffs for easier review, though they are jointly described here.

# What is the "infinitesimal" influence score
More details on the "infinitesimal" influence score: it approximately answers the question, if a given training example were infinitesimally down-weighted and the model re-trained to optimality, how much would the loss on a given test example change? Mathematically, the influence score is given by `\nabla_\theta L(x)' H^{-1} \nabla_\theta L(z)`, where `\nabla_\theta L(x)` is the gradient of the loss, considering only training example `x`, with respect to (a subset of) model parameters `\theta`, `\nabla_\theta L(z)` is the analogous quantity for a test example `z`, and `H` is the Hessian of the loss with respect to the (subset of) model parameters at a given model checkpoint.
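A minimal PyTorch sketch of this quantity, assuming a toy model small enough to form the Hessian explicitly; all names (`example_loss`, `total_loss`, etc.) are illustrative and not part of Captum's API:

```python
import torch

# Toy setup: loss(w) = 0.5 * (w @ a - y)^2 for a single example (a, y).
def example_loss(w, a, y):
    return 0.5 * (w @ a - y) ** 2

torch.manual_seed(0)
d = 5
w = torch.randn(d, requires_grad=True)                 # parameters at a checkpoint
train = [(torch.randn(d), torch.randn(())) for _ in range(20)]
x = train[0]                                           # a training example
z = (torch.randn(d), torch.randn(()))                  # a test example

# Hessian of the training loss with respect to the parameters.
def total_loss(w_):
    return sum(example_loss(w_, a, y) for a, y in train)

H = torch.autograd.functional.hessian(total_loss, w)   # (d, d)

# Per-example loss gradients for x and z.
grad_x = torch.autograd.grad(example_loss(w, *x), w)[0]
grad_z = torch.autograd.grad(example_loss(w, *z), w)[0]

# "Infinitesimal" influence of x on z: grad_x' H^{-1} grad_z.
influence = grad_x @ torch.linalg.solve(H, grad_z)
print(influence.item())
```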

# What the two implementations have in common
Both implementations compute a low-rank approximation of the inverse Hessian, i.e. a tall and skinny matrix `R` (of width k) such that `H^{-1} \approx RR'`, where k is small. In particular, let `L` be the matrix of width k whose columns contain the top-k eigenvectors of `H`, and let `V` be the k-by-k diagonal matrix whose diagonal contains the corresponding eigenvalues. Both implementations let `R = LV^{-1/2}`, so that `RR' = LV^{-1}L' \approx H^{-1}`. Thus, the core computational step is computing the top-k eigenvalues / eigenvectors.
This approximation is useful for several reasons:
- It avoids numerical issues associated with inverting small eigenvalues
- Since the influence score is given by `\nabla_\theta L(x)' H^{-1} \nabla_\theta L(z)`, which is approximated by `(\nabla_\theta L(x)' R) (\nabla_\theta L(z)' R)'`, we can compute an "influence embedding" for a given example `x`, `\nabla_\theta L(x)' R`, such that the influence score of one example on another is approximately the dot-product of their respective embeddings (see the sketch after this list).  Because k is small, e.g. 50, these influence embeddings are low-dimensional.
- Even for large models, we can store `R` in memory, provided k is small. This means influence embeddings (and thus influence scores) can be efficiently computed by doing a backwards pass to compute `\nabla_\theta L(x)` and then multiplying by `R'`. This is orders of magnitude faster than the previous LISSA approach of Koh et al., which, to compute the influence score involving a given example, needs to compute Hessian-vector products involving on the order of 10^4 examples.
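A self-contained sketch of these influence embeddings; random stand-ins replace the real Hessian and per-example gradients, and `R` / `embed` are illustrative names:

```python
import torch

torch.manual_seed(0)
d, k = 5, 3
# Stand-ins for the Hessian and the per-example loss gradients.
A = torch.randn(d, d)
H = A @ A.T                                  # symmetric positive definite
grad_x, grad_z = torch.randn(d), torch.randn(d)

# Top-k eigendecomposition of H (torch.linalg.eigh returns ascending eigenvalues).
eigvals, eigvecs = torch.linalg.eigh(H)
L_top = eigvecs[:, -k:]                      # top-k eigenvectors, (d, k)
V_top = eigvals[-k:]                         # top-k eigenvalues, (k,)

# R = L V^{-1/2}, so that R @ R.T = L V^{-1} L.T, a low-rank approximation of H^{-1}.
R = L_top / V_top.sqrt()

# Influence embedding of an example: its loss gradient projected by R.
def embed(grad):
    return grad @ R                          # (k,)

# The influence score is approximately a dot product of embeddings.
print((embed(grad_x) @ embed(grad_z)).item())
```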

The implementations differ in how they compute the top-k eigenvalues / eigenvectors.

# How `NaiveInfluenceFunction` computes the top-k eigenvalues / eigenvectors
It is "naive" in that it computes the top-k eigenvalues / eigenvectors by explicitly forming the Hessian, converting it to a 2D tensor, computing its eigenvectors / eigenvalues, and then sorting. See documentation of the `_set_projections_naive_influence_function` method for more details.

# How `ArnoldiInfluenceFunction` computes the top-k eigenvalues / eigenvectors
The key novelty of the approach by Schioppa et al. is that it uses the Arnoldi iteration to find the top-k eigenvalues / eigenvectors of the Hessian without explicitly forming the Hessian. In more detail, the approach first runs the Arnoldi iteration, which only requires the ability to compute Hessian-vector products, to find a Krylov subspace of moderate dimension, e.g. 200. It then finds the top-k eigenvalues / eigenvectors of the restriction of the Hessian to the subspace, where k is small, e.g. 50. Finally, it expresses the eigenvectors in the original basis. This approach for finding the top-k eigenvalues / eigenvectors is justified by a property of the Arnoldi iteration: the Krylov subspace it returns tends to contain the top eigenvectors.
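A schematic Arnoldi iteration on an explicit matrix, for intuition only: in Captum, the matrix-vector product below would be a Hessian-vector product, and `_parameter_arnoldi` works with tuples of tensors rather than flat vectors. Names are illustrative.

```python
import torch

def arnoldi(matvec, v0, n_iter):
    """Build an orthonormal basis Q of a Krylov subspace and the restriction
    H_small = Q' A Q of the operator A (accessed only through matvec) to it."""
    d = v0.numel()
    Q = torch.zeros(d, n_iter + 1)
    H_small = torch.zeros(n_iter + 1, n_iter)
    Q[:, 0] = v0 / v0.norm()
    for j in range(n_iter):
        w = matvec(Q[:, j])
        for i in range(j + 1):                     # orthogonalize against earlier vectors
            H_small[i, j] = Q[:, i] @ w
            w = w - H_small[i, j] * Q[:, i]
        H_small[j + 1, j] = w.norm()
        if H_small[j + 1, j] < 1e-10:              # found an invariant subspace early
            return Q[:, : j + 1], H_small[: j + 1, : j + 1]
        Q[:, j + 1] = w / H_small[j + 1, j]
    return Q[:, :n_iter], H_small[:n_iter, :n_iter]

# Toy usage: matvec stands in for a Hessian-vector product.
torch.manual_seed(0)
A = torch.randn(50, 50)
A = A @ A.T                                        # symmetric, like a Hessian
Q, H_small = arnoldi(lambda v: A @ v, torch.randn(50), n_iter=20)

# Eigenvalues of the small restricted matrix approximate A's top eigenvalues.
print(torch.linalg.eigvalsh(H_small).flip(0)[:5])
print(torch.linalg.eigvalsh(A).flip(0)[:5])
```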

This implementation does incur some one-time overhead in `__init__`, where it runs the Arnoldi iteration to calculate `R`. After that overhead, calculating influence scores is quick, requiring only a backwards pass and a multiplication per example.

Unlike `NaiveInfluenceFunction`, this implementation does not flatten any parameters, as the 2D Hessian is never formed, and PyTorch's Hessian-vector implementation (`torch.autograd.functional.hvp`) allows the input and output vectors to be tuples of tensors. Avoiding flattening / unflattening parameters brings scalability gains.

# High-level organization of the two implementations
Because the two implementations share much of their logic, they have the same high-level organization.
- Both implementations accept a `hessian_dataset` initialization argument.  This is because "infinitesimal" influence scores depend on the Hessian, which is, in practice, computed not over the entire training data but over a subset of it, specified by `hessian_dataset`.
- In `__init__`, `NaiveInfluenceFunction` and `ArnoldiInfluenceFunction` both compute `R` using the private helper methods `_set_projections_naive_influence_function` and `_set_projections_arnoldi_influence_function`, respectively.
- `R` is used by their respective `compute_intermediate_quantities` methods to compute influence embeddings.
- Because influence scores (and self-influence scores) are computed by first computing influence embeddings, the `_influence` and `self_influence` methods for both implementations call the `_influence_helper_intermediate_quantities_influence_function` and `_self_influence_helper_intermediate_quantities_influence_function` helper functions, which both assume the implementation implements the `compute_intermediate_quantities` method.

# Reason for inheritance structure
`InfluenceFunctionBase` refers to any implementation that computes the "infinitesimal" influence score (as opposed to `TracInCPBase`, which computes the checkpoint-based definition of influence score).  Thus, the different "base" implementations implement differently-defined influence scores, and children of a base implementation compute the same influence score in different ways.  `IntermediateQuantitiesInfluenceFunction` refers to implementations of `InfluenceFunctionBase` that implement the `compute_intermediate_quantities` method. The reason we don't let `NaiveInfluenceFunction` and `ArnoldiInfluenceFunction` directly inherit from `InfluenceFunctionBase` is that their implementations of `influence` and `self_influence` are actually identical (though for logging reasons, we cannot just move those methods into `IntermediateQuantitiesInfluenceFunction`).  In the future, there may be implementations of `InfluenceFunctionBase` that do *not* inherit from `IntermediateQuantitiesInfluenceFunction`, e.g. the LISSA approach of Koh et al.

# Key helper methods
- `captum._utils._stateless.functional_call` is copy-pasted from the [PyTorch 1.13.0 implementation](https://github.com/pytorch/pytorch/blob/17202b363780a06ae07e5cecceffaae6418ad6f8/torch/nn/utils/stateless.py) so that the user does not need the latest PyTorch version. It turns a PyTorch `module` into a function whose inputs are the parameters of the `module` (represented as a dictionary).  This function is used to compute the Hessian in `NaiveInfluenceFunction`, and Hessian-vector products in `ArnoldiInfluenceFunction`.
- `_compute_dataset_func` is used by `NaiveInfluenceFunction` to compute the Hessian over `hessian_dataset`.  This is done by calculating the Hessian over individual batches, and then summing them up.  One complication is that `torch.autograd.functional.hessian`, which we use to compute Hessians, does not return the Hessian as a 2D tensor unless the function we seek the Hessian of accepts a 1D tensor.  Therefore, we need to define a function of the model's parameters whose input is the parameters, *flattened* into a 1D tensor (and a batch).  This function is given by the factory returned by `naive_influnce_function._flatten_forward_factory` (a schematic version of this pattern is sketched after this list).
- `_parameter_arnoldi` performs the Arnoldi iteration and is used by `ArnoldiInfluenceFunction`.  It differs from a "traditional" implementation in that the Hessian-vector function it accepts does not map from 1D tensor to 1D tensor.  Instead, it maps from a tuple of tensors to a tuple of tensors, because the "vector" in this case represents a parameter setting, which PyTorch represents as a tuple of tensors.  Therefore, all the operations work with tuples of tensors, which required defining various operations for tuples of tensors in `captum.influence._utils.common`.  This method returns a basis for the Krylov subspace, and the restriction of the Hessian to it.
- `_parameter_distill` takes the output of `_parameter_arnoldi` and returns the (approximate) top-k eigenvalues / eigenvectors of the Hessian.  This is what is needed to compute `R`.  It is used by `ArnoldiInfluenceFunction`.
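A schematic version of the flatten-and-call pattern described above, using `torch.func.functional_call` from recent PyTorch releases in place of the vendored `functional_call`; the model, names, and shapes are illustrative, not Captum's actual factory:

```python
import torch
from torch.func import functional_call   # PyTorch >= 2.0; Captum vendors a copy for older versions

# A tiny model and a batch.
model = torch.nn.Linear(4, 1)
X, y = torch.randn(8, 4), torch.randn(8, 1)
loss_fn = torch.nn.MSELoss(reduction="sum")

# Record parameter names / shapes so a flat 1D vector can be unflattened.
names = [n for n, _ in model.named_parameters()]
shapes = [p.shape for _, p in model.named_parameters()]
numels = [p.numel() for _, p in model.named_parameters()]

def flat_loss(flat_params):
    """Loss as a function of a single flattened 1D parameter tensor."""
    chunks = flat_params.split(numels)
    params = {n: c.reshape(s) for n, c, s in zip(names, chunks, shapes)}
    return loss_fn(functional_call(model, params, (X,)), y)

flat = torch.cat([p.detach().reshape(-1) for p in model.parameters()])
H = torch.autograd.functional.hessian(flat_loss, flat)   # a genuine 2D tensor, (P, P)
print(H.shape)
```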

# Tests
We create a new test file `tests.influence._core.test_arnoldi_influence.py`, which defines the class `TestArnoldiInfluence` implementing the following tests:
#### Tests used only by `NaiveInfluenceFunction`, i.e. appear in this diff:
- `test_matches_linear_regression` compares the influence scores and self-influence scores produced by a given implementation with analytically-calculated counterparts for a model where the exact influence scores are known - linear regression.  Different reductions for the loss function ('mean', 'sum', 'none') are tested.  Here, we test the following implementation:
-- `NaiveInfluenceFunction` with `projection_dim=None`, i.e. we use the inverse Hessian, not a low-rank approximation of it.  In this case, the influence scores should equal the analytically calculated ones, modulo numerical issues.
- `test_flatten_unflattener`: a common operation is flattening a tuple of tensors and unflattening it (the inverse operation).  This test checks that flattening and then unflattening a tuple of tensors gives back the original tensors.
- `test_top_eigen`: a common operation is finding the top eigenvectors / eigenvalues of a possibly non-symmetric matrix.  Since `torch.linalg.eig` doesn't sort the eigenvalues, we make a wrapper that does sort them (a minimal version is sketched below).  This checks that the wrapper is working properly.
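A minimal version of such a wrapper, assuming we sort by the real part of the eigenvalues in descending order; the actual helper in Captum may differ:

```python
import torch

def sorted_eig(A):
    """torch.linalg.eig with eigenvalues (and matching eigenvectors) sorted
    in descending order of the eigenvalues' real parts."""
    eigvals, eigvecs = torch.linalg.eig(A)              # unsorted, complex dtype
    order = torch.argsort(eigvals.real, descending=True)
    return eigvals[order], eigvecs[:, order]

vals, vecs = sorted_eig(torch.randn(6, 6))
print(vals.real)
```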
#### Tests used only by `ArnoldiInfluenceFunction`, i.e. appear in next diff:
- `test_parameter_arnoldi` checks that `_parameter_arnoldi` is correct.  In particular, it checks that the top-`k` eigenvalues of the restriction of `A` to a Krylov subspace (the `H` returned by `_parameter_arnoldi`) agree with those of the original matrix. This is a property we expect of the Arnoldi iteration that `_parameter_arnoldi` implements.
- `test_parameter_distill` checks that `_parameter_distill` is correct. In particular, it checks that the eigenvectors corresponding to the top eigenvalues it returns agree with the top eigenvectors of `A`. This is the property we require of `distill`, because we use the top eigenvectors (and eigenvalues) of (implicitly-defined) `A` to calculate a low-rank approximation of its inverse.
- `test_matches_linear_regression` where the implementation tested is the following:
-- `ArnoldiInfluenceFunction` with `arnoldi_dim` and `projection_dim` set to a large value.  The Krylov subspace should contain the largest eigenvectors because `arnoldi_dim` is large, and `projection_dim` is not too large relative to `arnoldi_dim`, but still large on an absolute level.
- When `projection_dim` is small, `ArnoldiInfluenceFunction` and `NaiveInfluenceFunction` should produce the same influence scores, provided `arnoldi_dim` for `ArnoldiInfluenceFunction` is large, since in this case, the top-k eigenvalues / eigenvectors for the two implementations should agree.  This agreement is tested in `test_compare_implementations_trained_NN_model_and_data` and `test_compare_implementations_random_model_and_data` for a trained and untrained 2-layer NN, respectively.

# Minor changes / functionalities / tests
- `test_tracin_intermediate_quantities_aggregate`, `test_tracin_self_influence`, `test_tracin_identity_regression` are applied to both implementations
- `_set_active_params` now extracts the layers to consider when computing gradients and sets their `requires_grad`.  This refactoring is done since the same logic is used by `TracInCPBase` and `InfluenceFunctionBase`.
- some helpers are moved from `tracincp` to `captum.influence._utils.common`
- a separate `test_loss_fn` initialization argument is supported, and both implementations are now tested in `TestTracinRegression.test_tracin_constant_test_loss_fn`
- `compute_intermediate_quantities` for both implementations support the `aggregate` option.  This means that both implementations can be used with D40386079, the validation influence FAIM workflow.
- given the aforementioned tests, testing now generates multiple kinds of models / data.  The ability to do so is added to `get_random_model_and_data`.  The specific model (and its parameters) is specified by the `model_type` argument.  Previously, the method only supported the random 2-layer NN; now it also supports an optimally-trained linear regression, and a 2-layer NN trained with SGD.
- `TracInCP` and implementations of `InfluenceFunctionBase` all accept a `sample_wise_grads_per_batch` option, and have the same requirements on the loss function.  Thus, `_check_loss_fn_tracincp`, which previously performed those checks, is renamed `_check_loss_fn_sample_wise_grads_per_batch` and moved to `captum.influence._utils.common`.  Similarly, those implementations all need to compute the jacobian, with the method depending on `sample_wise_grads_per_batch`.  The jacobian computation is moved to helper function `_compute_jacobian_sample_wise_grads_per_batch`.

Reviewed By: NarineK

Differential Revision: D40541294

fbshipit-source-id: 349efeeba67291baf9ff6538ac145a0da7aa006d
Summary:
Pull Request resolved: pytorch#1187

This diff implements `ArnoldiInfluenceFunction`, which was described, along with `NaiveInfluenceFunction`, in D40541294.  Please see that diff for a detailed description.  Previously, the implementations of both methods were in a single diff.  Now, `ArnoldiInfluenceFunction` is separated out for easier review.

Reviewed By: vivekmig

Differential Revision: D42006733

fbshipit-source-id: 14e82d30d56fb75dcdb5e77db9c93d626430a74f
…ytorch#1224)

Summary:
Default generation in transformers utilizes past_key_values to cache previous key/value states and speed up forward passes for subsequent tokens. This adds a flag, and uses the corresponding helpers from transformers' generation utils, to follow the same caching approach.

Using this flag leads to about a 10x speedup with 10 target tokens, and the improvement seems to scale with the number of target tokens.
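For context, the underlying transformers caching mechanism looks roughly like this (plain Hugging Face usage, illustrative only; not Captum's implementation of the new flag):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids

with torch.no_grad():
    # The first forward pass caches every layer's key/value states.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values

    # Later passes feed only the newest token plus the cache, so per-token
    # cost stays roughly constant instead of growing with sequence length.
    next_token = out.logits[:, -1:].argmax(-1)
    out = model(next_token, past_key_values=past, use_cache=True)
```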

Pull Request resolved: pytorch#1224

Reviewed By: aobo-y

Differential Revision: D52240469

Pulled By: vivekmig

fbshipit-source-id: e643458529091fb5540b0b0a374ceb0c2c25e394
Summary:
as title

The tutorial only demonstrates the perturbation-based algorithms. The gradient-based demo will be added later.

rendered notebook:
https://github.com/aobo-y/captum/blob/llm/tutorials/Llama2_LLM_Attribution.ipynb

Pull Request resolved: pytorch#1228

Reviewed By: vivekmig

Differential Revision: D52476602

Pulled By: aobo-y

fbshipit-source-id: 5565fc41c163cff1ddbf32ac8def52aba38b7d1e
Summary: Fix issues causing lint failures for autodeps

Reviewed By: aobo-y

Differential Revision: D53022779

fbshipit-source-id: 86e617b7e14a0bdb98b1552de71940062b55d094
Reviewed By: azad-meta

Differential Revision: D53401552

fbshipit-source-id: 84da10300de3490ac0fcbcf5394f28c8c33fbd9b
Summary:
Flake8 is currently failing on master due to an error in test_pytext caused by a line exceeding the length limit; this resolves the error.

Pull Request resolved: pytorch#1239

Reviewed By: cyrjano

Differential Revision: D53839830

Pulled By: vivekmig

fbshipit-source-id: c6aecc460198bcc3f1b9f31f3f27ad94d58f8dc9
Summary:
By adding a single conversion from float64 to float32, integrated gradients is fully compatible with the MPS backend, which it wasn't previously. The change can be accepted as valid for any backend, since float64 is a precision almost nobody uses nowadays in deep learning. It seems that the default for torch.tensor has not been updated to reflect current practice.
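A sketch of the kind of conversion involved (the exact tensor and location in the integrated gradients code may differ):

```python
import torch

# float64 is not supported on the MPS backend, so cast step sizes / alphas
# down to float32 before moving them to the device.
alphas = torch.tensor([0.1, 0.3, 0.5, 0.7, 0.9], dtype=torch.float64)
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
alphas = alphas.to(dtype=torch.float32, device=device)
```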

Pull Request resolved: pytorch#1227

Reviewed By: cyrjano

Differential Revision: D54047349

Pulled By: vivekmig

fbshipit-source-id: 1ffda83f065a3f14fa3c5b0229fe0feb5035cc99
Summary: From result of `pyre -l ./pytorch/captum/ infer -i ./pytorch/captum/`, most changes are trivial.

Reviewed By: cyrjano

Differential Revision: D54383416

fbshipit-source-id: 38eff6dbd11dc5ef24682b48f81d138e30a2be5e
Summary:
Formats the covered files with pyfmt.

paintitblack

Reviewed By: aleivag

Differential Revision: D54447730

fbshipit-source-id: 85ed104b2f8f5e26ae0dea9ee17392ecad8b9407
Summary: From result of pyre -l ./pytorch/captum/ infer -i ./pytorch/captum/, most changes are trivial.

Differential Revision: D54411366

fbshipit-source-id: 4729a9466572e9f8a3709dd5c4abf3b075e07af0
Summary: As titled. The param_num can be used to ensure unique test names and avoid pyre lint errors.

Differential Revision: D54647550

fbshipit-source-id: d72f1d9095486d1a7044674510e1ba019932f0a3
Summary:
Add support for layer attribution via permutation by combining the existing `LayerFeatureAblation` and `FeaturePermutation` attribution classes.

See this [doc](https://docs.google.com/document/d/1HwlBYKOEhguA_9OVrndjuBXE5npr7o6rVntFMF8KDtU/edit#heading=h.fuwkwbjpq8z) for design.

Unit tests will be added in a follow-up diff from yucu.

Differential Revision: D54551200

fbshipit-source-id: 419fd5a2ba0129aae9eb90e8af681721483e4ea2
Summary: As titled. There are strong rules on most of the fields in influence algorithms. Make use of Python properties to make the structure cleaner and enforce the rules even when client code tries to change a field.
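The pattern is the standard Python property idiom, roughly as below; the class and field names are hypothetical, not Captum's:

```python
class InfluenceConfig:
    def __init__(self, projection_dim: int) -> None:
        self.projection_dim = projection_dim      # routed through the setter below

    @property
    def projection_dim(self) -> int:
        return self._projection_dim

    @projection_dim.setter
    def projection_dim(self, value: int) -> None:
        # The rule is enforced even if client code mutates the field later.
        if value <= 0:
            raise ValueError("projection_dim must be a positive integer")
        self._projection_dim = value
```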

Reviewed By: yucu

Differential Revision: D54663278

fbshipit-source-id: 273a82e4707c1b0a7afb8637459b4e34bf381d89
Summary:
Currently, OSS GitHub Actions tests are failing due to failing tests, lint, and typing issues. This updates the black version used externally (and the corresponding Python version, to support the latest black) to match the internal updates in D54447730, and also updates flake8 settings to avoid incompatibilities. Typing issues are also resolved, and imports from torch._tensor are removed, since these are not supported for previous torch versions.

Pull Request resolved: pytorch#1241

Reviewed By: cyrjano

Differential Revision: D54901754

Pulled By: vivekmig

fbshipit-source-id: 2b94bf36488b11b6c145175cfe10fc5433b014fe
Summary:
Pull Request resolved: pytorch#1243

The _check_loss_fn() logic is exactly the same for the sample_wise_grads_per_batch None and True cases, so the logic is simplified.

Reviewed By: vivekmig

Differential Revision: D54883319

fbshipit-source-id: d7f945906946f4144f7e0acf51f11721c732d9a6
Summary:
Pull Request resolved: pytorch#1244

As titled.

Reviewed By: vivekmig

Differential Revision: D54878651

fbshipit-source-id: 78c87a6264cf9ee89322289cd7fc83a7f41d59b4
Summary:
Pull Request resolved: pytorch#1245

Currently `FeaturePermutation` and `FeatureAblation` both hit a device-mismatch error in https://fburl.com/code/9mfuidf4 because the `current_mask` is always created on CPU and never moved to the same device as `expanded_input` when CUDA is available.
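The fix is roughly of this shape (a sketch with illustrative names, not the actual diff):

```python
import torch

def apply_mask(expanded_input: torch.Tensor, current_mask: torch.Tensor) -> torch.Tensor:
    # Move the (possibly CPU-created) mask onto the input's device before use;
    # otherwise CUDA inputs hit a device-mismatch error.
    current_mask = current_mask.to(expanded_input.device)
    return expanded_input * (1 - current_mask.to(expanded_input.dtype))
```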

Reviewed By: cyrjano, yucu, vivekmig

Differential Revision: D54969675

fbshipit-source-id: 4e474779edb7b93d345e80e3214777181e64cd81
Summary:
Pull Request resolved: pytorch#1248

As titled. Add comment for the rules for clarity.

Reviewed By: cyrjano, vivekmig

Differential Revision: D55035875

fbshipit-source-id: 303b067df6f62e75cc0ebc9dddacea0a6344a935
…1249)

Summary:
Pull Request resolved: pytorch#1249

As titled.
Isolate the rules to individual methods for better structure and complete verification at run time.

Reviewed By: vivekmig

Differential Revision: D55038334

fbshipit-source-id: 297309c1dddbaacac65d5ffa1a08889e6b6310e6
Summary:
Pull Request resolved: pytorch#1247

as titled.

Reviewed By: vivekmig

Differential Revision: D55035833

fbshipit-source-id: 2f34177111869804163ea9e91340d9a0a2332132
Summary:
Pull Request resolved: pytorch#1250

Create separate TARGETS files for different parts of the Captum project. Start with a relatively simple one: captum/_utils.

Mostly TARGETS file changes, with a few exceptions to correct imports, split test helper functions into a separate file, etc.

Reviewed By: cyrjano

Differential Revision: D55091069

fbshipit-source-id: 83cbd8a632c8ba71d60d14859bbc549f7ae7b511
Summary:
Pull Request resolved: pytorch#1251

As titled. The sample data type is fixed after construction and should be handled separately based on its type.

Reviewed By: vivekmig

Differential Revision: D55153967

fbshipit-source-id: 170eec1773851260ca27fdc5b1f247154276ec7d
sarahtranfb and others added 30 commits March 5, 2025 14:37
Summary:
Pull Request resolved: pytorch#1522

Example failure:
https://www.internalfb.com/intern/testinfra/testconsole/testrun/4785074873255877/
Passed and failed on rev 3c307b1e123f2007d69464836099b12fa4656423, so not due to a code change

Running locally at least, I see:
```
E0305 00:39:14.685009 1133092 socket.cpp:1019] [c10d] The client socket has timed out after 600000ms while trying to connect to (127.0.0.1, 29500).
```

Related thread: https://fb.workplace.com/groups/319878845696681/permalink/1241443370206886/

There's only one process in this test (`world_size=1`), so it should be OK to use `localhost` instead of `127.0.0.1`.

Reviewed By: cyrjano

Differential Revision: D70637306

fbshipit-source-id: 111bc966097dbcccfea57ee152edf4eb39c48179
…d layers. Simple logging for unsupported layers (pytorch#1505)

Summary:
Pull Request resolved: pytorch#1505

We are adding a test for unsupported gradient layers. Open to ideas if there is a better way to structure the test.

A bit uncomfortable with removing pyre type validations as we allow anything to be passed into the GradientUnsupportedLayerOutput class.

Reviewed By: craymichael

Differential Revision: D69792994

fbshipit-source-id: 8b8ca70ef5ee83fb00c613f233747e1c19c15088
…ytorch#1525)

Summary:
Pull Request resolved: pytorch#1525

Shapley Values currently have issues with per task importance, since aggregate mode returns more than 1 output with perturbations per eval = 1, which should apply aggregate mode for collating perturbation results.

Updates logic to appropriately handle multiple outputs (not matching batch size) when perturbations per eval = 1

Reviewed By: MarcioPorto

Differential Revision: D70832826

fbshipit-source-id: 52e1e40d599f662ac522eae4830560cf1338f7e1
…ermutation/ablation (pytorch#1527)

Summary:
Pull Request resolved: pytorch#1527

Study: https://docs.google.com/spreadsheets/d/1GyNJJBrNkazGOyJQLv00QV4phX2R3488oNgVPT17qzU/edit?gid=0#gid=0
Saw a regression in the new logic introduced in D69531512 with one of the models, for both permutation and ablation methods, potentially due to large sparse features. vivekmig suggested we can avoid creating all these zero tensors.

Reviewed By: craymichael

Differential Revision: D71057703

fbshipit-source-id: 3c4acc00b82de3fff7322c4f7cf99ad87fed1d02
…h#1521)

Summary:
Pull Request resolved: pytorch#1521

This is to make sure that we control for the case when the output is not a 2D tensor.

* If the model output is 0D, attribution would fail, since LayerGradientXActivation always [assumes](https://www.internalfb.com/code/fbsource/[ffa152e31f81]/fbcode/pytorch/captum/captum/_utils/gradient.py?lines=681) that the output for a task is 1D (`output[0]` would raise an index error).
  * I propose we raise an assertion error if the output is 0D and ask the user to edit the output or output accessor to ensure the output is more than 0D.

* If the model output is 1D, it could either be of size (batch_size), when there's one target, or (n_targets), when there's only one observation with multiple targets or some kind of aggregated batch loss across multiple targets.
  * When it’s size (batch_size), we can assume there’s just one target and get attributions without passing in a target.
  * When it’s size (n_targets), there will be an issue when we call LayerGradientXActivation since we will need to pass in the target parameter to get attribution for each target.
    * We cannot pass in a target when the output is a 1D tensor. LayerGradientXActivation [checks that the output dimension is 2D](https://www.internalfb.com/code/fbsource/[ffa152e31f81]/fbcode/pytorch/captum/captum/_utils/common.py?lines=700-701)
    * The output needs to be 2D with the shape (1 x n_targets). That needs to be done on the output_accessor or forward function to make sure LayerGradientXActivation can account for it.
  * We could check whether output.shape[0] == inputs.shape[0]. If this is the case, we know that the 1D tensor is for one target. If not, then it's for multiple targets. We could throw an error in the latter case to inform the user that the output needs to be 2D if attributing over multiple targets. I worry that this assumes too much, and the assumption would break if there are multiple targets in the 1D case but batch_size == n_targets. In that case, we would automatically assume that there's only one target when maybe there isn't.
  * I propose that we keep the assumption that 1D tensor is for one target. In the case that the 1D tensor is for multiple targets, it would fail LayerGradientXActivation anyway unless it’s converted to 2D.

We also include an output accessor that parses a dictionary model output to get a 1D tensor for testing.
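An illustrative accessor along these lines; the dictionary key and reshaping rule are hypothetical, shown only to make the proposal above concrete:

```python
import torch
from typing import Dict

def make_output_accessor(batch_size: int):
    """Hypothetical accessor factory: pulls a per-task loss tensor out of a dict
    model output and reshapes it so LayerGradientXActivation can consume it."""
    def accessor(model_output: Dict[str, torch.Tensor]) -> torch.Tensor:
        out = model_output["task_losses"]
        assert out.dim() > 0, "0D outputs are unsupported; return at least a 1D tensor"
        if out.dim() == 1 and out.shape[0] != batch_size:
            # A 1D tensor of per-target values becomes a single row: (1, n_targets).
            out = out.unsqueeze(0)
        return out
    return accessor

accessor = make_output_accessor(batch_size=4)
print(accessor({"task_losses": torch.randn(3)}).shape)   # torch.Size([1, 3])
```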

Reviewed By: vivekmig

Differential Revision: D69876980

fbshipit-source-id: 4c64410c1d0a25f2819f2da31f303f5fe710d3e1
…ough layer. (pytorch#1526)

Summary:
Pull Request resolved: pytorch#1526

We are adding tests for different types of unsupported and non-differentiable layer output. Here, we add a test for layer output that is a tensor of integers.

We split the cases for unsupported layers from the case when the layer output is used by some tasks and not others.

When layer output is not supported (layer output is a List of Tensors or a Tensor of integers), we don't get attributions and return None for those layers.

In the case when a layer output is not used by a task, we should output a tensor of zeros for that task.

Reviewed By: craymichael

Differential Revision: D70919347

fbshipit-source-id: 191d9d69c78bcf00fa3cbbbd5707154e0f221410
Summary:
While packaging this module for Nix (NixOS/nixpkgs#356087), I noticed during the tests that `flask` and `flask-compress` modules were missing from the `setup.py` file.

Pull Request resolved: pytorch#1442

Reviewed By: cyrjano

Differential Revision: D71814035

Pulled By: jjuncho

fbshipit-source-id: 4e1cbc9e2e43dd88657063fade94405094bb4190
…ytorch#1530)

Summary:
Pull Request resolved: pytorch#1530

This was supported in the old path (when constructing ablated inputs over each input tensor individually) to improve compute efficiency by optionally passing in multiple perturbed inputs to the model fwd function.

Reviewed By: craymichael

Differential Revision: D71435704

fbshipit-source-id: 6f80ebc69a7e51614432127e1b9b175353072d60
…h#1531)

Summary:
Pull Request resolved: pytorch#1531

With `enable_cross_tensor_attribution=True` for `FeatureAblation`/`FeaturePermutation`, ids/indices in the masks are now "global" across input tensors.
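For example (illustrative mask tensors, not from the diff), with two input tensors the feature-group ids now span both tensors instead of restarting at 0 for each one:

```python
import torch

# One "global" id space shared by both input tensors.
mask_for_input_a = torch.tensor([[0, 0, 1, 1]])   # groups 0 and 1 live in input A
mask_for_input_b = torch.tensor([[2, 2, 3]])      # groups 2 and 3 live in input B
feature_mask = (mask_for_input_a, mask_for_input_b)
```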

Reviewed By: cyrjano

Differential Revision: D71778355

fbshipit-source-id: 445bf3813faf7e34432f35500bf98c7c0899cb8a
Summary: Looping over the features has moved up into `_construct_ablated_input_across_tensors` after D71435704, so we don't need this anymore

Reviewed By: cyrjano

Differential Revision: D72064893

fbshipit-source-id: 0f3ac2c967a577f4bc3f688893319aac3e51000d
update libraries and testing