jacopoteneggi committed on
Commit
468f744
1 Parent(s): 0da5552

Include How does it work tab

app_lib/about.py ADDED
@@ -0,0 +1,195 @@
+ import streamlit as st
+
+
+ def about():
+ _, centercol, _ = st.columns([1, 3, 1])
+ with centercol:
+ st.markdown(
+ """
+ ## Testing Semantic Importance via Betting
+
+ We briefly present here the main ideas and contributions.
+ """
+ )
+
+ st.markdown("""### 1. Setup""")
+ st.image(
+ "./assets/about/setup.jpg",
+ caption="Figure 1: Pictorial representation of the setup.",
+ use_column_width=True,
+ )
+
+ st.markdown(
+ """
+ We consider classification problems with:
+
+ * **Input image** $X \in \mathcal{X}$.
+ * **Feature encoder** $f:~\mathcal{X} \\to \mathbb{R}^d$ that maps input
+ images to dense embeddings $H = f(X) \in \mathbb{R}^d$.
+ * **Classifier** $g:~\mathbb{R}^d \\to [0,1]^k$ that separates embeddings
+ into one of $k$ classes. We do not assume $g$ has a particular form, and it
+ can be any fixed, potentially nonlinear function.
+ * **Concept bank** $c = [c_1, \dots, c_m] \in \mathbb{R}^{d \\times m}$ such
+ that $c_j \in \mathbb{R}^d$ is the representation of the $j^{\\text{th}}$ concept.
+ We assume that $c$ is user-defined and that $m$ is small ($m \\approx 20$).
+ * **Semantics** $Z = [Z_1, \dots, Z_m] = c^{\\top} H$, where $Z_j \in [-1, 1]$ represents the
+ amount of concept $j$ present in the dense embedding of input image $X$.
+
+ For example:
+
+ * $f$ is the image encoder of a vision-language model (e.g., CLIP$^1$, OpenCLIP$^2$).
+ * $g$ is the zero-shot classifier obtained by encoding *``A photo of a <CLASS_NAME>''* with the
+ text encoder of the same vision-language model.
+ * $c$ is obtained similarly by encoding the user-defined concepts.
+ """
+ )
+
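For concreteness, the pieces above fit together in a few lines. The following is an illustrative sketch rather than the app's implementation: random unit vectors stand in for the CLIP image, class-prompt, and concept embeddings, and only the algebra (the semantics $Z = c^{\top} H$ and the zero-shot classifier $g$) follows the setup described in this section.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, m, k = 512, 20, 10  # embedding dimension, number of concepts, number of classes

# Stand-ins for the encoders: in practice h, class_embeddings, and the concept
# bank c would come from a vision-language model such as CLIP or OpenCLIP.
h = F.normalize(torch.randn(d), dim=0)                    # dense embedding H = f(x)
class_embeddings = F.normalize(torch.randn(k, d), dim=1)  # encoded class prompts
c = F.normalize(torch.randn(m, d), dim=1).T               # concept bank, shape (d, m)

# Semantics Z = c^T H: one score in [-1, 1] per concept (unit vectors, so cosine).
z = c.T @ h

# Zero-shot classifier g: softmax over similarities with the class prompts.
probs = (class_embeddings @ h).softmax(dim=0)

print(z.shape, probs.shape)  # torch.Size([20]) torch.Size([10])
```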
+ st.markdown(
+ """
+ ### 2. Defining Semantic Importance
+
+ Our goal is to test the statistical importance of the concepts in $c$ for the
+ predictions of the given classifier on a particular image $x$ (capital letters denote random
+ variables, and lowercase letters their realizations).
+
+ We do not train a surrogate, interpretable model and instead consider the original, potentially
+ nonlinear classifier $g$. This is because we want to study the semantic importance of
+ the model that would be deployed in real-world settings and not a surrogate one that
+ might decrease performance.
+
+ We define importance from the perspective of conditional independence testing because
+ it allows for rigorous statistical testing with false positive rate control
+ (i.e., Type I error control). That is, the probability of falsely deeming a concept
+ important is below a user-defined level $\\alpha \in (0,1)$.
+
+ For an image $x$, a concept $j$, and a subset $S \subseteq [m] \setminus \{j\}$ (i.e., any
+ subset that does not contain $j$), we define the null hypothesis:
+
+ $$
+ H_0:~\hat{Y}_{S \cup \{j\}} \overset{d}{=} \hat{Y}_S,
+ $$
+ where $\overset{d}{=}$ denotes equality in distribution and, $\\forall C \subseteq [m]$,
+ $\hat{Y}_C = g(\widetilde{H}_C)$ with $\widetilde{H}_C \sim P_{H \mid Z_C = z_C}$ drawn from the conditional distribution of the dense
+ embeddings given the observed semantics $z_C$ of $x$.
+ Then, rejecting $H_0$ means that concept $j$ affects the distribution of the response of
+ the model, i.e., that it is important.
+ """
+ )
+
+ st.markdown(
+ """
+ ### 3. Sampling Conditional Embeddings
+ """
+ )
+ st.image(
+ "./assets/about/local_dist.jpg",
+ caption=(
+ "Figure 2: Example test (i.e., with concept) and null (i.e., without"
+ " concept) distributions for a class-specific concept and a"
+ " non-class-specific one on three images from the Imagenette dataset"
+ " as a function of the size of S."
+ ),
+ use_column_width=True,
+ )
+ st.markdown(
+ """
+ To test the null hypothesis $H_0$ defined above, we need to sample from the conditional distribution
+ of the dense embeddings given certain concepts. This can be seen as solving a linear inverse
+ problem stochastically, since $Z = c^{\\top} H$. In this work, given that $m$ is small, we use
+ nonparametric kernel density estimation (KDE) methods to approximate the target distribution.
+
+ Intuitively, given a dataset $\{(h^{(i)}, z^{(i)})\}_{i=1}^n$ of dense embeddings with
+ their semantics, we:
+
+ 1. Use a weighted KDE to sample $\widetilde{Z} \sim P_{Z \mid Z_C = z_C}$, and then
+ 2. Retrieve the embedding $H^{(i')}$ whose concept representation $Z^{(i')}$ is the
+ nearest neighbor of $\widetilde{Z}$ in the dataset.
+
+ Details on the weighted KDE and the sampling procedure are included in the paper. Figure 2
+ shows example test (i.e., $\hat{Y}_{S \cup \{j\}}$) and
+ null (i.e., $\hat{Y}_{S}$) distributions for a class-specific concept and a
+ non-class-specific one on three images from the Imagenette$^3$ dataset. The test
+ distributions of class-specific concepts are skewed to the right, i.e., including the observed
+ class-specific concept increases the output of the predictor. Furthermore, the shift
+ decreases as more concepts are included in $S$: if $S$ is larger and contains more
+ information, the marginal contribution of adding one concept is smaller.
+ On the other hand, including a non-class-specific concept does not change the distribution
+ of the response of the model, no matter the size of $S$.
+ """
+ )
+
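The two-step procedure above can be sketched as follows. This is a minimal illustration with synthetic data, not the paper's exact estimator: the Gaussian kernel weights, the bandwidth, and the choice to keep the conditioned-on coordinates fixed are assumptions made here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 1000, 512, 20
H = rng.standard_normal((n, d))           # dataset of dense embeddings h^(i)
Z = np.tanh(rng.standard_normal((n, m)))  # their semantics z^(i), in [-1, 1]


def sample_conditional_embedding(z_obs, C, bandwidth=0.1):
    """Draw H-tilde ~ P_{H | Z_C = z_C} via a weighted KDE over the semantics."""
    # 1. Weight each data point by how well its semantics match the condition.
    dist = np.linalg.norm(Z[:, C] - z_obs[C], axis=1)
    weights = np.exp(-0.5 * (dist / bandwidth) ** 2)
    weights /= weights.sum()

    # Sample Z-tilde from the weighted KDE: pick a point, add kernel noise,
    # and keep the conditioned-on coordinates fixed at their observed values.
    i = rng.choice(n, p=weights)
    z_tilde = Z[i] + bandwidth * rng.standard_normal(m)
    z_tilde[C] = z_obs[C]

    # 2. Retrieve the embedding whose semantics are the nearest neighbor of Z-tilde.
    nn = np.argmin(np.linalg.norm(Z - z_tilde, axis=1))
    return H[nn]


h_tilde = sample_conditional_embedding(Z[0], C=[0, 3, 7])
print(h_tilde.shape)  # (512,)
```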
+ st.markdown(
+ """
+ ### 4. Testing by Betting
+
+ Instead of classical hypothesis testing techniques based on $p$-values, we propose to
+ test for the importance of concepts by *betting*.$^4$ This choice is motivated by two important
+ properties of sequential tests:
+
+ 1. They are **adaptive** to the hardness of the problem. That is, the easier it is to reject
+ a null hypothesis, the earlier the test will stop. This induces a natural ranking of importance
+ across concepts: if the test for concept $j$ rejects faster than the one for $j'$, then $j$ is more important than $j'$.
+
+ 2. They are **efficient** because they only use as much data as needed to reject, instead of
+ all the available data, as traditional offline tests do.
+
+ Sequential tests instantiate a game between a *bettor* and *nature*. At every turn of the game,
+ the bettor places a wager against the null hypothesis, and nature reveals the truth. If
+ the bettor wins, they accumulate wealth; otherwise, they lose some. More formally, the
+ *wealth process* $\{K_t\}_{t \in \mathbb{N}_0}$ is defined as
+
+ $$
+ K_0 = 1, \\quad K_{t+1} = K_t \cdot (1 + v_t\kappa_t),
+ $$
+ where $v_t \in [-1,1]$ is a betting fraction, and $\kappa_t \in [-1,1]$ is the payoff of the bet.
+ Under certain conditions, the wealth process describes a *fair game*, and for $\\alpha \in (0,1)$,
+ it holds that
+
+ $$
+ \mathbb{P}_{H_0}[\exists t:~K_t \geq 1/\\alpha] \leq \\alpha.
+ $$
+
+ That is, the wealth process can be used to reject the null hypothesis $H_0$ with
+ Type I error control at level $\\alpha$.
+
+ Briefly, we use ideas from sequential kernelized independence testing (SKIT)$^5$ and define
+ the payoff as
+
+ $$
+ \kappa_t \coloneqq \\tanh\left(\\rho_t(\hat{Y}_{S \cup \{j\}}) - \\rho_t(\hat{Y}_S)\\right)
+ $$
+ where
+ $$
+ \\rho_t = \widehat{\\text{MMD}}(\hat{Y}_{S \cup \{j\}}, \hat{Y}_S)
+ $$
+ is the plug-in estimator of the maximum mean discrepancy (MMD)$^6$ between the test and
+ null distributions at time $t$. Furthermore, we use the online Newton step (ONS)$^7$ method
+ to choose the betting fraction $v_t$ and ensure exponential growth of the wealth.
+ """
+ )
+
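A self-contained sketch of such a sequential test is shown below. It is not the repository's implementation: the Gaussian-kernel witness function, the toy Gaussian streams standing in for $\hat{Y}_{S \cup \{j\}}$ and $\hat{Y}_S$, and the particular ONS constants are illustrative assumptions; only the structure (a wealth process with a $\tanh$ payoff, rejection once $K_t \geq 1/\alpha$) mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(0)


def gaussian_kernel(a, b, sigma=0.5):
    return np.exp(-((a - b) ** 2) / (2 * sigma**2))


def witness(y, test_hist, null_hist):
    # MMD-style witness: mean kernel similarity to past test samples minus
    # mean kernel similarity to past null samples (built from past data only).
    return gaussian_kernel(y, np.asarray(test_hist)).mean() - gaussian_kernel(
        y, np.asarray(null_hist)
    ).mean()


def sequential_test(y_test, y_null, alpha=0.05):
    wealth, v, a = 1.0, 0.0, 1.0
    ons_const = 2 / (2 - np.log(3))
    test_hist, null_hist = [y_test[0]], [y_null[0]]
    for t in range(1, len(y_test)):
        # Payoff: tanh of the witness gap evaluated on the new pair of samples.
        kappa = np.tanh(
            witness(y_test[t], test_hist, null_hist)
            - witness(y_null[t], test_hist, null_hist)
        )
        wealth *= 1 + v * kappa
        if wealth >= 1 / alpha:
            return t, wealth  # reject H0: the concept is important
        # Online Newton step update of the betting fraction (one standard variant).
        grad = kappa / (1 + v * kappa)
        a += grad**2
        v = np.clip(v + ons_const * grad / a, -0.5, 0.5)
        test_hist.append(y_test[t])
        null_hist.append(y_null[t])
    return None, wealth  # ran out of data without rejecting


# Toy streams: the test distribution is shifted, so the concept matters.
y_test = rng.normal(0.6, 0.1, size=500)  # stand-in for samples of Y-hat_{S ∪ {j}}
y_null = rng.normal(0.5, 0.1, size=500)  # stand-in for samples of Y-hat_S
print(sequential_test(y_test, y_null))
```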
+ st.markdown(
+ """
+ ---
+
+ **References**
+
+ [1] CLIP is available at https://github.com/openai/CLIP .
+
+ [2] OpenCLIP is available at https://github.com/mlfoundations/open_clip .
+
+ [3] The Imagenette dataset is available at https://github.com/fastai/imagenette .
+
+ [4] Glenn Shafer. Testing by betting: A strategy for statistical and scientific communication.
+ Journal of the Royal Statistical Society Series A: Statistics in Society, 184(2):407-431, 2021.
+
+ [5] Aleksandr Podkopaev et al. Sequential kernelized independence testing. In International
+ Conference on Machine Learning, pages 27957-27993. PMLR, 2023.
+
+ [6] Arthur Gretton et al. A kernel two-sample test. The Journal of Machine Learning Research,
+ 13(1):723-773, 2012.
+
+ [7] Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online
+ learning in Banach spaces. In Conference on Learning Theory, pages 1493-1529. PMLR, 2018.
+ """
+ )
app_lib/demo.py ADDED
@@ -0,0 +1,114 @@
+ import streamlit as st
+ import torch
+
+ from app_lib.test import get_testing_config, load_precomputed_results, test
+ from app_lib.user_input import (
+ get_advanced_settings,
+ get_class_name,
+ get_concepts,
+ get_image,
+ get_model_name,
+ )
+ from app_lib.viz import viz_results
+
+
+ def _disable():
+ st.session_state.disabled = True
+
+
+ def _toggle_sidebar(button):
+ if button:
+ st.session_state.sidebar_state = "expanded"
+ st.experimental_rerun()
+
+
+ def _preload_results(image_name):
+ if image_name != st.session_state.image_name:
+ st.session_state.image_name = image_name
+ st.session_state.tested = False
+
+ if st.session_state.image_name is not None and not st.session_state.tested:
+ st.session_state.results = load_precomputed_results(image_name)
+
+
+ def demo(device=torch.device("cuda" if torch.cuda.is_available() else "cpu")):
+ columns = st.columns([0.40, 0.60])
+
+ with columns[0]:
+ st.header("Choose Image and Concepts")
+
+ image_col, concepts_col = st.columns(2)
+
+ with image_col:
+ image_name, image = get_image()
+ st.image(image, use_column_width=True)
+
+ change_image_button = st.button(
+ "Change Image",
+ use_container_width=False,
+ disabled=st.session_state.disabled,
+ )
+ _toggle_sidebar(change_image_button)
+
+ with concepts_col:
+ model_name = get_model_name()
+ class_name, class_ready, class_error = get_class_name(image_name)
+ concepts, concepts_ready, concepts_error = get_concepts(image_name)
+
+ ready = class_ready and concepts_ready
+
+ error_message = ""
+ if class_error is not None:
+ error_message += f"- {class_error}\n"
+ if concepts_error is not None:
+ error_message += f"- {concepts_error}\n"
+ if error_message:
+ st.error(error_message)
+
+ with st.container():
+ (
+ significance_level,
+ tau_max,
+ r,
+ cardinality,
+ dataset_name,
+ ) = get_advanced_settings(concepts, concepts_ready)
+
+ test_button = st.button(
+ "Test Concepts",
+ use_container_width=True,
+ on_click=_disable,
+ disabled=st.session_state.disabled or not ready,
+ )
+
+ if test_button:
+ st.session_state.results = None
+
+ with columns[1]:
+ viz_results()
+
+ testing_config = get_testing_config(
+ significance_level=significance_level, tau_max=tau_max, r=r
+ )
+
+ with columns[0]:
+ results = test(
+ testing_config,
+ image,
+ class_name,
+ concepts,
+ cardinality,
+ dataset_name,
+ model_name,
+ device=device,
+ )
+
+ st.session_state.tested = True
+ st.session_state.results = results
+ st.session_state.disabled = False
+ st.experimental_rerun()
+ else:
+ _preload_results(image_name)
+
+ with columns[1]:
+ viz_results()
app_lib/main.py CHANGED
@@ -1,114 +1,14 @@
  import streamlit as st
- import torch
 
- from app_lib.test import get_testing_config, load_precomputed_results, test
- from app_lib.user_input import (
- get_advanced_settings,
- get_class_name,
- get_concepts,
- get_image,
- get_model_name,
- )
- from app_lib.viz import viz_results
 
 
- def _disable():
- st.session_state.disabled = True
 
 
- def _toggle_sidebar(button):
- if button:
- st.session_state.sidebar_state = "expanded"
- st.experimental_rerun()
-
-
- def _preload_results(image_name):
- if image_name != st.session_state.image_name:
- st.session_state.image_name = image_name
- st.session_state.tested = False
-
- if st.session_state.image_name is not None and not st.session_state.tested:
- st.session_state.results = load_precomputed_results(image_name)
-
-
- def main(device=torch.device("cuda" if torch.cuda.is_available() else "cpu")):
- columns = st.columns([0.40, 0.60])
-
- with columns[0]:
- st.header("Choose Image and Concepts")
-
- image_col, concepts_col = st.columns(2)
-
- with image_col:
- image_name, image = get_image()
- st.image(image, use_column_width=True)
-
- change_image_button = st.button(
- "Change Image",
- use_container_width=False,
- disabled=st.session_state.disabled,
- )
- _toggle_sidebar(change_image_button)
-
- with concepts_col:
- model_name = get_model_name()
- class_name, class_ready, class_error = get_class_name(image_name)
- concepts, concepts_ready, concepts_error = get_concepts(image_name)
-
- ready = class_ready and concepts_ready
-
- error_message = ""
- if class_error is not None:
- error_message += f"- {class_error}\n"
- if concepts_error is not None:
- error_message += f"- {concepts_error}\n"
- if error_message:
- st.error(error_message)
-
- with st.container():
- (
- significance_level,
- tau_max,
- r,
- cardinality,
- dataset_name,
- ) = get_advanced_settings(concepts, concepts_ready)
-
- test_button = st.button(
- "Test Concepts",
- use_container_width=True,
- on_click=_disable,
- disabled=st.session_state.disabled or not ready,
- )
-
- if test_button:
- st.session_state.results = None
-
- with columns[1]:
- viz_results()
-
- testing_config = get_testing_config(
- significance_level=significance_level, tau_max=tau_max, r=r
- )
-
- with columns[0]:
- results = test(
- testing_config,
- image,
- class_name,
- concepts,
- cardinality,
- dataset_name,
- model_name,
- device=device,
- )
-
- st.session_state.tested = True
- st.session_state.results = results
- st.session_state.disabled = False
- st.experimental_rerun()
- else:
- _preload_results(image_name)
-
- with columns[1]:
- viz_results()
+ from app_lib.about import about
+ from app_lib.demo import demo
+
+
+ def main():
+ demo_tab, about_tab = st.tabs(["Demo", "How Does it Work?"])
+
+ with demo_tab:
+ demo()
+
+ with about_tab:
+ about()
assets/about/local_dist.jpg ADDED
assets/about/setup.jpg ADDED
header.md CHANGED
@@ -1,5 +1,5 @@
  # 🤔 I Bet You Did Not Mean That
 
- Test the effect of different concepts on the predictions of a classifier. Concepts are ranked by their *importance*: how much they change the prediction. [[paper]](https://arxiv.org/pdf/2405.19146) [[code]](https://github.com/Sulam-Group/IBYDMT)
-
+ Test the effect of different concepts on the predictions of a classifier. Concepts are ranked by their *importance*: how much they change the prediction [[paper]](https://arxiv.org/pdf/2405.19146) [[code]](https://github.com/Sulam-Group/IBYDMT).
 
+ by [Jacopo Teneggi](https://jacopoteneggi.github.io) and [Jeremias Sulam](https://sites.google.com/view/jsulam) (Johns Hopkins University).