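FaF Rebuttal Task List
How to use
Check off each item directly in Obsidian. P0 items must be completed first for the rebuttal, P1 items are additions that can noticeably strengthen it, and P2 items are clarity fixes to the manuscript.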
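P0 Overall storyline
- Confirm the rebuttal word limit, the deadline, and whether additional experimental results are allowed.
- Write a general response: thank the reviewers, acknowledge that AFB builds on BRA, and stress that the contribution lies in the problem diagnosis for face forgery detection, the task adaptation, the combined Focus and Forget design, and the systematic validation.
- Set the main line of the rebuttal: the method is not a simple stack of modules; each part targets a separate problem, namely redundant global interactions and shortcut-driven learning.
- Prepare a paragraph explaining why sparse routing suits forgery detection: forgery cues are usually local and subtle, and dense global attention tends to mix in non-causal information such as background, hair, identity, and clothing.
- Prepare a paragraph explaining why the small average AUC gain is still meaningful: it is consistent across multiple unseen datasets while keeping a low parameter count and relatively high throughput.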
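P0 Must-answer questions
- Novelty concern: state that the individual mechanisms in AFB/SFB are not proposed entirely from scratch, and that the contribution is the forensic-specific adapter design and the validation of the two complementary mechanisms.
- Small gain over Forensics Adapter: list the specific frame-level improvements on CDF-v1, CDF-v2, DFD, and Avg., and note that DFDC is roughly on par while the overall results are more stable.
- Small video-level gain: stress that the target is frame-based lightweight generalization; video-level results are obtained only by aggregation, with no additional temporal model.
- Missing quantitative validation of the SFB assumption: add, or commit to adding, a statistical analysis relating high-activation regions to shortcut/non-causal regions.
- Handcrafted curriculum schedule: point to the existing ablation comparing fixed soft, fixed hard, light mask, and heavy mask, which shows that medium-strength progressive masking is more stable.
- Forensics Adapter missing from DF40: prioritize running Forensics Adapter on the six DF40 forgery types; if there is not enough time, explicitly commit to adding it in the camera-ready version.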
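One-page rebuttal: compact full draft
Principle: explain mechanisms less and cite the newly added experiments more; do not write long per-reviewer answers, and merge responses by shared concern.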
English compact draft
We thank all reviewers for recognizing the clear motivation, intuitive Focus-and-Forget design, strong cross-dataset results, and efficiency. We also appreciate the concerns about novelty, comparison with Forensics Adapter, the SFB assumption, and several presentation details.
First, our design was driven by the observation that local inductive bias is important for deepfake detection. We initially explored pure convolutional adapters for CLIP, which already brought moderate improvements, and then adopted ParC-style position-aware circular convolution to inject local and positional inductive bias into the adapter. Building on these findings, and inspired by BiFormer's content-aware sparse routing, we developed the current adapter with task-specific implementation adjustments for deepfake detection. Thus, while the routing idea comes from BiFormer, our contribution is the overall Focus-and-Forget framework, which combines adaptive artifact focusing with shortcut forgetting to address redundant global interactions and shortcut-driven learning in CLIP-based forensic adaptation.
For the concern about limited improvement over Forensics Adapter, we have added the missing comparison on the more recent and challenging DF40 benchmark, which was omitted from the original submission due to time constraints. On six representative face-swapping types, our method achieves 91.5 Avg. AUC versus 82.63 for Forensics Adapter (+8.87), showing a clear advantage under more diverse manipulations. We acknowledge that the average gain over this strong baseline is modest in the original cross-dataset table; when performance is already in a high-AUC regime, further improvements are harder to obtain. Compared with our initial adapter baseline, the full model still brings a clear improvement.
For the sparse-routing design, we added two ablations. First, the region partition study shows that $S=4$ obtains the best Avg. AUC of 0.894, higher than both $S=2$ and $S=8$ (0.887), since overly coarse partitions still mix irrelevant context while overly fine partitions weaken useful cross-region dependency. Second, we compared different local/sparse attention variants: Local Attention obtains 0.887 Avg. AUC, Swin Shifted Window obtains 0.884, while our focusing design reaches 0.894. This suggests that fixed local/window attention is less suitable because it lacks content-adaptive region selection.
For the SFB assumption, we added masking comparisons. High-activation masking achieves 0.904 average AUC, clearly outperforming random activation masking (0.878) and low-activation masking (0.874). This supports that suppressing dominant high-response regions is more effective than generic masking.
We will also revise the manuscript to clarify Grad-CAM sources, manipulation categories in visualizations, the implementation mapping, video aggregation, and the frame-based limitation.
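P1 Additional experiments by priority
- Run Forensics Adapter on DF40, addressing KnET's insufficient-comparison concern.
- Add a quantitative analysis of the SFB high-activation mask, addressing XSa4's core concern.
- Add a region partition size ablation, e.g. S=2, S=4, S=8, addressing aCJ9.
- Add comparisons with alternative sparse/local attention, e.g. window attention, shifted window attention, deformable attention, or simple local attention, addressing GRmT.
- Summarize the reasons for the performance drop under block-wise corruption or structured perturbation in the robustness study, addressing aCJ9.
- If time permits, report the mean and variance over 3 seeds to address the statistical stability of the small gain.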
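For the multi-seed item, one simple way to report stability is the mean and sample standard deviation of the metric over runs. The sketch below is only an illustration: the function names are hypothetical, and the numbers in the usage note are dummy values, not measured results.

```python
from statistics import mean, stdev


def summarize_runs(metric_per_seed):
    """Return (mean, sample standard deviation) of a metric across seeds."""
    if len(metric_per_seed) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return mean(metric_per_seed), stdev(metric_per_seed)


def format_mean_std(metric_per_seed, digits=3):
    """Format a list of per-seed metrics as 'mean ± std' for a results table."""
    m, s = summarize_runs(metric_per_seed)
    return f"{m:.{digits}f} ± {s:.{digits}f}"
```

With dummy inputs, `format_mean_std([0.5, 0.6, 0.7])` gives `0.600 ± 0.100`. Reporting the sample standard deviation (n-1 denominator) is a choice made here because only three runs are planned.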
P1 Reviewer-by-Reviewer
GRmT
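- Thank the reviewer for recognizing the problem formulation, curriculum masking, and consistent improvements.
- Incremental over Forensics Adapter: stress that the work does not merely replace the attention, but introduces a dual mechanism of spatial focusing plus shortcut forgetting.
- Link between sparse attention and forgery: add task characteristics such as localized artifacts, blending boundaries, and local texture inconsistency.
- Comparison with Swin, deformable attention, etc.: add experiments, or state that a more systematic sparse/local attention comparison will be included. Status: experiment added.
GRmT compact response draft
Chinese Response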
感谢审稿人的认可和建议。我们同意 BRA 本身来自 BiFormer;本文的贡献并不是提出一个新的 attention operator,而是针对 CLIP-based forgery adapter 的两个具体问题,即冗余全局交互和 shortcut-driven learning,设计 Focus-and-Forget 适配框架。
为回应 sparse/local attention 的问题,我们补充了新的消融实验。Region partition 消融显示,$S=4$ 取得最佳 Avg. AUC 0.894,高于 $S=2$ 和 $S=8$ 的 0.887。不同注意力机制对比中,Local Attention 为 0.887,Swin Shifted Window 为 0.884,而 Ours (Focus) 达到 0.894。说明有效性不是简单来自局部注意力,而是来自合适粒度下的内容自适应稀疏路由。
English Response
We thank the reviewer for the positive comments and helpful suggestions. We agree that BRA was introduced in BiFormer; our contribution is not a new attention operator, but a task-oriented Focus-and-Forget adapter that addresses two failure modes of CLIP-based forgery adapters: redundant global interactions and shortcut-driven learning.
To support the role of sparse routing, we added new ablations. The region partition study shows that $S=4$ achieves the best Avg. AUC of 0.894, higher than both $S=2$ and $S=8$ (0.887). We also compared alternative local/sparse attention variants: Local Attention obtains 0.887 Avg. AUC and Swin Shifted Window obtains 0.884, while Ours (Focus) reaches 0.894. This suggests that the gain is not from simply using local attention, but from content-aware sparse routing with an appropriate granularity.
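Working note on the routing argument: the difference between fixed local windows and content-aware routing can be illustrated with a small region-level top-k sketch. This is a simplified, assumption-laden illustration and not the paper's implementation: region descriptors are plain averages of token vectors, the same descriptor is used as both query and key (BiFormer uses separate projections), and the names `region_means` and `route_top_k` are hypothetical.

```python
def region_means(feature_map, S):
    """Average an H x W grid of token vectors into S x S region descriptors."""
    H, W = len(feature_map), len(feature_map[0])
    if H % S != 0 or W % S != 0:
        raise ValueError("H and W must be divisible by S")
    rh, rw = H // S, W // S
    dim = len(feature_map[0][0])
    regions = []
    for ri in range(S):
        for rj in range(S):
            acc = [0.0] * dim
            for i in range(ri * rh, (ri + 1) * rh):
                for j in range(rj * rw, (rj + 1) * rw):
                    for d in range(dim):
                        acc[d] += feature_map[i][j][d]
            regions.append([a / (rh * rw) for a in acc])
    return regions


def route_top_k(region_desc, k):
    """For each region, return the indices of the k regions with the highest
    dot-product affinity (assumption: descriptors serve as both query and key)."""
    routes = []
    for q in region_desc:
        scores = [sum(a * b for a, b in zip(q, key)) for key in region_desc]
        order = sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)
        routes.append(order[:k])
    return routes
```

Token-level attention would then be restricted to the keys and values of the routed regions. A larger `S` gives finer regions and fewer tokens per region, which is the granularity trade-off the partition ablation probes.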
M9p1
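- Thank the reviewer for recognizing the Focus and Forget concept, the cross-dataset results, and the 6.97M parameter efficiency.
- Frame-based limitation: acknowledge that temporal inconsistency is not modeled, and explain that the goal is lightweight frame-level generalization, which is complementary to temporal modules.
- Handcrafted schedule: use the Table 6 ablation to show that the schedule is not fragile tuning, but the result of a systematic comparison of mask strength and training stage.
M9p1 compact response draft
Chinese Response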
感谢审稿人对 Focus-and-Forget 思路、跨数据集结果和参数效率的认可。关于 frame-based 限制,我们同意本文没有显式建模 temporal inconsistency,并会在修订稿中更清楚地说明这一点。我们的目标是提升轻量 frame-level CLIP adapter 的跨数据集泛化;即便不使用时序模块,本文仍以 6.97M 可训练参数取得了 91.6 video-level Avg. AUC。该框架也可以与 temporal modeling 互补。
关于 curriculum schedule 可能依赖手工调参的问题,我们补充强调现有消融结果:相比 fixed soft、fixed hard、light mask 和 heavy mask,本文的 progressive high-activation masking 取得最佳 Avg. AUC 0.904。这说明性能并非来自任意 masking 或偶然超参,而是来自逐步增强的 shortcut suppression。
English Response
We thank the reviewer for recognizing the Focus-and-Forget idea, cross-dataset performance, and parameter efficiency. We agree that our method is frame-based and does not explicitly model temporal inconsistency; we will clarify this limitation in the revision. Our goal is to improve lightweight frame-level CLIP adaptation for cross-dataset generalization. Even without temporal modeling, our method achieves 91.6 video-level Avg. AUC with only 6.97M trainable parameters, and it can be complementary to temporal modules.
Regarding the concern about a handcrafted curriculum, we further emphasize our ablations: compared with fixed soft, fixed hard, light mask, and heavy mask variants, the proposed progressive high-activation masking obtains the best Avg. AUC of 0.904. This indicates that the gain is not from arbitrary masking or incidental hyperparameters, but from progressively strengthened shortcut suppression.
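Working note on video-level aggregation: the notes state only that video-level scores are obtained by aggregating frame-level predictions, without naming the rule. The sketch below assumes the simplest choice, averaging the per-frame fake probabilities of each video; the averaging rule and the function name are assumptions that should be checked against the actual evaluation code before being described in the paper.

```python
from collections import defaultdict


def aggregate_video_scores(frame_scores):
    """Average frame-level fake probabilities per video.

    frame_scores: iterable of (video_id, probability) pairs.
    Returns a dict mapping video_id to the mean probability (assumed rule).
    """
    buckets = defaultdict(list)
    for video_id, prob in frame_scores:
        buckets[video_id].append(prob)
    return {vid: sum(ps) / len(ps) for vid, ps in buckets.items()}
```

Video-level AUC would then be computed on these per-video scores, which involves no temporal model.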
XSa4
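- Thank the reviewer for finding the framework intuitive and interesting.
- Focus on novelty: stress the task diagnosis, the way the components are integrated in the CLIP adapter setting, and the design of a training-time forgetting regularizer with no extra mask at inference.
- Focus on the SFB assumption: add high-activation region statistics, mask visualizations, or a high-activation masking vs. random masking comparison. Status: experiment added.
XSa4 compact response draft
Chinese Response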
感谢审稿人认可 Focus-and-Forget 框架的直观性。关于 novelty,我们同意 AFB 和 SFB 分别借鉴了 sparse routing 与 activation masking 的思想;本文的贡献在于将二者针对 CLIP-based forgery adapter 的两个失效模式进行联合设计,并在同一框架下实现训练期 shortcut suppression 与推理期无额外 masking。
针对 “high activation 是否对应 shortcut cues” 的问题,我们补充了直接消融。High Activation Masking 取得 0.904 Avg. AUC,明显优于 Random Activation Masking 的 0.878 和 Low Activation Masking 的 0.874。这说明抑制高响应区域比随机遮蔽或低响应遮蔽更有效,支持 SFB 对 dominant shortcut responses 的建模假设。
English Response
We thank the reviewer for recognizing the intuitive Focus-and-Forget framework. Regarding novelty, we agree that AFB and SFB are inspired by sparse routing and activation masking, respectively. Our contribution is to jointly adapt them to two failure modes of CLIP-based forgery adapters, enabling training-time shortcut suppression while introducing no additional masking at inference.
To address the concern about whether high-activation locations correspond to shortcut cues, we added a direct ablation. High Activation Masking achieves 0.904 Avg. AUC, clearly outperforming Random Activation Masking (0.878) and Low Activation Masking (0.874). This supports our assumption that suppressing dominant high-response regions is more effective than generic masking for reducing shortcut dependence.
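Working note on the masking comparison: the three variants differ only in how the masked positions are chosen from the activation map. The sketch below illustrates that selection step on a flattened list of activations. It is not the SFB implementation; the function names, the floor-based count, and zeroing as the masking operation are all assumptions.

```python
import random


def select_mask_indices(activations, ratio, strategy, rng=None):
    """Pick positions to mask from a flattened activation map.

    strategy: "high" masks the largest activations, "low" the smallest,
    "random" a uniformly shuffled subset. The count is floor(n * ratio).
    """
    n = len(activations)
    m = int(n * ratio)
    if strategy == "high":
        order = sorted(range(n), key=lambda i: activations[i], reverse=True)
    elif strategy == "low":
        order = sorted(range(n), key=lambda i: activations[i])
    elif strategy == "random":
        rng = rng or random.Random(0)  # fixed seed only for reproducible notes
        order = list(range(n))
        rng.shuffle(order)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return set(order[:m])


def apply_mask(features, masked):
    """Zero out the masked positions (assumed masking operation)."""
    return [0.0 if i in masked else v for i, v in enumerate(features)]
```

Keeping the masked ratio identical across the three strategies is what makes the comparison a test of where to mask instead of how much to mask.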
KnET
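- Thank the reviewer for recognizing the motivation and the two limitations.
- Limited novelty: stress that the work targets forensic-specific failure modes of CLIP-based detectors, not ordinary sparse attention for generic vision tasks.
- Limited gain over Forensics Adapter: show stability with multiple metrics and scenarios, including AP, EER, DF40, robustness, and efficiency.
- Add the missing Forensics Adapter comparison on DF40, or state a concrete plan in the rebuttal. Status: experiment added.
KnET compact response draft
Chinese Response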
感谢审稿人对 motivation 和两个 limitation 的认可。关于 novelty,我们同意本文并非提出全新的基础算子;我们的贡献是针对 CLIP-based forgery adapter 中的冗余全局交互和 shortcut-driven learning,进行任务导向的 Focus-and-Forget 设计。
关于相对 Forensics Adapter 的提升幅度,我们补充了审稿人指出缺失的 DF40 对比。在六类 face-swapping forgery 上,Forensics Adapter 的 Avg. AUC 为 82.63,而本文方法达到 91.5,提升 8.87。该结果说明,在更复杂的多伪造类型场景下,本文方法相对最相关 CLIP baseline 的优势更加明显。
English Response
We thank the reviewer for recognizing the motivation and the two limitations. Regarding novelty, we agree that our work does not propose a fundamentally new primitive operator. Our contribution is a task-oriented Focus-and-Forget design for two failure modes of CLIP-based forgery adapters: redundant global interactions and shortcut-driven learning.
Regarding the limited gain over Forensics Adapter, we added the missing DF40 comparison suggested by the reviewer. Across six representative face-swapping forgery types, Forensics Adapter obtains 82.63 Avg. AUC, while our method achieves 91.5, improving by 8.87. This shows that under more diverse manipulation types, our method brings a clearer gain over the most relevant CLIP-based baseline.
aCJ9
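- Thank the reviewer for recognizing the importance of the task and the technical soundness of Focus-and-Forget.
- Whether the extra complexity is worthwhile: list trainable params going from 5.77M to 6.97M and throughput from 65.21 to 59.60 img/s to show that the overhead is small.
- Fix Figure 3: label the manipulation category of each example.
- Clarify Grad-CAM sources: state which column or image corresponds to the baseline and which to the proposed method.
- Explain the mapping {1:1,2:8,3:15} in Implementation Details; delete it if the main text no longer uses it.
- Explain the performance drop under block-wise corruption or structured perturbation.
- Add a region partition size ablation. Status: experiment added.
- Check the Figure 2 caption: unify the Artifact Focusing Block / Mechanism terminology. Status: fixed.
- Check whether the V-to-KV path through the AFM module in Figure 1 is drawn correctly. Status: confirmed correct.
aCJ9 compact response draft
Chinese Response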
感谢审稿人认可本文任务的重要性和 Focus-and-Forget 设计的合理性。关于额外复杂度是否值得,我们将在修订稿中更清楚地强调效率-性能权衡:本文仅将可训练参数从 5.77M 增加到 6.97M,吞吐从 65.21 降至 59.60 img/s,仍保持轻量高效。
针对 region partition size,我们已补充消融实验:$S=4$ 取得最佳 Avg. AUC 0.894,高于 $S=2$ 和 $S=8$ 的 0.887。对于审稿人指出的表述问题,我们已统一 Figure 2 中 Artifact Focusing Block/Mechanism 的术语,并检查确认 Figure 1 中 V 到 KV 的示意无误。修订稿中也会进一步明确 manipulation category、Grad-CAM 来源、implementation mapping 以及 structured perturbation 下的表现说明。
English Response
We thank the reviewer for recognizing the importance of the task and the technical soundness of Focus-and-Forget. Regarding whether the added complexity is justified, we will better highlight the accuracy-efficiency trade-off in the revision: our method increases trainable parameters only from 5.77M to 6.97M and reduces throughput from 65.21 to 59.60 img/s, so it remains lightweight and efficient.
For the region partition size, we added an ablation: $S=4$ achieves the best Avg. AUC of 0.894, outperforming both $S=2$ and $S=8$ (0.887). For presentation issues, we have unified the terminology of Artifact Focusing Block/Mechanism in Figure 2 and verified the V-to-KV illustration in Figure 1. We will also clarify the manipulation categories, Grad-CAM sources, implementation mapping, and the behavior under structured perturbations in the revision.
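Working note on the overhead numbers: the rebuttal quotes absolute values, and relative changes may be easier for reviewers to read. The small helper below (hypothetical name) computes them from the figures already quoted in these notes.

```python
def relative_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures quoted in the notes above.
param_overhead = relative_change(5.77, 6.97)         # trainable parameters, in M
throughput_change = relative_change(65.21, 59.60)    # images per second

print(f"params: {param_overhead:+.1f}%, throughput: {throughput_change:+.1f}%")
# prints "params: +20.8%, throughput: -8.6%"
```

Whether to quote the relative figures is a judgment call, since +20.8% trainable parameters can read as larger than the absolute 1.2M increase.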
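P2 Manuscript revision checklist
- Add a sentence at the start of the Introduction or Method: the paper does not claim BRA itself as original, but adapts it to the localized evidence and shortcut bias of forgery detection.
- State clearly in Method that SFB is enabled only during training and that masking is turned off at inference.
- In Implementation Details, spell out that t is the step, T is the saturation step, p(t) is the trigger probability, and r(t) is the keep ratio.
- Describe the video-level aggregation method in the experiments section.
- Keep the frame-based limitation in the limitations section and note that temporal modeling can be added in future work.
- Check that Forensics Adapter, Baseline Adapter, and Baseline Re-impl. are named consistently across all tables.
- Check that every figure caption correctly identifies the baseline/proposed method.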
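For the Implementation Details item, a short illustration may help check that the symbol definitions are unambiguous. The notes define t, T, p(t), and r(t) but do not give their functional form, so the sketch below assumes a linear ramp that saturates at step T. The ramp shape and the default values `p_max` and `r_min` are hypothetical placeholders, not the paper's settings.

```python
def progress(t, T):
    """Training progress clipped to [0, 1]; saturates once t reaches T."""
    return min(max(t / T, 0.0), 1.0)


def trigger_probability(t, T, p_max=0.5):
    """p(t): probability of applying the mask at step t (assumed linear ramp)."""
    return p_max * progress(t, T)


def keep_ratio(t, T, r_min=0.7):
    """r(t): fraction of positions kept at step t, decaying from 1.0 to r_min
    (assumed linear decay)."""
    return 1.0 - (1.0 - r_min) * progress(t, T)
```

Under this assumed form, masking starts off (p = 0, r = 1) and grows stronger until step T, after which both values stay fixed. If the actual schedule differs, the same three definitions can be stated with the real formula.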
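Rebuttal draft structure
- Opening: thank the reviewers and briefly state that we will clarify novelty, add experiments, and revise the manuscript.
- Point 1 Novelty: respond on compositional novelty and forensic-specific adaptation.
- Point 2 Effectiveness: respond on the size of the gain over Forensics Adapter and the overhead.
- Point 3 Mechanism Evidence: respond on sparse routing and the SFB assumption.
- Point 4 Additional Experiments: list DF40 Forensics Adapter, partition size, attention alternatives, and the quantitative SFB validation.
- Point 5 Minor Clarifications: handle the Figure, Grad-CAM, mapping, and terminology issues together.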