Chen Junya, Xiu Zidi, Goldstein Benjamin A, Henao Ricardo, Carin Lawrence, Tao Chenyang
Duke University.
Adv Neural Inf Process Syst. 2021 Dec;34:21229-21243.
Dealing with severe class imbalance poses a major challenge for many real-world applications, especially when the accurate classification and generalization of minority classes are of primary interest. In computer vision and NLP, learning from datasets with long-tail behavior is a recurring theme, especially for naturally occurring labels. Existing solutions mostly appeal to sampling or weighting adjustments to alleviate the extreme imbalance, or impose inductive bias to prioritize generalizable associations. Here we take a novel perspective to promote sample efficiency and model generalization based on the invariance principles of causality. Our contribution posits a meta-distributional scenario, in which the causal generating mechanism for label-conditional features is invariant across different labels. Such a causal assumption enables efficient knowledge transfer from the dominant classes to their under-represented counterparts, even if their feature distributions show apparent disparities. This allows us to leverage a causal data augmentation procedure to enlarge the representation of minority classes. Our development is orthogonal to existing imbalanced-data learning techniques and thus can be seamlessly integrated with them. The proposed approach is validated on an extensive set of synthetic and real-world tasks against state-of-the-art solutions.
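The core idea — that an invariant label-conditional generating mechanism lets variation learned from a dominant class be grafted onto a minority class — can be illustrated with a toy sketch. The code below is not the authors' method; it is a minimal, hypothetical example assuming the shared mechanism is simply the within-class variation around each class mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: one dominant class (many samples) and one
# minority class (few samples), differing only in their class means.
n_major, n_minor, d = 500, 10, 2
x_major = rng.normal(loc=[0.0, 0.0], scale=[1.0, 0.5], size=(n_major, d))
x_minor = rng.normal(loc=[3.0, 3.0], scale=[1.0, 0.5], size=(n_minor, d))

# Invariance assumption (illustrative): the label-conditional generating
# mechanism -- here, the residual variation around the class mean -- is
# shared across labels, so it can be estimated from the dominant class.
residuals = x_major - x_major.mean(axis=0)

# Augmentation in the spirit of the abstract: transfer the dominant class's
# variation onto the minority class's mean to synthesize new minority samples.
n_new = 100
idx = rng.choice(n_major, size=n_new, replace=False)
x_minor_aug = x_minor.mean(axis=0) + residuals[idx]

print(x_minor_aug.shape)  # (100, 2)
```

The synthesized samples inherit the minority class's location but the dominant class's richer variation, enlarging the minority class's effective representation; the paper's actual procedure operates on learned features under a formal causal model rather than raw class means.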