## Small print
<p style="background-color: #fff9f9; border: 1px solid #ff0000; padding: 10px;">
Warning: This demo is highly experimental and not ready for production use.
</p>
This demo is a proof of concept for visualizing the semantic differences between two text documents.
The input documents may or may not be written in the same language.
In our paper, we evaluate three simple, unsupervised approaches based on BERT-like encoder models.
This demo implements the approaches `DiffAlign` and `DiffDel` using the model [ZurichNLP/unsup-simcse-xlm-roberta-base](https://huggingface.co/ZurichNLP/unsup-simcse-xlm-roberta-base). See the model tags for a list of the ~100 supported languages.
- `DiffAlign` aligns the words of the two documents using cosine similarity between the word embeddings (cf. [SimAlign](http://dx.doi.org/10.18653/v1/2020.findings-emnlp.147), [BERTScore](https://openreview.net/forum?id=SkeHuCVFDr)). Words with low similarity are highlighted.
- `DiffDel` calculates sentence similarity between the two input documents (cf. [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552)). Words whose deletion increases the similarity score are highlighted.
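
The two scoring ideas above can be illustrated with a minimal, dependency-free sketch. It is not the code used in this demo: the hand-made 4-dimensional vectors and the helpers `toy_embed` and `toy_encode` are hypothetical stand-ins for the encoder model, and the scores are simplified (for example, the paper's exact normalization and any n-gram handling are not reproduced).

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def diff_align(emb_a: List[Vector], emb_b: List[Vector]) -> List[float]:
    """DiffAlign-style score: 1 minus the best cosine match in the other document.

    Assumes emb_b is non-empty. A high score means the word has no close
    counterpart in the other document.
    """
    return [1.0 - max(cosine(u, v) for v in emb_b) for u in emb_a]


def diff_del(
    words_a: List[str],
    words_b: List[str],
    encode: Callable[[List[str]], Vector],
) -> List[float]:
    """DiffDel-style score: change in document similarity when a word is deleted.

    A positive score means that removing the word makes the two documents
    more similar.
    """
    base = cosine(encode(words_a), encode(words_b))
    gains = []
    for i in range(len(words_a)):
        reduced = words_a[:i] + words_a[i + 1:]
        gains.append(cosine(encode(reduced), encode(words_b)) - base)
    return gains


# Hypothetical toy vectors standing in for contextual word embeddings.
TOY_VECTORS = {
    "the": (1.0, 0.0, 0.0, 0.0),
    "cat": (0.0, 1.0, 0.0, 0.0),
    "sleeps": (0.0, 0.0, 1.0, 0.0),
    "loudly": (0.0, 0.0, 0.0, 1.0),
}


def toy_embed(words: List[str]) -> List[Vector]:
    """Look up one toy vector per word."""
    return [TOY_VECTORS[w] for w in words]


def toy_encode(words: List[str]) -> Vector:
    """Sum the word vectors into one document vector (zero vector if empty)."""
    vectors = toy_embed(words)
    return [sum(dim) for dim in zip(*vectors)] if vectors else [0.0] * 4


doc_a = ["the", "cat", "sleeps", "loudly"]
doc_b = ["the", "cat", "sleeps"]

align_scores = diff_align(toy_embed(doc_a), toy_embed(doc_b))
del_scores = diff_del(doc_a, doc_b, toy_encode)
```

In this toy setup, "loudly" is the only word in `doc_a` without a counterpart in `doc_b`, so it receives the highest `align_scores` entry and the only positive `del_scores` entry. With a real encoder the scores are continuous, and a threshold would be needed to decide which words to highlight.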
More resources:
- Paper: https://arxiv.org/abs/2305.13303
- Code: https://github.com/ZurichNLP/recognizing-semantic-differences
## Citation
```bibtex
@inproceedings{vamvas-sennrich-2023-rsd,
    title = "Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents",
    author = "Vamvas, Jannis and Sennrich, Rico",
    month = dec,
    year = "2023",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
}
```