---
library_name: transformers
tags:
- paraphraser
license: mit
pipeline_tag: summarization
---

# Model Card for SamSJackson/paraphrase-dipper-no-ctx

[Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense](https://arxiv.org/pdf/2303.13408.pdf) proposed a strong discourse paraphraser known as DIPPER.

DIPPER is a large model, built from [google/t5-efficient-xxl](https://huggingface.co/google/t5-efficient-xxl) and finetuned on 6.3M datapoints.

This model is a lightweight, non-context equivalent for lower-cost usage. It is built from [google/t5-efficient-large-nl32](https://huggingface.co/google/t5-efficient-large-nl32) and finetuned on 100,000 datapoints. Notably, all of the datapoints are non-context. Refer to the original paper for further background on this topic.

The dataset used to finetune this model is available here: [Dataset](https://huggingface.co/datasets/SamSJackson/kpar3-no-ctx)

## Model Details

### Model Description

- **Developed by:** Sam Jackson
- **Model type:** Sequence-to-Sequence Model
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** [google/t5-efficient-large-nl32](https://huggingface.co/google/t5-efficient-large-nl32)

### Model Sources

- **Repository:** [Original Github](https://github.com/martiansideofthemoon/ai-detection-paraphrases)
- **Paper:** [Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense](https://arxiv.org/pdf/2303.13408.pdf)

## Uses

The model is intended for controllable paraphrasing. The training data exposes two parameters, lexical (word choice) and order (paragraph structure), which control how strongly the text is paraphrased. See the Example Usage section for details.

### Direct Use

The model is entirely usable from the uploaded state.
No further finetuning is required, although it is possible.

### Downstream Use

This model was finetuned from a T5 checkpoint. It can be finetuned further if desired. If you plan to do transfer learning, I recommend starting from the initial checkpoint instead: [google/t5-efficient-large-nl32](https://huggingface.co/google/t5-efficient-large-nl32).

### Recommendations

If you have the capacity, I recommend using the more powerful model: [DIPPER](https://github.com/martiansideofthemoon/ai-detection-paraphrases)

Otherwise, this model is sufficiently strong. It outperforms the sentence-based paraphraser [ChatGPT Paraphraser](https://huggingface.co/humarin/chatgpt_paraphraser_on_T5_base) on perplexity scores when both models are evaluated using the facebook/opt-2.7b model.

## How to Get Started with the Model

See the Example Usage section below for code to get started with the model.

## Training Details

### Training Data

As mentioned, the training data is available here: [kpar3-no-ctx](https://huggingface.co/datasets/SamSJackson/kpar3-no-ctx)

Pre-processing consists only of tokenisation with the google/t5-efficient-large-nl32 tokenizer.

The data consists of classic paraphrase pairs. However, the first element of each pair carries the terms "lexical = x" and "order = y". The values x and y are in the set {0, 20, 40, 60, 80, 100} and denote the strength with which the model should paraphrase.

In particular, a sentence with "lexical = 0" should change as many words as possible while maintaining the original meaning. Similarly, a sentence with "order = 0" should restructure the paragraph to the greatest extent the model can.

The dataset only contains parameter values in increments of 20.

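
Because the training data only covers control values in steps of 20, it may help to normalise user-supplied values before building a prompt. The sketch below is one possible way to do that and is not part of the released code: `snap_to_level` and `build_prompt` are hypothetical helper names, and snapping to the nearest trained level is an assumption about what works well, not something evaluated here. The prompt layout itself follows the example usage in this card.

```python
# Control values present in the finetuning data (see Training Data above).
ALLOWED_LEVELS = (0, 20, 40, 60, 80, 100)


def snap_to_level(value: int) -> int:
    """Return the trained level closest to `value` (ties go to the lower level).

    Hypothetical helper: snapping is an assumption, not the author's method.
    """
    if not 0 <= value <= 100:
        raise ValueError(f"control value must be in [0, 100], got {value}")
    return min(ALLOWED_LEVELS, key=lambda level: abs(level - value))


def build_prompt(text: str, lexical: int = 20, order: int = 40) -> str:
    """Build a model input string in the "lexical = x, order = y <text>" layout.

    Lower values request a stronger paraphrase, as described above.
    """
    lexical = snap_to_level(lexical)
    order = snap_to_level(order)
    return f"lexical = {lexical}, order = {order} {text}"


print(build_prompt("I take my dog for a walk in Central Park.", lexical=33, order=50))
# prints: lexical = 40, order = 40 I take my dog for a walk in Central Park.
```

The returned string can be passed to the tokenizer in place of a hand-written prompt.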
#### Training Hyperparameters

- **Training regime:**

```python
learning_rate = 1e-4
bf16 = True
num_train_epochs = 2
auto_find_batch_size = True
generation_num_beams = 2
generation_max_length = 200
```

#### Speeds, Sizes, Times

Finetuning on 100,000 datapoints took around 14 GPU hours on an RTX 3090.

### Example Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The tokenizer comes from the base checkpoint; the weights are the finetuned model.
tokenizer = AutoTokenizer.from_pretrained("google/t5-efficient-large-nl32")
model = AutoModelForSeq2SeqLM.from_pretrained("SamSJackson/paraphrase-dipper-no-ctx")
model = model.to(device)

text = "Each Wednesday, I take my dog for a walk in Central Park."

# Control codes: use values from {0, 20, 40, 60, 80, 100}.
lexical = 20
order = 40

prompt = f"lexical = {lexical}, order = {order} {text}"

# Tokenize the prompt and move the tensors to the model's device.
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    padding="longest",
    max_length=1000,
    truncation=True,
).to(device)

# Sample a paraphrase with nucleus sampling.
outputs = model.generate(
    **inputs,
    top_p=0.75,
    do_sample=True,
    max_new_tokens=300,
)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
response = " ".join(response)

print(response)
```

## Citation

**BibTeX:**

```
@misc{krishna2023paraphrasing,
      title={Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense},
      author={Kalpesh Krishna and Yixiao Song and Marzena Karpinska and John Wieting and Mohit Iyyer},
      year={2023},
      eprint={2303.13408},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Model Card Contact

Contact me through Hugging Face if you have any questions.