Commit e922b4d by prithivida · 1 Parent(s): 165d869

Create README.md

Files changed (1)
  1. README.md +95 -0
README.md ADDED
@@ -0,0 +1,95 @@
+ # Parrot
+
+ ## 1. What is Parrot?
+ Parrot is a paraphrase-based utterance augmentation framework purpose-built to accelerate the training of NLU models. A paraphrase framework is more than just a paraphrasing model.
+
+ ## 2. Why Parrot?
+ **Hugging Face** lists [12 paraphrase models](https://huggingface.co/models?pipeline_tag=text2text-generation&search=paraphrase), **RapidAPI** lists 7 freemium and commercial paraphrasers like [QuillBot](https://rapidapi.com/search/paraphrase?section=apis&page=1), Rasa has discussed an experimental paraphraser for augmenting text data [here](https://forum.rasa.com/t/paraphrasing-for-nlu-data-augmentation-experimental/27744), Sentence-transformers offers a [paraphrase mining utility](https://www.sbert.net/examples/applications/paraphrase-mining/README.html), and [NLPAug](https://github.com/makcedward/nlpaug) offers word-level augmentation with [PPDB](http://paraphrase.org/#/download) (a multi-million paraphrase database). While these attempts at paraphrasing are great, there are still some gaps, and paraphrasing is NOT yet a mainstream option for text augmentation in building NLU models. Parrot is a humble attempt to fill some of these gaps.
+
+ **What is a good paraphrase?** Almost all conditioned text generation models are validated on two factors: (1) whether the generated text conveys the same meaning as the original context (Adequacy), and (2) whether the text is fluent, grammatically correct English (Fluency). For instance, Neural Machine Translation outputs are tested for Adequacy and Fluency. But [a good paraphrase](https://www.aclweb.org/anthology/D10-1090.pdf) should be adequate and fluent while being as different as possible in its surface lexical form. With respect to this definition, the **3 key metrics** that measure the quality of paraphrases are:
+ - **Adequacy** (Is the meaning preserved adequately?)
+ - **Fluency** (Is the paraphrase fluent English?)
+ - **Diversity (Lexical / Phrasal / Syntactical)** (How much has the paraphrase changed the original sentence?)
+
+ *Parrot offers knobs to control Adequacy, Fluency and Diversity as per your needs.*
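To make the Diversity metric concrete, here is a minimal sketch of one way to score surface-level change: a character-level Levenshtein distance normalized by the longer string. The function names `levenshtein` and `surface_diversity` are illustrative assumptions and not part of Parrot's API, and the README does not state which exact formula Parrot uses internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution
            ))
        previous = current
    return previous[-1]


def surface_diversity(source: str, paraphrase: str) -> float:
    """Normalized edit distance in [0, 1]; 0 means the strings are identical."""
    source, paraphrase = source.lower(), paraphrase.lower()
    longest = max(len(source), len(paraphrase))
    return levenshtein(source, paraphrase) / longest if longest else 0.0


source = "can you recommend some upscale restaurants in rome?"
candidates = [
    "can you recommend some upscale restaurants in rome?",
    "which upscale restaurants are recommended in rome?",
    "which are the best restaurants in rome?",
]

# Rank candidates from most to least different from the source.
ranked = sorted(candidates, key=lambda c: surface_diversity(source, c), reverse=True)
for candidate in ranked:
    print(f"{surface_diversity(source, candidate):.2f}  {candidate}")
```

A score like this only captures surface change. It says nothing about Adequacy or Fluency, which is why the three metrics have to be considered together.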
+
+ **What makes a paraphraser a good augmentor?** To train an NLU model we don't just need a lot of utterances; we need utterances annotated with intents and slots/entities. A typical flow would be:
+ - Given an **input utterance + input annotations**, a good augmentor spits out N **output paraphrases** while preserving the intent and slots.
+ - The output paraphrases are then converted into annotated data using the input annotations from the first step.
+ - The annotated data created from the output paraphrases then makes up the training dataset for your NLU model.
+
+ But in general, being generative models, paraphrasers don't guarantee that the slots/entities are preserved. So the ability to generate high-quality paraphrases in a constrained fashion, without trading off the intents and slots for lexical dissimilarity, is what makes a paraphraser a good augmentor. *More on this in section 3 below.*
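The flow above can be sketched roughly as follows. This is an illustrative assumption, not Parrot's implementation: the `Annotated` class, the `reannotate` helper, and the intent and slot names are all hypothetical, and the slot check is a naive case-insensitive substring match.

```python
from dataclasses import dataclass


@dataclass
class Annotated:
    text: str
    intent: str
    slots: dict  # slot name -> slot value as it appears in the text


def reannotate(source: Annotated, paraphrases: list) -> list:
    """Carry the source intent and slots over to paraphrases that kept every slot value."""
    kept = []
    for paraphrase in paraphrases:
        lowered = paraphrase.lower()
        # Discard paraphrases that dropped or altered any slot value.
        if all(value.lower() in lowered for value in source.slots.values()):
            kept.append(Annotated(paraphrase, source.intent, dict(source.slots)))
    return kept


source = Annotated(
    text="Can you recommend some upscale restaurants in Rome?",
    intent="restaurant_search",                         # hypothetical intent label
    slots={"price_range": "upscale", "city": "Rome"},   # hypothetical slot names
)
paraphrases = [
    "which upscale restaurants are recommended in rome?",
    "which are the best restaurants in rome?",          # drops the "upscale" slot
    "are there any upscale restaurants near rome?",
]

augmented = reannotate(source, paraphrases)
for example in augmented:
    print(example.intent, example.slots, example.text)
```

Here the second paraphrase is rejected because it lost the `upscale` slot value. That is the failure mode described above: the sentence is a fine paraphrase on its own, yet unusable as annotated training data.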
+
+ ### Installation
+ ```bash
+ pip install parrot
+ ```
+
+ ### Quickstart
+ ```python
+ import warnings
+
+ from parrot import Parrot
+
+ warnings.filterwarnings("ignore")
+
+ # Load the pre-trained paraphraser; set use_gpu=False if no GPU is available.
+ parrot = Parrot(model_tag="prithivida/parrot_paraphraser_on_T5", use_gpu=True)
+
+ phrases = [
+     "Can you recommed some upscale restaurants in Rome?",
+     "What are the famous places we should not miss in Russia?",
+ ]
+
+ for phrase in phrases:
+     print("-" * 100)
+     print(f"Input_phrase: {phrase}")
+     print("-" * 100)
+     # Generate paraphrases for this utterance using the default knobs.
+     para_phrases = parrot.augment(input_phrase=phrase)
+     for para_phrase in para_phrases:
+         print(para_phrase)
+ ```
+
+ <pre>
+
+
+ -----------------------------------------------------------------------------
+ Input_phrase: Can you recommed some upscale restaurants in Rome?
+ -----------------------------------------------------------------------------
+ "which upscale restaurants are recommended in rome?"
+ "which are the best restaurants in rome?"
+ "are there any upscale restaurants near rome?"
+ "can you recommend a good restaurant in rome?"
+ "can you recommend some of the best restaurants in rome?"
+ "can you recommend some best restaurants in rome?"
+ "can you recommend some upscale restaurants in rome?"
+ -----------------------------------------------------------------------------
+ Input_phrase: What are the famous places we should not miss in Russia
+ -----------------------------------------------------------------------------
+ "which are the must do places for tourists to visit in russia?"
+ "what are the best places to visit in russia?"
+ "what are some of the most visited sights in russia?"
+ "what are some of the most beautiful places in russia that tourists should not miss?"
+ "which are some of the most beautiful places to visit in russia?"
+ "what are some of the most important places to visit in russia?"
+ "what are some of the most famous places of russia?"
+ "what are some places we should not miss in russia?"
+
+ </pre>
+
+ ### Knobs
+
+ ```python
+ para_phrases = parrot.augment(input_phrase=phrase,
+                               diversity_ranker="levenshtein",
+                               do_diverse=False,
+                               max_return_phrases=10,
+                               max_length=32,
+                               adequacy_threshold=0.99,
+                               fluency_threshold=0.90)
+ ```
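As a rough intuition for the two threshold knobs, the sketch below shows one plausible reading: candidates scoring below either threshold are dropped, and the survivors are capped at `max_return_phrases`. This is an assumption for illustration, not Parrot's actual implementation. `select_paraphrases` is a hypothetical helper and the scores are made up.

```python
def select_paraphrases(scored_candidates, adequacy_threshold=0.99,
                       fluency_threshold=0.90, max_return_phrases=10):
    """Drop candidates below either threshold, then cap the number returned."""
    passing = [
        candidate["text"]
        for candidate in scored_candidates
        if candidate["adequacy"] >= adequacy_threshold
        and candidate["fluency"] >= fluency_threshold
    ]
    return passing[:max_return_phrases]


# Made-up scores for illustration only; real scores would come from scoring models.
scored_candidates = [
    {"text": "which upscale restaurants are recommended in rome?",
     "adequacy": 0.995, "fluency": 0.97},
    {"text": "which are the best restaurants in rome?",
     "adequacy": 0.90, "fluency": 0.98},
    {"text": "are there any upscale restaurants near rome?",
     "adequacy": 0.992, "fluency": 0.85},
]

strict = select_paraphrases(scored_candidates)
relaxed = select_paraphrases(scored_candidates,
                             adequacy_threshold=0.85, fluency_threshold=0.80)
print(strict)
print(relaxed)
```

Under this reading, raising the thresholds trades the number of paraphrases returned for stricter meaning preservation and fluency, while lowering them returns more, and potentially noisier, candidates.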
+
+
+ ## 3. Scope
+
+ In the space of conversational engines, knowledge bots are those to which **we ask questions** like *"When was the Berlin Wall torn down?"*, transactional bots are those to which **we give commands** like *"Turn on the music please"*, and voice assistants are the ones that can both answer questions and act on our commands. Parrot mainly focuses on augmenting texts typed into or spoken to conversational interfaces for building robust NLU models. (*People usually neither type out nor yell out long paragraphs to conversational interfaces. Hence the pre-trained model is trained on text samples with a maximum length of 64.*)
+
+ *While Parrot predominantly aims to be a text augmentor for building good NLU models, it can also be used as a pure-play paraphraser.*