wmt22-cometkiwi-da / README.md
metadata
extra_gated_heading: Acknowledge license to accept the repository
extra_gated_button_content: Acknowledge license
pipeline_tag: translation
language:
  - multilingual
  - af
  - am
  - ar
  - as
  - az
  - be
  - bg
  - bn
  - br
  - bs
  - ca
  - cs
  - cy
  - da
  - de
  - el
  - en
  - eo
  - es
  - et
  - eu
  - fa
  - fi
  - fr
  - fy
  - ga
  - gd
  - gl
  - gu
  - ha
  - he
  - hi
  - hr
  - hu
  - hy
  - id
  - is
  - it
  - ja
  - jv
  - ka
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - la
  - lo
  - lt
  - lv
  - mg
  - mk
  - ml
  - mn
  - mr
  - ms
  - my
  - ne
  - nl
  - 'no'
  - om
  - or
  - pa
  - pl
  - ps
  - pt
  - ro
  - ru
  - sa
  - sd
  - si
  - sk
  - sl
  - so
  - sq
  - sr
  - su
  - sv
  - sw
  - ta
  - te
  - th
  - tl
  - tr
  - ug
  - uk
  - ur
  - uz
  - vi
  - xh
  - yi
  - zh
license: cc-by-nc-sa-4.0

This is a COMET quality estimation model by Unbabel: it receives a source sentence and its translation and returns a score that reflects the quality of the translation.

Paper

CometKiwi: IST-Unbabel 2022 Submission for the Quality Estimation Shared Task (Rei et al., WMT 2022)

License:

cc-by-nc-sa-4.0

Usage for Inference Endpoint

import requests

# Fill in the URL of your deployed Inference Endpoint and your access token.
API_URL = ""
API_TOKEN = "MY_API_KEY"
headers = {
    "Authorization": f"Bearer {API_TOKEN}",
    "Content-Type": "application/json",
}

def query(url, headers, payload):
    # Send the payload as JSON and return the decoded JSON response.
    response = requests.post(url, headers=headers, json=payload)
    # Fail loudly on HTTP errors instead of parsing an error body as scores.
    response.raise_for_status()
    return response.json()

# Each item pairs a source sentence ("src") with its machine translation ("mt").
payload = {
    "inputs": {
        "batch_size": 8,
        "workers": None,
        "data": [
            {
                "src": "You'll be picking fruit and generally helping us do all the usual farm work",
                "mt": "당신은 과일을 따기도 하고 대체로 우리가 하는 일상적인 농장 일을 돕게 될 겁니다",
            },{
                "src": "You'll be picking fruit and generally helping us do all the usual farm work",
                "mt": "당신은 과일을 따기도 하고 대체로 우리가 하는 일상적인 농장 일을 돕게 될 겁니다",
            },{
                "src": "You'll be picking fruit and generally helping us do all the usual farm work",
                "mt": "당신은 과일을 따기도 하고 대체로 우리가 하는 일상적인 농장 일을 돕게 될 겁니다",
            },{
                "src": "You'll be picking fruit and generally helping us do all the usual farm work",
                "mt": "당신은 과일을 따기도 하고 대체로 우리가 하는 일상적인 농장 일을 돕게 될 겁니다",
            },{
                "src": "You'll be picking fruit and generally helping us do all the usual farm work",
                "mt": "당신은 과일을 따기도 하고 대체로 우리가 하는 일상적인 농장 일을 돕게 될 겁니다",
            },{
                "src": "You'll be picking fruit and generally helping us do all the usual farm work",
                "mt": "당신은 과일을 따기도 하고 대체로 우리가 하는 일상적인 농장 일을 돕게 될 겁니다",
            },{
                "src": "You'll be picking fruit and generally helping us do all the usual farm work",
                "mt": "당신은 과일을 따기도 하고 대체로 우리가 하는 일상적인 농장 일을 돕게 될 겁니다",
            },
        ]
    }
}

scores = query(API_URL, headers, payload)
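
The payload above is written out by hand. As a minimal sketch, the same structure could be built from a list of (source, translation) pairs. `build_payload` is a hypothetical helper, not part of any library, and it assumes the endpoint accepts exactly the `inputs` / `batch_size` / `workers` / `data` layout shown in the example.

```python
from typing import Iterable, Optional, Tuple


def build_payload(
    pairs: Iterable[Tuple[str, str]],
    batch_size: int = 8,
    workers: Optional[int] = None,
) -> dict:
    """Hypothetical helper: build a request body in the layout shown above."""
    data = []
    for index, (src, mt) in enumerate(pairs):
        # Reject empty segments early instead of sending them to the endpoint.
        if not src.strip() or not mt.strip():
            raise ValueError(f"pair {index} has an empty source or translation")
        data.append({"src": src, "mt": mt})
    if not data:
        raise ValueError("at least one (src, mt) pair is required")
    return {"inputs": {"batch_size": batch_size, "workers": workers, "data": data}}


pairs = [
    (
        "You'll be picking fruit and generally helping us do all the usual farm work",
        "당신은 과일을 따기도 하고 대체로 우리가 하는 일상적인 농장 일을 돕게 될 겁니다",
    ),
]
# Repeat the single pair seven times to mirror the seven-item batch above.
payload = build_payload(pairs * 7)
```

The resulting `payload` can be passed to `query` unchanged.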

Intended uses

Unbabel's model is intended to be used for reference-free MT evaluation.

Given a source text and its translation, it outputs a single score between 0 and 1, where 1 represents a perfect translation.
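
To illustrate how per-segment scores on this 0 to 1 scale might be summarized, here is a minimal sketch. It assumes the scores have already been extracted into a plain list of floats; the response format of the endpoint is not specified here, so that extraction step is left out. `summarize_scores` is a hypothetical helper, and the input numbers are made up for illustration.

```python
from statistics import mean
from typing import Dict, List, Sequence, Union


def summarize_scores(scores: Sequence[float]) -> Dict[str, Union[int, float, List[int]]]:
    """Hypothetical helper: average segment scores and flag unexpected values."""
    if not scores:
        raise ValueError("no scores to summarize")
    # Record positions of scores outside the documented 0..1 range rather than
    # assuming they can never occur.
    out_of_range = [i for i, s in enumerate(scores) if not 0.0 <= s <= 1.0]
    return {
        "segments": len(scores),
        "mean": mean(scores),
        "lowest": min(scores),
        "out_of_range": out_of_range,
    }


# Made-up numbers for illustration only, not real model output.
summary = summarize_scores([0.5, 0.75, 1.0])
```

Averaging is one simple way to get a document-level figure; the lowest score points at the segment most worth inspecting.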

Languages Covered:

This model builds on top of InfoXLM, which covers the following languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.

Thus, results for language pairs containing uncovered languages are unreliable!
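
A minimal sketch of guarding against uncovered languages before scoring. The code set is copied from the `language` list in this card's metadata (without the `multilingual` entry); the helper names `normalize` and `uncovered_languages` are hypothetical, and reducing a tag such as `zh-CN` to its first subtag is an assumption made for illustration.

```python
from typing import List

# Language codes taken from the `language` list in the model card metadata.
COVERED_LANGUAGES = frozenset(
    """
    af am ar as az be bg bn br bs ca cs cy da de el en eo es et eu fa fi fr fy
    ga gd gl gu ha he hi hr hu hy id is it ja jv ka kk km kn ko ku ky la lo lt
    lv mg mk ml mn mr ms my ne nl no om or pa pl ps pt ro ru sa sd si sk sl so
    sq sr su sv sw ta te th tl tr ug uk ur uz vi xh yi zh
    """.split()
)


def normalize(code: str) -> str:
    """Reduce a tag such as 'zh-CN' or 'pt_BR' to its lowercase first subtag."""
    return code.strip().lower().replace("_", "-").split("-")[0]


def uncovered_languages(src_lang: str, tgt_lang: str) -> List[str]:
    """Return the languages of the pair that are missing from the covered set."""
    codes = (normalize(src_lang), normalize(tgt_lang))
    return [code for code in codes if code not in COVERED_LANGUAGES]


missing = uncovered_languages("en", "ko")
if missing:
    print(f"Scores may be unreliable; uncovered languages: {missing}")
```

An empty list means both sides of the pair appear in the metadata; any other result names the languages for which scores should be treated with caution.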