	---
	tags:
	- generated_from_trainer
	- ner
	- named-entity-recognition
	- span-marker
	model-index:
	- name: span-marker-bert-base-multilingual-cased-multinerd
	  results:
	  - task:
	      type: token-classification
	      name: Named Entity Recognition
	    dataset:
	      type: Babelscape/multinerd
	      name: MultiNERD
	      split: test
	      revision: 2814b78e7af4b5a1f1886fe7ad49632de4d9dd25
	    metrics:
	    - type: f1
	      value: 0.9261
	      name: F1
	    - type: precision
	      value: 0.9242
	      name: Precision
	    - type: recall
	      value: 0.9281
	      name: Recall
	license: apache-2.0
	datasets:
	- Babelscape/multinerd
	metrics:
	- precision
	- recall
	- f1
	pipeline_tag: token-classification
	language:
	- de
	- en
	- es
	- fr
	- it
	- nl
	- pl
	- pt
	- ru
	- zh
	---


	# span-marker-bert-base-multilingual-cased-multinerd

	This model is a fine-tuned version of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) on the [Babelscape/multinerd](https://huggingface.co/datasets/Babelscape/multinerd) dataset.
	It achieves the following results on the test set:
	- Loss: 0.0049
	- Overall Precision: 0.9242
	- Overall Recall: 0.9281
	- Overall F1: 0.9261
	- Overall Accuracy: 0.9852


	This is a replication of Tom Aarsen's work. Everything remains unchanged,
	except that we extended training to 3 epochs for a slightly longer
	training duration and set `gradient_accumulation_steps` to 2.
	Please refer to the official [model page](https://huggingface.co/tomaarsen/span-marker-mbert-base-multinerd) to review the original results and training script.

	## Label set

	| Class | Description | Examples |
	|-------|-------------|----------|
	| PER (person) | People | Ray Charles, Jessica Alba, Leonardo DiCaprio, Roger Federer, Anna Massey. |
	| ORG (organization) | Associations, companies, agencies, institutions, nationalities and religious or political groups | University of Edinburgh, San Francisco Giants, Google, Democratic Party. |
	| LOC (location) | Physical locations (e.g. mountains, bodies of water), geopolitical entities (e.g. cities, states), and facilities (e.g. bridges, buildings, airports). | Rome, Lake Paiku, Chrysler Building, Mount Rushmore, Mississippi River. |
	| ANIM (animal) | Breeds of dogs, cats and other animals, including their scientific names. | Maine Coon, African Wild Dog, Great White Shark, New Zealand Bellbird. |
	| BIO (biological) | Genus of fungus, bacteria and protoctists, families of viruses, and other biological entities. | Herpes Simplex Virus, Escherichia Coli, Salmonella, Bacillus Anthracis. |
	| CEL (celestial) | Planets, stars, asteroids, comets, nebulae, galaxies and other astronomical objects. | Sun, Neptune, Asteroid 187 Lamberta, Proxima Centauri, V838 Monocerotis. |
	| DIS (disease) | Physical, mental, infectious, non-infectious, deficiency, inherited, degenerative, social and self-inflicted diseases. | Alzheimer’s Disease, Cystic Fibrosis, Dilated Cardiomyopathy, Arthritis. |
	| EVE (event) | Sport events, battles, wars and other events. | American Civil War, 2003 Wimbledon Championships, Cannes Film Festival. |
	| FOOD (food) | Foods and drinks. | Carbonara, Sangiovese, Cheddar Beer Fondue, Pizza Margherita. |
	| INST (instrument) | Technological instruments, mechanical instruments, musical instruments, and other tools. | Spitzer Space Telescope, Commodore 64, Skype, Apple Watch, Fender Stratocaster. |
	| MEDIA (media) | Titles of films, books, magazines, songs and albums, fictional characters and languages. | Forbes, American Psycho, Kiss Me Once, Twin Peaks, Disney Adventures. |
	| PLANT (plant) | Types of trees, flowers, and other plants, including their scientific names. | Salix, Quercus Petraea, Douglas Fir, Forsythia, Artemisia Maritima. |
	| MYTH (mythological) | Mythological and religious entities. | Apollo, Persephone, Aphrodite, Saint Peter, Pope Gregory I, Hercules. |
	| TIME (time) | Specific and well-defined time intervals, such as eras, historical periods, centuries, years and important days. No months and days of the week. | Renaissance, Middle Ages, Christmas, Great Depression, 17th Century, 2012. |
	| VEHI (vehicle) | Cars, motorcycles and other vehicles. | Ferrari Testarossa, Suzuki Jimny, Honda CR-X, Boeing 747, Fairey Fulmar. |
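
For downstream code it can help to keep the label abbreviations from the table in a small lookup. The following is a minimal sketch; `LABEL_NAMES` and `describe_label` are hypothetical helper names introduced here for illustration and are not part of `span_marker`.

```python
# Hypothetical helper (not part of span_marker): map the label
# abbreviations from the label set table to their full names.
LABEL_NAMES = {
    "PER": "person",
    "ORG": "organization",
    "LOC": "location",
    "ANIM": "animal",
    "BIO": "biological",
    "CEL": "celestial",
    "DIS": "disease",
    "EVE": "event",
    "FOOD": "food",
    "INST": "instrument",
    "MEDIA": "media",
    "PLANT": "plant",
    "MYTH": "mythological",
    "TIME": "time",
    "VEHI": "vehicle",
}


def describe_label(label: str) -> str:
    """Return a readable name for a label, or the label itself if unknown."""
    return LABEL_NAMES.get(label, label)


print(len(LABEL_NAMES))        # 15
print(describe_label("FOOD"))  # food
```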



	## Inference Example

	```python
	# Install SpanMarker first (run in a shell, not in Python):
	#   pip install span_marker

	from span_marker import SpanMarkerModel

	model = SpanMarkerModel.from_pretrained("lxyuan/span-marker-bert-base-multilingual-cased-multinerd")

	description = (
	    "Singapore is renowned for its hawker centers offering dishes "
	    "like Hainanese chicken rice and laksa, while Malaysia boasts dishes such as "
	    "nasi lemak and rendang, reflecting its rich culinary heritage."
	)

	# Returns a list of dicts, one per predicted entity span.
	entities = model.predict(description)
	print(entities)
	# Output:
	# [
	#  {'span': 'Singapore', 'label': 'LOC', 'score': 0.999988317489624, 'char_start_index': 0, 'char_end_index': 9},
	#  {'span': 'Hainanese chicken rice', 'label': 'FOOD', 'score': 0.9894770383834839, 'char_start_index': 66, 'char_end_index': 88},
	#  {'span': 'laksa', 'label': 'FOOD', 'score': 0.9224908947944641, 'char_start_index': 93, 'char_end_index': 98},
	#  {'span': 'Malaysia', 'label': 'LOC', 'score': 0.9999839067459106, 'char_start_index': 106, 'char_end_index': 114}
	# ]
	#
	# Missed: "nasi lemak" and "rendang" were not tagged as FOOD.
	```
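
The predictions are plain dictionaries, so they are easy to post-process. The sketch below groups spans by label and optionally drops low-confidence ones. It runs on a copy of the output above (scores rounded) rather than on a live model; `group_by_label` and the `0.95` threshold are illustrative assumptions, not part of `span_marker`.

```python
from collections import defaultdict

# Predictions copied from the English example output (scores rounded).
entities = [
    {"span": "Singapore", "label": "LOC", "score": 0.99999, "char_start_index": 0, "char_end_index": 9},
    {"span": "Hainanese chicken rice", "label": "FOOD", "score": 0.98948, "char_start_index": 66, "char_end_index": 88},
    {"span": "laksa", "label": "FOOD", "score": 0.92249, "char_start_index": 93, "char_end_index": 98},
    {"span": "Malaysia", "label": "LOC", "score": 0.99998, "char_start_index": 106, "char_end_index": 114},
]


def group_by_label(predictions, min_score=0.0):
    """Group predicted spans by label, dropping spans scored below min_score."""
    grouped = defaultdict(list)
    for entity in predictions:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


print(group_by_label(entities))
# {'LOC': ['Singapore', 'Malaysia'], 'FOOD': ['Hainanese chicken rice', 'laksa']}

# The 0.95 threshold is an arbitrary illustrative value.
print(group_by_label(entities, min_score=0.95))
# {'LOC': ['Singapore', 'Malaysia'], 'FOOD': ['Hainanese chicken rice']}
```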

	### Quick test on Chinese
	```python
	from span_marker import SpanMarkerModel

	model = SpanMarkerModel.from_pretrained("lxyuan/span-marker-bert-base-multilingual-cased-multinerd")

	# English source sentence, kept for reference.
	description = (
	    "Singapore is renowned for its hawker centers offering dishes "
	    "like Hainanese chicken rice and laksa, while Malaysia boasts dishes such as "
	    "nasi lemak and rendang, reflecting its rich culinary heritage."
	)

	# Chinese translation of the sentence above.
	zh_description = "新加坡因其小贩中心提供海南鸡饭和叻沙等菜肴而闻名, 而马来西亚则拥有椰浆饭和仁当等菜肴，反映了其丰富的烹饪传统."

	entities = model.predict(zh_description)
	print(entities)
	# Output:
	# [
	#  {'span': '新加坡', 'label': 'LOC', 'score': 0.9282007813453674, 'char_start_index': 0, 'char_end_index': 3},
	#  {'span': '马来西亚', 'label': 'LOC', 'score': 0.7439665794372559, 'char_start_index': 27, 'char_end_index': 31}
	# ]
	#
	# Only the two countries (Singapore and Malaysia) were captured.
	# All other entities were missed.
	```
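
The `char_start_index` and `char_end_index` fields are character offsets into the input string, with the end index exclusive, so slicing the input should give back each `span`. The sketch below checks this against the Chinese output above without loading the model; `offsets_match` is a hypothetical helper name.

```python
zh_description = "新加坡因其小贩中心提供海南鸡饭和叻沙等菜肴而闻名, 而马来西亚则拥有椰浆饭和仁当等菜肴，反映了其丰富的烹饪传统."

# Predictions copied from the Chinese example output (scores omitted).
entities = [
    {"span": "新加坡", "label": "LOC", "char_start_index": 0, "char_end_index": 3},
    {"span": "马来西亚", "label": "LOC", "char_start_index": 27, "char_end_index": 31},
]


def offsets_match(text, predictions):
    """Check that every predicted span equals the slice its offsets point to."""
    return all(
        text[e["char_start_index"]:e["char_end_index"]] == e["span"]
        for e in predictions
    )


print(offsets_match(zh_description, entities))  # True
```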


	## Training procedure

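	The results can be reproduced by running the original [training script](https://huggingface.co/tomaarsen/span-marker-mbert-base-multinerd/blob/main/train.py) with the number of epochs set to 3 and `gradient_accumulation_steps` set to 2.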

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 5e-05
	- train_batch_size: 32
	- eval_batch_size: 32
	- seed: 42
	- gradient_accumulation_steps: 2
	- total_train_batch_size: 64
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_ratio: 0.1
	- num_epochs: 3
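
Two of these values follow from the others. The sketch below shows how the total batch size is derived and how a linear schedule with a 0.1 warmup ratio would behave. It rests on three assumptions: training ran on a single device, the final step count in the training results (151308) is the total number of scheduler steps, and the scheduler follows the usual linear warmup followed by linear decay to zero. `linear_lr` is a hypothetical helper, not a library function.

```python
import math

# Values taken from the hyperparameter list.
learning_rate = 5e-05
train_batch_size = 32
gradient_accumulation_steps = 2
warmup_ratio = 0.1

# Assumption: a single device, so the effective batch size is the
# per-step batch size times the number of accumulation steps.
total_train_batch_size = train_batch_size * gradient_accumulation_steps
print(total_train_batch_size)  # 64

# Assumption: the last step in the training results table is the
# total number of scheduler steps.
total_steps = 151308
warmup_steps = math.ceil(total_steps * warmup_ratio)
print(warmup_steps)  # 15131


def linear_lr(step):
    """Linear warmup to the peak rate, then linear decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return learning_rate * max(0.0, remaining)


print(linear_lr(0))             # 0.0
print(linear_lr(warmup_steps))  # 5e-05
print(linear_lr(total_steps))   # 0.0
```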

	### Training results

	| Training Loss | Epoch | Step | Validation Loss | Overall Precision | Overall Recall | Overall F1 | Overall Accuracy |
	|:-------------:|:-----:|:------:|:---------------:|:-----------------:|:--------------:|:----------:|:----------------:|
	| 0.0129 | 1.0 | 50436 | 0.0042 | 0.9226 | 0.9169 | 0.9197 | 0.9837 |
	| 0.0027 | 2.0 | 100873 | 0.0043 | 0.9255 | 0.9206 | 0.9230 | 0.9846 |
	| 0.0015 | 3.0 | 151308 | 0.0049 | 0.9242 | 0.9281 | 0.9261 | 0.9852 |
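
As a quick consistency check, the reported F1 values can be recomputed from the precision and recall columns. This sketch assumes the overall F1 is the harmonic mean of the overall precision and recall; the numbers are copied from the table above.

```python
# (epoch, precision, recall, reported F1) copied from the results table.
rows = [
    (1.0, 0.9226, 0.9169, 0.9197),
    (2.0, 0.9255, 0.9206, 0.9230),
    (3.0, 0.9242, 0.9281, 0.9261),
]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for epoch, precision, recall, reported in rows:
    computed = f1_score(precision, recall)
    print(f"epoch {epoch:.0f}: computed F1 = {computed:.4f}, reported = {reported:.4f}")
```

Under that assumption, the recomputed values agree with the reported ones to four decimal places.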


	### Framework versions

	- Transformers 4.30.2
	- PyTorch 2.0.1+cu117
	- Datasets 2.14.3
	- Tokenizers 0.13.3