Let me introduce the work I've done over the past three months: Llama-3.2-Taiwan-3B and Llama-3.2-Taiwan-3B-Instruct, now open-sourced on 🤗 Hugging Face.
lianghsun/Llama-3.2-Taiwan-3B: This model is built on top of meta-llama/Llama-3.2-3B with continual pretraining. The training dataset is a mixture of Traditional Chinese and multilingual texts in fixed proportions, including 20B tokens of Traditional Chinese text.
lianghsun/Llama-3.2-Taiwan-3B-Instruct: This is a conversational model fine-tuned from the foundation model above; a quick usage sketch follows below.
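For reference, here is a minimal sketch of loading the Instruct model for chat with 🤗 Transformers; the generation settings are illustrative assumptions, and the base model can be loaded the same way for plain text completion.

```python
# Minimal usage sketch (not an official example): chat with the Instruct model
# via Hugging Face Transformers. Sampling parameters are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lianghsun/Llama-3.2-Taiwan-3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumes a GPU with bf16 support; use float16/float32 otherwise
    device_map="auto",
)

# Llama-3.2-style chat formatting via the tokenizer's chat template.
messages = [
    {"role": "user", "content": "請用繁體中文簡單介紹台灣的夜市文化。"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```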
This Llama-3.2-Taiwan open-source project is currently a one-person effort (yes, I did everything myself, starting from text preparation; so exhausting!). If you're interested, feel free to join the Discord server for discussions.
Benchmarking
The evaluation was conducted with ikala/tmmluplus, though the README does not yet reflect the latest results. Performance is close to that of the previous versions, suggesting that further improvements will likely require adding more specialized knowledge to the training data.
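If you want to run a quick check yourself, below is a rough scoring sketch for one ikala/tmmluplus subject. The subject name and the question/A/B/C/D/answer column layout are assumptions based on the dataset card, and a full evaluation should use a proper harness rather than this loop.

```python
# Rough evaluation sketch. Assumptions: the "accounting" subject config, a "test"
# split, and question/A/B/C/D/answer columns as described on the dataset card.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lianghsun/Llama-3.2-Taiwan-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

ds = load_dataset("ikala/tmmluplus", "accounting", split="test")

correct = 0
for row in ds:
    # Format the multiple-choice question and ask for the option letter only.
    prompt = (
        f"{row['question']}\n"
        f"A. {row['A']}\nB. {row['B']}\nC. {row['C']}\nD. {row['D']}\n"
        "請只回答選項字母："
    )
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=4, do_sample=False)
    reply = tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True).strip()
    correct += int(reply[:1].upper() == str(row["answer"]).strip().upper())

print(f"accuracy: {correct / len(ds):.3f}")
```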