Kamarul Adha

Hiraishin

AI & ML interests

None yet

Hiraishin's activity

upvoted 4 papers 14 days ago

Large Malaysian Language Model Based on Mistral for Enhanced Local Language Understanding

Paper • 2401.13565 • Published Jan 24, 2024 • 4

MaLLaM -- Malaysia Large Language Model

Paper • 2401.14680 • Published Jan 26, 2024 • 2

MMMModal -- Multi-Images Multi-Audio Multi-turn Multi-Modal

Paper • 2402.11297 • Published Feb 17, 2024 • 2

Adapting Safe-for-Work Classifier for Malaysian Language Text: Enhancing Alignment in LLM-Ops Framework

Paper • 2407.20729 • Published Jul 30, 2024 • 27

upvoted an article 3 months ago

Efficient LLM Pretraining: Packed Sequences and Masked Attention

Article • Oct 7, 2024 • 14

updated 2 datasets 4 months ago

Knowledge-Innovation-Centre/ESCO-Syn-Data

Updated Oct 16, 2024 • 43

Hiraishin/kotaksakti-demo

Updated Oct 10, 2024 • 34

updated 2 datasets 6 months ago

Hiraishin/RHB-PDF

Viewer • Updated Aug 25, 2024 • 957 • 47

Hiraishin/RHB

Updated Aug 24, 2024 • 51

liked a model 7 months ago

microsoft/Florence-2-large

Image-Text-to-Text • Updated Dec 8, 2024 • 635k • 1.39k

updated 2 datasets 8 months ago

userdatamodel/FUNSD_XFUND

Preview • Updated Jun 27, 2024 • 7

Hiraishin/Devtalk-Embeddings

Viewer • Updated Jun 23, 2024 • 858 • 47

updated 2 models 8 months ago

userdatamodel/Geolayout-Finetuned-FUNSD-BS6-BS24

Updated Jun 18, 2024

Hiraishin/PaliGemma

Updated Jun 3, 2024

updated 3 datasets 9 months ago

reacted to chansung's post with 🔥 9 months ago

Post

4019

🦙🦙 LLaMA Duo project update

Last time, I gave a brief introduction to the LLaMA Duo project with @sayakpaul. It is a simple toolset for aligning an sLLM with a service LLM using a coverage dataset 👉🏻 (https://huggingface.co/posts/chansung/708646454991943).
- A coverage dataset contains what we believe to be the most important/desired (instruction, response) pairs. In systems thinking, each instruction is analogous to a function in traditional programming. There, we write unit tests and measure the coverage % across all features/functions. Similarly, we need to measure what % of the instructions in the coverage dataset our fine-tuned model handles satisfactorily (hence the name coverage dataset).
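
To make the unit-test analogy concrete, here is a minimal sketch of how a coverage percentage could be computed from per-instruction evaluation scores. Everything in it is an assumption for illustration: the `EvalResult` type, the 0-100 score scale, the threshold of 70, and the sample scores are hypothetical and are not taken from LLaMA Duo's actual evaluation tool.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical per-instruction evaluation record (not LLaMA Duo's format)."""
    instruction: str
    similarity: float   # assumed 0-100 scale
    preciseness: float  # assumed 0-100 scale


def coverage_percent(results, threshold=70.0):
    """Return the % of instructions whose scores both meet the threshold.

    This mirrors test coverage: each instruction is treated like a function,
    and it counts as "covered" only if the model handles it satisfactorily.
    """
    if not results:
        return 0.0
    passed = sum(
        1
        for r in results
        if r.similarity >= threshold and r.preciseness >= threshold
    )
    return 100.0 * passed / len(results)


# Made-up scores, purely for illustration.
results = [
    EvalResult("Write a function that reverses a string", 85.0, 90.0),
    EvalResult("Explain what a Python decorator is", 40.0, 55.0),
    EvalResult("Fix the off-by-one error in this loop", 75.0, 72.0),
    EvalResult("Convert this SQL query to pandas", 90.0, 60.0),
]

print(f"coverage: {coverage_percent(results):.1f}%")
```

Requiring both metrics to pass is one possible design choice; averaging the two scores, or reporting each metric separately, would be equally reasonable under different assumptions.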

We have tested it on the "Coding" category of the HuggingFaceH4/no_robots dataset, which has about 300 SFT training data points. After fine-tuning the Gemma 7B model on that data, the result was very poor: LLaMA Duo's evaluation tool reported < 20% on the similarity and preciseness metrics on the test split.

So, we used LLaMA Duo's synthetic data generation tool to generate 60k data points that look similar to the original dataset. We first created ~10k synthetic data points, then created 50k more based on the synthetic dataset itself.

After fine-tuning Gemma 7B on the 60k synthetic dataset, the evaluation result rose to 80~90%. Also, when testing the model in the UI, it tends to give good responses.

It is a good showcase for transitioning from a service LLM to an sLLM, or for keeping a backup sLLM for service LLM failure scenarios. I am going to extend this experiment to all categories of the no_robots dataset. It will generate roughly > 100k data points.

Here are some links:
- LLaMA Duo project repo: https://github.com/deep-diver/llamaduo
- 60k Coding synthetic dataset: chansung/merged_ds_coding
- Fine-tuned Gemma 7B model: chansung/coding_llamaduo_60k_v0.2

updated a model 10 months ago

malaysia-ai/YOLOv8X-DocLayNet-Full-1024-42

Updated Apr 9, 2024 • 1

upvoted a paper 11 months ago

Multi-Lingual Malaysian Embedding: Leveraging Large Language Models for Semantic Representations

Paper • 2402.03053 • Published Feb 5, 2024 • 2