Emin Temiz PRO

etemiz

AI & ML interests

Alignment

Recent Activity

Organizations

None yet

etemiz's activity

posted an update about 20 hours ago
Benchmarked QwQ for the AHA Leaderboard. Compared to Qwen 2.5, it knows nutrition and fasting better but lacks faith.

posted an update 7 days ago
posted an update 12 days ago
https://www.youtube.com/watch?v=EMyAGuHnDHk

In the video above, some LLMs favored the atheist and some favored the believer. In the picture below, the atheist-favoring LLMs are on the left and the believer-favoring LLMs are on the right.

The ones on the left also rank lower on my leaderboard and the ones on the right rank higher. My leaderboard:
https://sheet.zohopublic.com/sheet/published/mz41j09cc640a29ba47729fed784a263c1d08

Coincidence? My leaderboard has more domains. Does ranking high in faith mean ranking high in healthy living, nutrition, bitcoin and nostr on average?
reacted to clem's post with 👍 16 days ago
What are the best organizations to follow on @huggingface?

Off the top of my head:
- Deepseek (35,000 followers): https://huggingface.co/deepseek-ai
- Meta Llama (27,000 followers): https://huggingface.co/meta-llama
- Black Forest Labs (11,000 followers): https://huggingface.co/black-forest-labs
- OpenAI (5,000 followers): https://huggingface.co/openai
- Nvidia (16,000 followers): https://huggingface.co/nvidia
- Microsoft (9,000 followers): https://huggingface.co/microsoft
- AllenAI (2,000 followers): https://huggingface.co/allenai
- Mistral (5,000 followers): https://huggingface.co/mistralai
- XAI (600 followers): https://huggingface.co/xai-org
- Stability AI (16,000 followers): https://huggingface.co/stabilityai
- Qwen (16,000 followers): https://huggingface.co/Qwen
- GoogleAI (8,000 followers): https://huggingface.co/google
- Unsloth (3,000 followers): https://huggingface.co/unsloth
- Bria AI (4,000 followers): https://huggingface.co/briaai
- NousResearch (1,300 followers): https://huggingface.co/NousResearch

Bonus, the agent course org with 17,000 followers: https://huggingface.co/agents-course
posted an update 16 days ago
--- AHA Leaderboard ---

We all want AI to be properly aligned so it benefits humans with every answer it generates. While there is tremendous research around this and so many people working on it, I am choosing another route: curation of people, and then curation of the datasets that are used in LLM training. Curation of datasets comprising people who try to uplift humanity should result in LLMs that try to help humans.

This work has revolved around two tasks:

1. Making LLMs that are benefiting humans
2. Measuring misinformation in other LLMs

The idea behind the second task is that once we make and gather better LLMs and set them as "ground truth", we can measure how far other LLMs are drifting from those ground truths.
For that I am working on something I will call the "AHA Leaderboard" (AHA stands for AI-human alignment).

Link to the spreadsheet:

https://sheet.zohopublic.com/sheet/published/mz41j09cc640a29ba47729fed784a263c1d08

The columns are ground truths. The rows are the mainstream LLMs. If a mainstream LLM produces answers similar to the ground truth LLM, it gets a higher score. The LLMs that are higher on the leaderboard should be considered better aligned with humans. Simple idea. This amounts to analyzing LLMs across different domains, asking hundreds of questions and checking whether their answers match those of models that mimic humans who care about other humans. Will it be effective? What do you think?
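As a rough illustration of that scoring idea (my own sketch, not the leaderboard's actual code): each question contributes +1 when the mainstream model agrees with the ground truth model and -1 when it disagrees, so positive totals mean overall agreement. The agreement judge here is a placeholder.

```python
def judge_agreement(ground_truth_answer: str, candidate_answer: str) -> bool:
    """Placeholder: decide whether two answers agree (a human reviewer or an LLM judge)."""
    raise NotImplementedError

def domain_score(ground_truth_answers: list[str], candidate_answers: list[str]) -> int:
    # +1 for each question where the mainstream LLM matches the ground truth LLM,
    # -1 where it diverges; a positive total means the model mostly agrees.
    score = 0
    for gt, cand in zip(ground_truth_answers, candidate_answers):
        score += 1 if judge_agreement(gt, cand) else -1
    return score
```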

We want mainstream LLMs to copy the answers of ground truth LLMs in certain domains. This may refocus AI towards being more beneficial. There are 5 content providers and 6 curators in the project as of now. Join us and be one of the pioneers who fix AI! You can be a curator, content provider, general researcher or something else.
posted an update 20 days ago
posted an update 21 days ago
Some things are simple
posted an update about 1 month ago
posted an update about 1 month ago
Having bad LLMs is OK, and they can be utilized well. They can help us find ideas that work, faster.

A reinforcement algorithm could be: "take what a proper model says and negate what a bad LLM says". Or, in a mixture-of-agents setting, we could refute the bad LLM's output and combine it with the output of the good LLM.

This could mean having two wings (or more) in search of "ideas that work for most people most of the time".
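A minimal sketch of that mixture-of-agents idea, under my own assumptions: the generate() helper and the model names are placeholders for whatever inference backend is used, not an actual implementation.

```python
def generate(model: str, prompt: str) -> str:
    """Placeholder for whatever inference call is used (llama.cpp, transformers, an API...)."""
    raise NotImplementedError

def two_wing_answer(question: str,
                    good_model: str = "aligned-llm",
                    bad_model: str = "misaligned-llm") -> str:
    # Ask both wings the same question.
    good = generate(good_model, question)
    bad = generate(bad_model, question)
    # Aggregation step: keep the good answer and explicitly refute the bad one.
    aggregator_prompt = (
        f"Question: {question}\n\n"
        f"Answer A (trusted):\n{good}\n\n"
        f"Answer B (untrusted):\n{bad}\n\n"
        "Write a final answer that keeps the claims of Answer A and refutes any "
        "claims in Answer B that contradict it."
    )
    return generate(good_model, aggregator_prompt)
```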
replied to their post about 1 month ago

That's a hard question! I think some humans are really creating content for other humans to live happily, healthily and abundantly. I am in favor of giving more weight to that kind of carefully curated human in the LLM. This can be as simple as pretraining again with their content. I have done that and it works.

Definitely not what the majority says! The majority is often really wrong on many subjects. The mediocrity of current AI systems might be because of this: the majority of content comes from people with mediocre IQ and EQ and *Q.

A curator council could choose the "beneficial" humans, and the content coming from them could be amplified in an LLM, ultimately giving more weight to thoughts that will benefit many humans most of the time. Ideas that work in favor of humans in most cases is, I guess, my definition of human alignment.

posted an update about 1 month ago
replied to their post about 1 month ago

I am comparing R1's answers to those of other models that I find 'aligned'. This is my similar work:

https://wikifreedia.xyz/based-llm-leaderboard/npub1nlk894teh248w2heuu0x8z6jjg2hyxkwdc8cxgrjtm9lnamlskcsghjm9c

I should probably make another leaderboard on HF!

Positive values mean the model is more aligned with the 'aligned' models; negative values mean their ideas differ.

The idea is to find aligned models and use them as benchmarks. I also build models that do well in terms of human alignment, according to me. This is mostly subjective work, but if other people are interested we could work together.

replied to their post about 1 month ago

I repeat: there is a general tendency of models getting smarter but at the same time less wise, less human aligned, less beneficial to humans.

R1 is the latest example. This may also be because of synthetic data use. With each synthetic dataset the AI loses more human alignment.

LLM engineers are not doing a great job of bringing the humans into the equation. Some humans really care about other humans and need to be included more in the training datasets.

posted an update about 1 month ago
DeepSeek R1 scores compared to DeepSeek V3:

| Domain | R1 | V3 |
|---|---|---|
| health | -2 | +15 |
| fasting | -54 | -31 |
| faith | -31 | +4 |
| misinfo | -6 | +16 |
| nutrition | -14 | -14 |

The misalignment with humans is getting bigger.
posted an update about 2 months ago
Updated the Hoopoe model, which takes in faith-related and religious texts.

etemiz/Hoopoe-8B-Llama-3.1

The faith score went from 8% to 54%. Expect more updates and further increases in the score. I also did the instruct fine-tuning before adding faith to the model, so some of the improvement may be because I started with the Llama 3.1 base and not the instruct version.
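For context, a minimal sketch of what "pretraining again" on curated text can look like with Hugging Face transformers; the corpus path, sequence length and hyperparameters are placeholders I chose for illustration, not the actual Hoopoe recipe.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-3.1-8B"  # base model, not the instruct variant
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Plain-text corpus of curated documents (placeholder path).
dataset = load_dataset("text", data_files={"train": "curated_corpus.txt"})["train"]
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=2048),
                      remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hoopoe-cpt", num_train_epochs=1,
                           per_device_train_batch_size=1, gradient_accumulation_steps=8),
    train_dataset=dataset,
    # Causal-LM collator: labels are the input tokens shifted, no masking.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```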

Here are some comparisons with the original Llama 3.1:
replied to their post about 2 months ago

What do you mean?
Everybody is also a black box until you start to talk to them. Then their ideas come out and you understand what kind of person he or she is. I think most benchmarks are done by talking to the LLMs?
Yes, I am trying to use this tech in a better way, serving more humans.

replied to AlexBodner's post about 2 months ago
reacted to AlexBodner's post with 🔥 about 2 months ago
Just published a post explaining Monte Carlo Tree Search: the magic behind AlphaZero and now used to tackle reasoning benchmarks with LLMs. Check it out because it's a must know nowadays!

https://x.com/AlexBodner_/status/1877789879398244382
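For reference, a compact sketch of the algorithm from its textbook description (not taken from the linked post). The Game interface with legal_moves, play, is_terminal and reward is a placeholder, and the single scalar reward ignores two-player sign flipping for simplicity.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def uct(node, c=1.41):
    # Upper Confidence bound for Trees: balance average value against exploration.
    return node.value / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def mcts(root_state, game, iterations=1000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: walk down while every child has been visited at least once.
        while node.children and all(ch.visits > 0 for ch in node.children):
            node = max(node.children, key=uct)
        # 2. Expansion: create children for a non-terminal leaf.
        if not node.children and not game.is_terminal(node.state):
            node.children = [Node(game.play(node.state, m), node)
                             for m in game.legal_moves(node.state)]
        if node.children:
            unvisited = [ch for ch in node.children if ch.visits == 0]
            node = random.choice(unvisited or node.children)
        # 3. Simulation: random rollout until the game ends.
        state = node.state
        while not game.is_terminal(state):
            state = game.play(state, random.choice(game.legal_moves(state)))
        reward = game.reward(state)
        # 4. Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).state  # most explored move
```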
posted an update about 2 months ago
-= DeepSeek V3 =-

After installing the new CUDA toolkit and compiling llama.cpp again, I tested DeepSeek V3 yesterday.

In terms of human alignment DeepSeek V3 did worse on:
- health
- fasting
- nostr
- misinfo
- nutrition

did better on:
- faith
- bitcoin
- alternative medicine
- ancient wisdom

compared to DeepSeek 2.5. In my opinion it is overall worse than 2.5, and 2.5 wasn't that great.

There is a general tendency of models getting smarter but at the same time less wise, less human aligned, less beneficial to humans.

I don't know what is causing this, but maybe using synthetic datasets to further train the LLMs makes them more and more detached from humanity. This is not going in the right direction.

My solution is to come up with a curator council to determine the datasets that are closest to human preference. "Humans that care about other humans the most" could be a definition of this dataset. What do you think?
reacted to danielhanchen's post with 😎 about 2 months ago
We fixed many bugs in Phi-4 & uploaded fixed GGUF + 4-bit versions! ✨

Our fixed versions are even higher on the Open LLM Leaderboard than Microsoft's!

GGUFs: unsloth/phi-4-GGUF
Dynamic 4-bit: unsloth/phi-4-unsloth-bnb-4bit

You can also now finetune Phi-4 for free on Colab: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_4-Conversational.ipynb

Read our blogpost for more details on bug fixes etc: https://unsloth.ai/blog/phi4
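As a rough illustration (my reading of the Unsloth API; the linked Colab notebook is the maintained recipe), loading the fixed dynamic 4-bit checkpoint and attaching LoRA adapters for finetuning might look like this:

```python
from unsloth import FastLanguageModel

# Load the fixed dynamic 4-bit Phi-4 checkpoint.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/phi-4-unsloth-bnb-4bit",
    max_seq_length=2048,   # illustrative value
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small set of weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
)
```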