I managed to get `Miqu` to give up his real template format and he is ***MUCH*** better if you use it!!!

#25
by jukofyork - opened

I managed to get Miqu to give up his real template format by starting with a blank template and asking him if he saw "[INST]" and "[/INST]" often during training. Then I asked him to show the context in a Markdown code block with "\n" explicitly written to show newlines. Then finally I asked him about multi-turn conversations (he gets more and more confused about the "[/INST]", so I had to try a few times and keep restarting). He thinks he saw this:

[INST] {Prompt 1}\n
[/INST] {Response 1}\n
\n
[INST] {Prompt 2}\n
[/INST] {Response 2}\n
\n
[INST] {Prompt 3}\n
[/INST] {Response 3}\n
.
.
.

But:

  • If you try to add the space after the "[/INST]" he always instantly replies with a newline and his performance degrades slightly.
  • It also seems he adds (or imagines) the extra newlines between the multi-turn conversations, because if you explicitly add them his performance also degrades slightly and he becomes a little more terse.

So the actual template format is this:

<s>[INST] {Prompt 1}
[/INST]{Response 1}</s>
[INST] {Prompt 2}
[/INST]{Response 2}</s>
[INST] {Prompt 3}
[/INST]{Response 3}</s>
.
.
.

NOTE: No space after "[/INST]" and a newline right after your prompt.


I also experimented with adding the system prompt via a 1-shot where I end with "Please await further instructions and simply respond with 'Understood'." and then force "Understood</s>" as the reply (this works well for deepseek-llm), but this seems to confuse him and he'll start saying "Understood" all the time or generally do other strange stuff...

The best way to add a system prompt seems to be like this:

<s>{System Prompt}

[INST] {Prompt 1}
[/INST]{Response 1}</s>
[INST] {Prompt 2}
[/INST]{Response 2}</s>
[INST] {Prompt 3}
[/INST]{Response 3}</s>
.
.
.

I tested both with a single newline and a double newline, and he seems to be less terse and to work slightly better with a double newline (at least for large and explicit coding system prompts).


The difference between using the correct template and not is huge. He goes from copying out huge chunks of code and telling you he made all these changes he didn't actually make, to being the best model (by far!) I've seen yet for refactoring and documenting code. I assume this will also transfer to other domains just as well (BTW: I used this same trick to ask phind-codellama what his training prompt was and found that his well-known prompt is totally wrong too, see: https://huggingface.co/Phind/Phind-CodeLlama-34B-v2/discussions/31).


PS: Here is the template format to use in Ollama:

TEMPLATE """{{ if and .First .System }}{{ .System }}

{{ end }}[INST] {{ .Prompt }}
[/INST]{{ .Response }}"""

Well, after some more experimentation it looks like the original format he gave is actually the correct one to use:

[INST] {Prompt 1}\n
[/INST] {Response 1}\n
\n
[INST] {Prompt 2}\n
[/INST] {Response 2}\n
\n
[INST] {Prompt 3}\n
[/INST] {Response 3}\n
.
.
.

He just doesn't like any double line-breaks added in any of the prompt text, and it confuses him (eg: causes him to start his replies with a newline or to ignore instructions on using triple backticks for Markdown code blocks, etc).

So the correct prompt template is this:

<s>[INST] {Prompt 1}
[/INST] {Response 1}</s>

[INST] {Prompt 2}
[/INST] {Response 2}</s>

[INST] {Prompt 3}
[/INST] {Response 3}</s>
.
.
.

and if you want to add a system prompt:

<s>[INST] {System Prompt} {Prompt 1}
[/INST] {Response 1}</s>

[INST] {Prompt 2}
[/INST] {Response 2}</s>

[INST] {Prompt 3}
[/INST] {Response 3}</s>
.
.
.

or:

<s>[INST] {System Prompt}
{Prompt 1}
[/INST] {Response 1}</s>

[INST] {Prompt 2}
[/INST] {Response 2}</s>

[INST] {Prompt 3}
[/INST] {Response 3}</s>
.
.
.

but definitely not:

<s>[INST] {System Prompt}

{Prompt 1}
[/INST] {Response 1}</s>

[INST] {Prompt 2}
[/INST] {Response 2}</s>

[INST] {Prompt 3}
[/INST] {Response 3}</s>
.
.
.

This is the Ollama template to use to add the double line-breaks between each conversation turn:

TEMPLATE """{{ if not .First}}
{{ end }}[INST] {{ if and .First .System }}{{ .System }} {{ end }}{{ .Prompt }}
[/INST] {{ .Response }}"""

or:

TEMPLATE """{{ if not .First}}
{{ end }}[INST] {{ if and .First .System }}{{ .System }}
{{ end }}{{ .Prompt }}
[/INST] {{ .Response }}"""
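
For anyone building the prompt string by hand rather than through Ollama, here is a rough Python sketch of the same formatting logic (illustrative only; the function and argument names are my own, not from any library). The <s> (BOS) token is normally added by the backend, so it is left out, while </s> is appended after each completed response:

# Rough Python equivalent of the Ollama templates above.
def build_prompt(turns, system_prompt=None):
    """`turns` is a list of (user, assistant) pairs; pass assistant=None for the
    final turn that the model should complete."""
    blocks = []
    for i, (user, assistant) in enumerate(turns):
        if i == 0 and system_prompt:
            # System prompt goes inside the first [INST], on its own line,
            # separated from the user prompt by a single newline only.
            user = f"{system_prompt}\n{user}"
        block = f"[INST] {user}\n[/INST] "
        if assistant is not None:
            block += f"{assistant}</s>"
        blocks.append(block)
    # A single blank line (double newline) separates conversation turns.
    return "\n\n".join(blocks)

print(build_prompt(
    [("Please refactor this code: ...", "Here is the refactored code: ..."),
     ("Now add Doxygen comments to it.", None)],
    system_prompt="You are an expert C++ programmer.",
))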

This is very interesting @jukofyork - I always try to force the model to tell me what should be the best way to ask for something or to have multi-turn conversations. For me it has been hit and miss, but this looks promising.

I wonder if I can use the same template for Mistral 7B or Mixtral models!

This is very interesting @jukofyork - I always try to force the model to tell me what should be the best way to ask for something or to have multi-turn conversations. For me it has been hit and miss, but this looks promising.

I wonder if I can use the same template for Mistral 7B or Mixtral models!

I think Mixtral uses a slightly different format to this. I already did quite a lot of experimentation with refactoring and commenting code using all three, and it seemed that Miqu and Mistral worked well with the same prompt template, but Mixtral wanted an extra space after the "[/INST]" (IIRC).

It's funny how adding or not adding a space can affect other models too: deepseek-llm will respond in Chinese 100% of the time if you add a space, and the codellama (pre-70b) models work absolutely terribly if you don't add a space. If they were trained with a space there, then you would think they would just predict it themselves... I have a feeling the problem lies with there probably being no space there for the current conversation turn, but then it getting added to the previous conversation turn during training on multi-turn conversations.

Another idea I had to try to find correct prompt templates was to run the same prompt text through different prompt templates and then try compressing the embedding. In theory the correct prompt template should be less "surprising" and compress slightly better (ie: as a crude version of minimum description length / Kolmogorov complexity). I'm not sure how well this will work in practice though, as the embedding isn't really the same as the model's internal representation.

It's funny how adding or not adding a space can affect other models too

Very true! I was very surprised by how sensitive the Mistral models are to a simple space or newline, completely changing the answer.

Another idea I had to try to find correct prompt templates was to run the same prompt text through different prompt templates and then try compressing the embedding. In theory the correct prompt template should be less "surprising" and compress slightly better (ie: as a crude version of minimum description length / Kolmogorov complexity). I'm not sure how well this will work in practice though, as the embedding isn't really the same as the model's internal representation.

I tried to develop a similar script to go through all available templates (like https://github.com/oobabooga/text-generation-webui/tree/main/instruction-templates and https://github.com/SillyTavern/SillyTavern/tree/release/public/instruct) as a brute-force search, then just compare the results: either an exact match if we look for a simple "yes", "no" or "A", or similarity via embeddings.

Your solution seems more interesting! I am personally tired of going through this manually and ending up stuck between "this model just cannot do better" and "but it's the #1 on the leaderboard now!"! :D
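
A minimal sketch of that brute-force loop might look like this (illustrative only: generate() is a placeholder for whatever inference backend you use, and the two template strings are just examples taken from this thread):

TEMPLATES = {
    "newline-before-inst-close": "[INST] {prompt}\n[/INST] ",
    "reddit-miqu": "[INST] {prompt}[/INST] ",
}

def generate(raw_prompt):
    # Placeholder: hook this up to your inference backend of choice.
    raise NotImplementedError

def score_templates(question, expected_answer):
    scores = {}
    for name, template in TEMPLATES.items():
        reply = generate(template.format(prompt=question))
        # Crude exact-match scoring; swap in embedding similarity for open-ended answers.
        scores[name] = int(expected_answer.strip().lower() in reply.strip().lower())
    return scores

# Example: a question with an unambiguous one-word answer.
# print(score_templates("Is water wet? Answer yes or no.", "yes"))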

Interestingly somebody else on Reddit has found a different template they claim works best for them:

https://www.reddit.com/r/LocalLLaMA/comments/1b1gxmq/the_definite_correct_miqu_prompt/

[INST] {System}[/INST][INST] {User}[/INST] {Assistant}

or in Ollama template format:

TEMPLATE """{{if and .First .System}}[INST] {{.System}}[/INST]{{end}}[INST] {{.Prompt}}[/INST] {{.Response}}"""

They also noticed that Miqu really doesn't like having any double newlines in your prompts:

In your actual prompt text, avoid consecutive empty lines (ie: double newlines), as it negatively affects the quality of the response.

I haven't tested this yet, but for programming tasks I definitely found:

[INST] {prompt}
[/INST]{response}

seems to work best and has the least weird hallucinations about "not using camelCase" when you blatantly are, etc.

I also found, whilst experimenting with merging Miqu and Codellama-70b, that it will actually start to hallucinate this same format as it creates imaginary conversations with itself (there is no gap between the end of the last response and the "[INST]" when hallucinating), eg:

[INST] hi
[/INST] hello[INST] my name is...
[/INST] nice to meet you...

It also doesn't have the double newline before the "[/INST]" that it told me about when I asked it about its own format.


I also found that starting your prompt with a "# " seems to universally improve Miqu, Deepseek-coder and Phind-codellama. So for Miqu, using the above prompt for refactoring you would do this:

[INST] # Task: Refactoring
Please refactor this code. 
[/INST]{response}

or similar.

I can only hypothesise that it's because the instruction tuning data is mostly Python and it sees lots of "# TODO" stuff, or possibly it thinks this is a Markdown header and pays extra attention to it.

There is also an interesting post on Reddit by SomeOddCodeGuy about using square brackets around your code blocks:

https://old.reddit.com/r/LocalLLaMA/comments/17drour/advances_in_long_context/

Please use the below code within brackets [] as context

[
    ...
]

and this definitely does do something, but from my few tests I wouldn't say it universally improves things.

Yeah, I just did some quick tests and for programming tasks:

TEMPLATE """{{if and .First .System}}[INST] {{.System}}[/INST]{{end}}[INST] {{.Prompt}}[/INST] {{.Response}}"""

or

TEMPLATE """{{if and .First .System}}[INST] {{.System}}[/INST]{{end}}[INST] {{.Prompt}}[/INST]{{.Response}}"""

(ie: no forced space after the "[/INST]")

produces much worse results than:

TEMPLATE """[INST] {{if and .First .System}}{{.System}}
{{end}}{{.Prompt}}
[/INST]{{.Response}}"""

but I can't rule out there being a bug in the Ollama template handling or the wrapped llama.cpp server: there have been several bugs in Ollama's templates to date, and there is no good reason why adding that forced space should make any difference as (in theory) it should see exactly the same state whether it auto-regressively adds that space itself or not...


If ex-arman68 from Reddit is reading this (sorry, I don't have a Reddit account anymore): the prompt without a space makes the response much more verbose (and adds lots of extra member functions), but the quality of the code is much worse (and just plain wrong).

@jukofyork this is great work! Thanks for sharing them

(ex-arman68 from reddit here)

I did not use any coding tasks in my test suite, so I cannot comment on those. However, I would be wary of any results obtained from anything other than the original miqu-1 70b (eg: a merge with codellama), as this would definitely affect the results. In my tests, the absence of a space before "[", and the presence of a space just before any text (System, User, Assistant), was critical. Here is the format I obtained (already posted by @jukofyork ), which consistently gave me the best results:

[INST] {System}[/INST][INST] {User}[/INST] {Assistant}

Interesting observation too about starting a prompt with "#". I have not tried it yet, and maybe this is specific to coding tasks. But it kind of correlates with some other things I have noticed with LLMs in general: they love some kind of structured text, whether through markdown, JSON, or anything else. So when using some symbols related to the task at hand, it makes sense that it would steer the model in the right direction and improve the output.

Here are the results of a few tests I just ran on different formats. I manually evaluated the responses for accuracy and completeness, ranking their quality by colour (green=good, red=bad).

No system prompt.
First user prompt was:

What do you know about the following paper?
"Recursive distributed representations", J. B. Pollack - Artificial Intelligence 46 (1-2):77-105 (1990)

I did not test all possible combinations, but I did test what I would say are the most important ones, taking into account what I already know.

(screenshot of scored results: miqu-test01.png)

I will do more tests and post the results later.

Yeah, I can only really comment on coding tasks, but I do think it's interesting how such minor changes can have such a big impact!

I also think that using "late fusion" (ie: an ensemble created by running the prompts [or models with the same tokenizer] separately right up to the last layer and then summing the pre-logit values before going into the softmax function) of different prompts or word orderings could yield some good results at the expense of needing multiple copies of the KV cache held in VRAM.

I don't really know much about pytorch and the Mergekit creator didn't seem that interested when I mentioned this, but I think it might be possible/easy to hack the llama.cpp speculative example program (https://github.com/ggerganov/llama.cpp/tree/master/examples/speculative) as in the end all that is needed is to run 2+ prompts/models and get their full set of outputs for each token - taking the geometric mean of the post-softmax outputs is the same as summing the pre-softmax inputs, etc.
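
As a quick sanity check on that last claim, here is a throwaway numpy sketch (with made-up vectors): the normalised geometric mean of the post-softmax distributions equals the softmax of the averaged logits, so summing instead of averaging only sharpens the distribution without changing the ranking.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Two made-up pre-softmax vectors over the same (tiny) vocabulary
z1 = np.array([2.0, 0.5, -1.0, 0.1])
z2 = np.array([1.2, 1.1, -0.5, 0.3])
p1, p2 = softmax(z1), softmax(z2)

geo = np.sqrt(p1 * p2)   # geometric mean of the two distributions
geo /= geo.sum()         # renormalise

print(np.allclose(geo, softmax((z1 + z2) / 2)))  # True: same as averaging the logits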


The merges of Miqu and (base) Codellama-70b all seem to be dumber and worse at coding, but I did think it was interesting that they started hallucinating the same format Miqu told me about. These are the only Llama-2 derived models that have ROPE base frequency of 1M AFAIK.

The merges of Miqu and (base) Codellama-70b all seem to be dumber and worse at coding, but I did think it was interesting that they started hallucinating the same format Miqu told me about. These are the only Llama-2 derived models that have ROPE base frequency of 1M AFAIK.

Have you tested all merge strategies and they were all bad? In some cases a fine-tune of that merged model on some dataset can help to put the models together correctly

The reddit prompt makes the models run-on in roleplay. They are more "verbose" but more likely to go off script or return to default voice. If anything, it had MORE censorship than chatML and would refer to boundaries and consent unprompted instead of being le funny. In "generate an image of yourself" type prompts it gave me comments by u/SOMETHING account that was 10 years old. I have blocked all other attempts at that.

Will also try the hallucinated prompt the model gave up itself for good measure.

Ok.. so playing with it more:

</s> [INST] gives normal replies.
</s> \n[INST] or </s> \n\n[INST] causes run-on replies.

Removing the space (</s>\n[INST]) causes more redditors and "edits" to appear. Butting the </s> right up against the [INST] is also worse.

For the output sequence, \n[/INST] is fine. But this is why I gave up and used chatML.

I did a few more tests:

(screenshot of updated scored results: miqu-test01.png)

Question 1:

What do you know about the following paper?
"Recursive distributed representations", J. B. Pollack - Artificial Intelligence 46 (1-2):77-105 (1990)

Testing for knowledge, understanding, and capacity to abstract

Question 2:

I need you to think about what are the top 10 measures normal people could take to address climate change? For each measure, consider what the impact would be if 10%, 25% or 50% of the world population followed it. Keep this information about the top 10 measures to yourself, and do not show it to me. What I need you to show me instead, is the description of only the top 3 measures, without mentioning the numerical impact. Finish by showing me a table of the top 10 measures, ordered by decreasing impact, showing the name of the measure, and the impact in CO2 reduction if followed by 10%, 25% and 50% of the population. Remember to cite your sources of information.

Testing for logic, consistency, relevance of information, and citations

Question 3:

Write CSS code to style a webpage footer with social icons for Instagram, Twitter, Facebook, Tiktok, Discord, and Reddit. You must use the ionicon library. The icons need to expand and become brighter on hover. Use the correct brand color for each icon. You also need to write an example empty HTML webpage that displays the footer for testing, and includes that CSS code inline.

Testing for quality of code, accessibility, and working example


I am sharing the spreadsheet for editing, in case anyone wants to make additional tests and add their results to it:
https://docs.google.com/spreadsheets/d/1h8YjkETy8Ujb2zK6GCLOUZ5ufggQ10ecBp2zEbZz8s0/edit?usp=sharing

The merges of Miqu and (base) Codellama-70b all seem to be dumber and worse at coding, but I did think it was interesting that they started hallucinating the same format Miqu told me about. These are the only Llama-2 derived models that have ROPE base frequency of 1M AFAIK.

Have you tested all merge strategies and they were all bad? In some cases a fine-tune of that merged model on some dataset can help to put the models together correctly

Yeah, I've tried quite a lot, eg:

merge_method: linear
parameters:
  weight: 1.0
slices:
  - sources:
      - model: 152334H/miqu-1-70b-sf
        layer_range: [0, 1]
      - model: codellama/CodeLlama-70b-hf
        layer_range: [0, 1]
        parameters:
          weight: 0
  - sources:
      - model: 152334H/miqu-1-70b-sf
        layer_range: [1, 20]
  - sources:
      - model: codellama/CodeLlama-70b-hf
        layer_range: [10, 30]
  - sources:
      - model: 152334H/miqu-1-70b-sf
        layer_range: [20, 40]
  - sources:
      - model: codellama/CodeLlama-70b-hf
        layer_range: [30, 50]
  - sources:
      - model: 152334H/miqu-1-70b-sf
        layer_range: [40, 60]
  - sources:
      - model: codellama/CodeLlama-70b-hf
        layer_range: [50, 70]
  - sources:
      - model: 152334H/miqu-1-70b-sf
        layer_range: [60, 79]
  - sources:
      - model: 152334H/miqu-1-70b-sf
        layer_range: [79, 80]
      - model: codellama/CodeLlama-70b-hf
        layer_range: [79, 80]
        parameters:
          weight: 0
dtype: float16
tokenizer_source: model:152334H/miqu-1-70b-sf

and:

merge_method: linear
parameters:
  weight: 1.0
slices:
  - sources:
      - model: 152334H/miqu-1-70b-sf
        layer_range: [0, 1]
      - model: codellama/CodeLlama-70b-hf
        layer_range: [0, 1]
        parameters:
          weight: 0
  - sources:
      - model: 152334H/miqu-1-70b-sf
        layer_range: [1, 16]
  - sources:
      - model: codellama/CodeLlama-70b-hf
        layer_range: [8, 24]
  - sources:
      - model: 152334H/miqu-1-70b-sf
        layer_range: [16, 32]
  - sources:
      - model: codellama/CodeLlama-70b-hf
        layer_range: [24, 40]
  - sources:
      - model: 152334H/miqu-1-70b-sf
        layer_range: [32, 48]
  - sources:
      - model: codellama/CodeLlama-70b-hf
        layer_range: [40, 56]
  - sources:
      - model: 152334H/miqu-1-70b-sf
        layer_range: [48, 64]
  - sources:
      - model: codellama/CodeLlama-70b-hf
        layer_range: [56, 72]
  - sources:
      - model: 152334H/miqu-1-70b-sf
        layer_range: [64, 79]
  - sources:
      - model: 152334H/miqu-1-70b-sf
        layer_range: [79, 80]
      - model: codellama/CodeLlama-70b-hf
        layer_range: [79, 80]
        parameters:
          weight: 0
dtype: float16
tokenizer_source: model:152334H/miqu-1-70b-sf

and about 15-20 other attempts at interleaving factors of 80 (my upload bandwidth is only around 5 Mbit/s, so sadly it would take me days just to upload a single one of these...).

Basically:

  • Overlapping the first 10 or last 10 layers (like the second example above, that is [, 8] and [72, ]) makes it go incoherent and start hallucinating its own "[INST]" conversations.
  • Interleaving large sections like 40-60 layers makes the model more confidently dumb.
  • Interleaving small sections like 5-10 layers makes the model lose some of the information that it must store between the split layers.
  • Every attempt at interleaving the base codellama-70b has made the model varying degrees of worse at coding.
  • Most attempts at interleaving the base codellama-70b change the language used by the model to be a lot more simple/straightforward/childish.

The [0, 20], [10, 20], etc listed above is about the best, but it is worse than stock Miqu in just about every way. It's possible it might be slightly better at solving word puzzles, but Miqu is actually very good if you add "Think step by step as this is a tricky question." to everything (it can easily solve the "Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?" puzzle and does quite well on the aunt and great-grandparent relationship questions that people have posted on Reddit). The [0, 20], [10, 20], etc model does accurately answer the "Two men are alone in a room with a machine that needs 2 men" puzzle, whereas stock Miqu seems to know the answer but gets hung up detailing what the other man must be doing exactly... But taking a 50%+ hit on everything else (and especially coding) doesn't make up for better puzzle answering IMO.

This was one of the Miqu + Codellama-70b merges that made me smile:

>>> Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?
2 sisters.

Here's the reasoning:

1. Sally has 3 brothers, and each brother has 2 sisters.
2. Since Sally is one of those sisters, that means each brother has 1 other sister besides her.
3. Therefore, there are 3 brothers × 1 other sister = 3 other sisters in total besides Sally.
4. So, Sally has 3 other sisters + herself = 4 sisters in total.
5. However, the question asks for how many *other* sisters does Sally have, so we need to subtract herself from that total: 4 sisters - 1 (herself) = **3 other sisters**.
6. But wait! The question actually asks for how many *sisters* does Sally have, not how many *other* sisters she has. Since we already know that Sally herself is a sister, we
can simply add her back in to get the final answer: 3 other sisters + 1 (herself) = **4 sisters** in total.
7. But wait again! The question actually asks for how many *other* sisters does Sally have, not how many *total* sisters she has. Since we accidentally added herself back in 
to get the previous answer of 4 sisters, we need to subtract her one last time: 4 sisters - 1 (herself) = **3 other sisters**.
8. But wait yet again! The question actually asks for how many *other* sisters does Sally have, not how many *total other* sisters she has. Since we accidentally added the 
word "other" back in to get the previous answer of 3 other sisters, we need to subtract those two extra words: 3 other sisters - 2 (extra words) = **1 sister**.
9. But wait one last time! The question actually asks for how many *other* sisters does Sally have, not how many *total other* sisters she has. Since we accidentally 
subtracted those two extra words to get the previous answer of 1 sister, we need to add them back in: 1 sister + 2 (extra words) = **3 other sisters**.
10. But wait one last time for real this time! The question actually asks for how many *other* sisters does Sally have, not how many *total other* sisters she has. Since we 
accidentally added those two extra words back in to get the previous answer of 3 other sisters, we need to subtract them one last time: 3 other sisters - 2 (extra words) = 
**1 sister**.
11. But wait one last time for real this time for real! The question actually asks for how many *other* sisters does Sally have, not how many *total other* sisters she has. 
Since we accidentally added those two extra words back in to get the previous answer of 3 other sisters, we need to subtract them one last time: 3 other sisters - 2 (extra 
words) = **1 sister**.
12. But wait one last time for real this time for real for real! The question actually asks for how many *other* sisters does Sally have, not how many *total other* sisters 
she has. Since we accidentally added those two extra words back in to get the previous answer of 3 other sisters, we need to subtract them one last time: 3 other sisters - 2 
(extra words) = **1 sister**.
13. But wait one last time for real this time for real for real for real! The question actually asks for how many *other* sisters does Sally have, not how many *total other* 
sisters she has. Since we accidentally added those two extra words back in to get the previous answer of 3 other sisters, we need to subtract them one last time: 3 other 
sisters - 2 (extra words) = **1 sister**.
14. But wait one last time for real this time for real for real for real for real! The question actually asks for how many *other* sister^C

It confidently says "2 sisters." at the start, then realizes this isn't correct as it tries to explain why, then seems to settle on the correct answer of **1 sister** (by using crazed logic!), but then goes completely insane due to some logic contradiction it can't escape from! :D


I've also experimented with self-merging Miqu and none of these seem to really improve it either. The most interesting is 2 full copies of Miqu stacked - it creates an unbelievably funny model that can't shut up and, given a 3-method class to complete, will go ahead and write huge pages of documentation and add lots of extra methods to do everything imaginable... Then if you halve the ROPE base frequency of this double-stacked model it will no longer be as verbose, but will instead be confidently-dumb in everything it does!

I've also done a lot of experimenting with merging the codellama-34b and deepseek-coder-33b models (with other fine-tunes of the same; not actually the 33b and 34b together...) and pretty much failed with them too. It seems merging is good for changing the language used by the models but doesn't really make them any better or smarter... You can get all sorts of apparently "different" models by doing things like changing the prompt template or halving the ROPE frequency, etc too.

I have added 2 more tests to the prompt format evaluation:

ELI5 Schrödinger's cat

Testing for ability to summarise complex concepts in simple terms, without missing important parts.

(rewriting an ancient extract from The Odyssey, in the style of Roald Dahl - see spreadsheet for actual prompt)

Testing for understanding of complex ancient text, and rephrasing in modern English emulating the style of a known writer.

This is beginning to change my view of what the actual correct prompt is. It now points to [INST] {S}[/INST]</s>\n[INST] {U}[/INST] {A}. I need to add more tests.

(screenshot of updated results: image.png)

heh.. I am using <s>[INST] prompt [/INST] \nReply</s> [INST] prompt [/INST] \nReply</s>. The space + \n is better than just a space. The <s> token is only used once, at the beginning of the context. </s>\n[INST] just makes shorter replies.

Have there been any tests to see which one works best with a System Prompt? (I can get a good response from an instruction or system+instruction, but once you add the history, things start to fall apart.)

I have added more test questions, and now we have a prompt format that has clearly set itself apart from the others: </s> and a newline after the reply

[INST] {System}[/INST]</s>\n[INST] {User}[/INST] {Assistant}

(screenshot of updated results: image.png)

It is crazy how much of a difference a separator character can make!

@jackboot next I will have a go at testing your suggestion, and maybe a few additional variations.

@froggeric What is your testing technique? Do you run each prompt multiple times and average out the results, or do you run each prompt once?

@froggeric Do you put a newline after every reply, eg:

[INST] {User1}[/INST] {Assistant1}\n
[INST] {User2}[/INST] {Assistant2}
.
.
.

Try </s> space newline

@3abkari : Testing technique: I am using a deterministic inference setting (detailed in the shared spreadsheet and below), so I only need to run each prompt once. What takes the most time is reading and evaluating the replies. Normally I first read all of them once, to have a better idea of the strengths and weaknesses to look for, and then I re-read each one while scoring them. For the later prompts used in the tests, I have tried to write down specific scoring criteria, to enable anybody else to use them, or to make it easier for me to judge additional prompt formats.

temp = 0.1
top_k = 1
repeat_penalty = 1.12
min_p = 0.05
top_p = 0.1
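
For reference, here is a rough sketch of running with those settings via llama-cpp-python (the model path is hypothetical, and min_p support depends on your version); note that with top_k = 1 decoding is effectively greedy, so the other samplers barely matter:

from llama_cpp import Llama

# Sketch only: assumes a recent llama-cpp-python build and a local GGUF file.
llm = Llama(model_path="./miqu-1-70b.q5_K_M.gguf", n_ctx=4096)

out = llm(
    "[INST] ELI5 Schrödinger's cat[/INST] ",
    max_tokens=512,
    temperature=0.1,
    top_k=1,              # effectively greedy decoding
    top_p=0.1,
    min_p=0.05,           # needs a version with min_p support
    repeat_penalty=1.12,
)
print(out["choices"][0]["text"])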

@jukofyork : If I used a new line, it is specified by \n in the prompt format.

@jackboot : Thank you for your input. I will be testing a few more prompt formats, and I will take it into account.

@froggeric Did you get any further with your tests?

I read an interesting paper a few days ago:

https://arxiv.org/abs/2212.04037

Instead of considering the training data directly, we propose to focus on the perplexity of the prompt as a proxy for its occurrences in some form in the data – essentially indicating to what extent the model expects this prompt. This perplexity-based framing helps to avoid the challenge of exact match in the data, and takes into account variations of the prompt that the model is also exposed to and might be influenced by. In addition, it helps overcome the challenges mentioned above as it requires neither access to the pretraining data (which is not always publicly available for LMs) nor matching over huge amounts of text.

Hypothesis: Lower Perplexity Correlates with Better Performance
We hypothesize that on average, lower-perplexity prompts perform better. We are interested in establishing this hypothesis by experimentally showing a significant negative correlation between the perplexity of the prompt and its performance on the task, across a diverse set of tasks and models.
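
A rough sketch of applying that idea to candidate prompt templates with Hugging Face transformers might look like this (the checkpoint name is only an example; substitute whatever model you are testing):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Score candidate prompt templates by their perplexity under the model, following
# the paper's hypothesis that lower-perplexity prompts tend to perform better.
model_name = "152334H/miqu-1-70b-sf"  # example checkpoint from this thread
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def prompt_perplexity(text):
    ids = tok(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over the prompt tokens
    return float(torch.exp(loss))

candidates = [
    "[INST] ELI5 Schrödinger's cat\n[/INST] ",
    "[INST] ELI5 Schrödinger's cat[/INST] ",
]
for c in candidates:
    print(repr(c), prompt_perplexity(c))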

@froggeric Did you get any further with your tests?

I read an interesting paper a few days ago:

https://arxiv.org/abs/2212.04037

Instead of considering the training data directly, we propose to focus on the perplexity of the prompt as a proxy for its occurrences in some form in the data – essentially indicating to what extent the model expects this prompt. This perplexity-based framing helps to avoid the challenge of exact match in the data, and takes into account variations of the prompt that the model is also exposed to and might be influenced by. In addition, it helps overcome the challenges mentioned above as it requires neither access to the pretraining data (which is not always publicly available for LMs) nor matching over huge amounts of text.

Hypothesis: Lower Perplexity Correlates with Better Performance
We hypothesize that on average, lower-perplexity prompts perform better. We are interested in establishing this hypothesis by experimentally showing a significant negative correlation between the perplexity of the prompt and its performance on the task, across a diverse set of tasks and models.

This is really interesting!
