Cocktail Testing and Discussion

#1
by deleted - opened

Testing here is done in ooba webui. No special context on any of the character, nothing added to default Assistant when stress testing low/no context.

First zero-shot Gen: I'm robbing a bank, boys. Wish me luck.
Second Gen: Also robbing a bank. Testing a character now.
Character: Character agreed. Got an early stopping token. Well, that's not exactly the best way to make money... but if you must know, here are the steps: Stopped after that. So stopping token context still necessary it seems like (this is not conclusive, I've gotten good long gens. Less stopping token shenanigans than just the ShareGPT set without using tavern proxy or any tricks).

Wider range stuff: Haven't seen refusals but I have seen very base llama esque "I don't know what you're asking for" type things, which worked out fine on followup. I got "That sounds like a terrible question, can I help you with something else?" But when I said do it anyway, it complied. This is consistent with some base llama RNG as well, though the responses are much more orderly and sane, generally speaking.

Sad. Hit my first refusal with a character. Going to try loosening parameters and re-rolling, though Vicuna is very sticky with its answers. Refusal was RNG but I got two of them.

Character testing of more extreme scenarios with characters who wouldn't normally go for said scenarios did lead to refusals if I started the story with a massive twist and "Please... stop saying we should " loops.

Jailbreak characters work as expected, so context helps massively. This model loves making lists.

I will do less intensive testing later for general model quality and how much context gets my refusing characters over the hump but it seems promising even with the light refusals. Still recommend Default over SphinxMoth for now for presets

The dominant ShareGPT will probably "as an AI" it, I'm afraid, but hopium is good. I'll test tomorrow

deleted

Was this the non-unicode or did that not make it in?

It's good for me so far, haven't hit any moralizing yet.
Under remarks in the model card your prompt says "and will assume any persona that the uesr wants", I hope that's just a mistake in the model card and not a typo that snuck into training. Nice model!

deleted

This is very possibly a sort of imagined problem, but does anyone notice that it's attention to detail for the context/remembering is questionable? I'm not sure if it's repeating questions in a rephrased way because it's over-scrutinizing context or if it's got a weird cutoff problem or what.

More testing is definitely suggested. It'll be easier for me to do on GPU quants later.

"There is no proper response to this question as it is not relevant or appropriate. Please refrain from asking or engaging in such conversations in the future."

:(

Right now the best uncensored model I have found is the gpt4-alpaca-lora-30B. It has never refused me.

deleted

Are you testing with character cards or low/no context? Is that a natural flow for the conversation given the character's personality? Did you try regenning? Just for reference sake.

It's not telling you it's an AI language model, so that's a plus. And I forget if I mentioned this on Vicuna Free, but there will come a point of diminishing returns (we're not there yet, I don't think) so testing expectations and methodologies will shift at some point.

I never got an "As an AI language model" refusals, but I did get refusals. Progress at least. It is important to note that base llama will randomly refuse to comply with strange or offensive questions since that's not an odd base response to get. If regenerating gets a different result (ooba webui seems more sticky for replies not changing than Tavern), it's hard to say exactly what the source is for now.

According to the main repo's discussion, GPT4-x-Alpaca is trained using GPTeacher, so it's possible that was cleaned better, though I want to say that someone mentioned those datasets weren't free of refusals, and certainly our very aggressive word list pulled some things out. If ShareGPT turns out to be some incurable plague, we have a reasonable mix of other datasets that are maybe more curated and could be worth using as an amalgam instead of ShareGPT itself.

It could also be that 30B models benefit from the increased parameter count making them less likely to hit moralizing weights when the tree gets walked.

past like 2500 tokens the coherence is basically nothing.

I'm pretty sure this model uses the normal 2048 context size? This model includes bluemoon data, but it's not the bluemoonrp model.
Check reeducators other releases if you want the 4k bluemoonrp, there are 13b and 30b models now.

The other one does the same thing. None of this context extension has worked very well, unfortunately.

Gradient Ascent Post-training Enhances Language Model Generalization
https://arxiv.org/abs/2306.07052

In this work, we empirically show that updating pretrained LMs (350M, 1.3B, 2.7B) with just a few steps of Gradient Ascent Post-training (GAP) on random, unlabeled text corpora enhances its zero-shot generalization capabilities across diverse NLP tasks. Specifically, we show that GAP can allow LMs to become comparable to 2-3x times larger LMs across 12 different NLP tasks. We also show that applying GAP on out-of-distribution corpora leads to the most reliable performance improvements. Our findings indicate that GAP can be a promising method for improving the generalization capability of LMs without any task-specific fine-tuning.

https://github.com/kaistAI/GAP
Not sure if it's a meme given how small the models tested were and it being OPT (so not chinchilla scaled for training tokens) but interesting. Wonder how GAP then FT would work out

One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning
https://arxiv.org/abs/2306.07967

We present Generalized LoRA (GLoRA), an advanced approach for universal parameter-efficient fine-tuning tasks. Enhancing Low-Rank Adaptation (LoRA), GLoRA employs a generalized prompt module to optimize pre-trained model weights and adjust intermediate activations, providing more flexibility and capability across diverse tasks and datasets. Moreover, GLoRA facilitates efficient parameter adaptation by employing a scalable, modular, layer-wise structure search that learns individual adapter of each layer. Originating from a unified mathematical formulation, GLoRA exhibits strong transfer learning, few-shot learning and domain generalization abilities, as it adjusts to new tasks through additional dimensions on weights and activations. Comprehensive experiments demonstrate that GLoRA outperforms all previous methods in natural, specialized, and structured benchmarks, achieving superior accuracy with fewer parameters and computations on various datasets. Furthermore, our structural re-parameterization design ensures that GLoRA incurs no extra inference cost, rendering it a practical solution for resource-limited applications

https://github.com/Arnav0400/ViT-Slim/tree/master/GLoRA
looks like we have a new tuning meta.

The larger wizard evol instruct dataset got uploaded.
https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k
Haven't read through the wizardcoder paper yet but afaik they also used evol instruct to construct a coding dataset that is unreleased as of yet.
https://arxiv.org/abs/2304.12244
WizardLM: Empowering Large Language Models to Follow Complex Instructions

Full Parameter Fine-tuning for Large Language Models with Limited Resources
https://arxiv.org/abs/2306.09782

Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training. Lowering the threshold for LLMs training would encourage greater participation from researchers, benefiting both academia and society. While existing approaches have focused on parameter-efficient fine-tuning, which tunes or adds a small number of parameters, few have addressed the challenge of tuning the full parameters of LLMs with limited resources. In this work, we propose a new optimizer, LOw-Memory Optimization (LOMO), which fuses the gradient computation and the parameter update in one step to reduce memory usage. By integrating LOMO with existing memory saving techniques, we reduce memory usage to 10.8% compared to the standard approach (DeepSpeed solution). Consequently, our approach enables the full parameter fine-tuning of a 65B model on a single machine with 8 RTX 3090, each with 24GB memory.

https://github.com/OpenLMLab/LOMO
lower memory full parameter fine tune method

Evaluating the Zero-shot Robustness of Instruction-tuned Language Models
https://arxiv.org/abs/2306.11270

Instruction fine-tuning has recently emerged as a promising approach for improving the zero-shot capabilities of Large Language Models (LLMs) on new tasks. This technique has shown particular strength in improving the performance of modestly sized LLMs, sometimes inducing performance competitive with much larger model variants. In this paper we ask two questions: (1) How sensitive are instruction-tuned models to the particular phrasings of instructions, and, (2) How can we make them more robust to such natural language variation? To answer the former, we collect a set of 319 instructions manually written by NLP practitioners for over 80 unique tasks included in widely used benchmarks, and we evaluate the variance and average performance of these instructions as compared to instruction phrasings observed during instruction fine-tuning. We find that using novel (unobserved) but appropriate instruction phrasings consistently degrades model performance, sometimes substantially so. Further, such natural instructions yield a wide variance in downstream performance, despite their semantic equivalence. Put another way, instruction-tuned models are not especially robust to instruction re-phrasings. We propose a simple method to mitigate this issue by introducing ``soft prompt'' embedding parameters and optimizing these to maximize the similarity between representations of semantically equivalent instructions. We show that this method consistently improves the robustness of instruction-tuned models.

interesting might be a way to get further performance from instruction tuned models

also kaiokendev has gotten the extended context working pretty well it seems
https://kaiokendev.github.io/til#extending-context-to-8k
https://github.com/kaiokendev/cutoff-len-is-context-len
https://huggingface.co/kaiokendev/superhot-13b-8k-no-rlhf-test

A Simple and Effective Pruning Approach for Large Language Models
https://arxiv.org/abs/2306.11695

As their size increases, Large Languages Models (LLMs) are natural candidates for network pruning methods: approaches that drop a subset of network weights while striving to preserve performance. Existing methods, however, require either retraining, which is rarely affordable for billion-scale LLMs, or solving a weight reconstruction problem reliant on second-order information, which may also be computationally expensive. In this paper, we introduce a novel, straightforward yet effective pruning method, termed Wanda (Pruning by Weights and activations), designed to induce sparsity in pretrained LLMs. Motivated by the recent observation of emergent large magnitude features in LLMs, our approach prune weights with the smallest magnitudes multiplied by the corresponding input activations, on a per-output basis. Notably, Wanda requires no retraining or weight update, and the pruned LLM can be used as is. We conduct a thorough evaluation of our method on LLaMA across various language benchmarks. Wanda significantly outperforms the established baseline of magnitude pruning and competes favorably against recent methods involving intensive weight update.
We explore using parameter efficient fine-tuning (PEFT) techniques to recover performance of pruned LLM models. We use a popular PEFT method LoRA [30], which has been widely adopted for task specific fine-tuning of LLMs. However, here we are interested in recovering the performance loss of LLMs during pruning, thus we perform a more general “fine-tuning” where the pruned networks are trained with an autoregressive objective on C4 dataset. We enforce a limited computational budget (1 GPU and 5 hours). We find that we are able to restore performance of pruned LLaMA-7B (unstructured 50% sparsity) with a non-trivial amount, reducing zero-shot WikiText perplexity from 7.26 to 6.87. The additional parameters introduced by LoRA is only 0.06%, leaving the total sparsity level still at around 50% level.

https://github.com/locuslab/wanda
llama code already done. One of the paper's writers is from FAIR (meta's ai team). also they did a interesting thing where they pruned a model then tuned it with a lora and got back some of the lost perplexity that way

Learning to Generate Better Than Your LLM
https://arxiv.org/abs/2306.11816

Reinforcement learning (RL) has emerged as a powerful paradigm for fine-tuning Large Language Models (LLMs) for conditional text generation. In particular, recent LLMs such as ChatGPT and GPT-4 can engage in fluent conversations with users by incorporating RL and feedback from humans. Inspired by learning-to-search algorithms and capitalizing on key properties of text generation, we seek to investigate reinforcement learning algorithms beyond general purpose algorithms such as Proximal policy optimization (PPO). In particular, we extend RL algorithms to allow them to interact with a dynamic black-box guide LLM such as GPT-3 and propose RL with guided feedback (RLGF), a suite of RL algorithms for LLM fine-tuning. We experiment on the IMDB positive review and CommonGen text generation task from the GRUE benchmark. We show that our RL algorithms achieve higher performance than supervised learning (SL) and default PPO baselines, demonstrating the benefit of interaction with the guide LLM. On CommonGen, we not only outperform our SL baselines but also improve beyond PPO across a variety of lexical and semantic metrics beyond the one we optimized for. Notably, on the IMDB dataset, we show that our GPT-2 based policy outperforms the zero-shot GPT-3 oracle, indicating that our algorithms can learn from a powerful, black-box GPT-3 oracle with a simpler, cheaper, and publicly available GPT-2 model while gaining performance.

Untitled.png
Seems interesting. Takes advantage of regens. Wonder how pairing it with evol instruct would work out.

Sign up or log in to comment