fromthesky committed 237581f · 1 Parent(s): 5cca722

Initial commit.

PLDRv51G-106M-2-model-checkpoint.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0280da71ea5da0f9b1745084a66f3c772b4c47fa1f38702c56d001ec3e9409ae
+ size 422476082
PLDRv51G_106M_2_hyperparameters.py ADDED
@@ -0,0 +1,24 @@
+ import torch.nn.functional as F
+ 
+ hpdict={'num_layers': 5,
+ 'd_model': 896,
+ 'num_heads': 14,
+ 'dff': 2389,
+ 'Gcachelst': './predefined_G_LM_cache_list_IDENTITY_5layer_14head_64x64_paper.pkl',
+ 'input_vocab_size': 32000,
+ 'max_seq_len': 1024,
+ 'epochs': 1,
+ 'save_model_path': './PLDRv51G-106M-2-checkpoint',
+ 'warmup_steps': 2000,
+ 'lr_total_steps': 250000,
+ 'learning_rate': 0.0003,
+ 'lr_alpha': 0.1,
+ 'adamw_decay': 0.1,
+ 'activation': F.silu,
+ 'disable_amp': False,
+ 'auto_size_minimum': None,
+ 'disable_fsdp_mixed_precision': False,
+ 'fsdp_cpu_offload': False,
+ 'fsdp_sharding_strategy': 'HYBRID_SHARD',
+ 'backward_prefetch': 'PRE',
+ 'save_type': 'torch'}
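For reference, a minimal sketch (editorial, not part of this commit) of how the committed hyperparameter dict and the predefined G-cache list might be inspected; only the file names and dictionary keys above come from the commit, everything else is an assumption:

```python
# Editorial sketch: inspect the committed hyperparameter dict and G-cache list.
# Assumes the files from this commit sit in the working directory, torch is
# installed (the hyperparameters module imports torch.nn.functional), and the
# G-cache file is a plain pickle; its structure is defined by the PLDR-LLM framework.
import pickle

from PLDRv51G_106M_2_hyperparameters import hpdict

# Architecture settings taken directly from the committed dict.
print(hpdict["num_layers"], hpdict["d_model"], hpdict["num_heads"], hpdict["dff"])

# 'Gcachelst' points at the predefined identity G-cache list committed below.
with open(hpdict["Gcachelst"], "rb") as f:
    g_cache = pickle.load(f)
print(type(g_cache))
```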
README.md CHANGED
@@ -1,3 +1,67 @@
- ---
- license: apache-2.0
- ---
+ ---
+ language:
+ - en
+ tags:
+ - text-generation
+ - large-language-model
+ - power-law-decoder-representations
+ - power-law-graph-attention
+ - pldr-llm
+ - kv-cache
+ - g-cache
+ - kvg-cache
+ - pytorch
+ license: apache-2.0
+ datasets:
+ - tiiuae/falcon-refinedweb
+ ---
+ 
+ # PLDR-LLM-v51G-106M-2
+ 
+ ## Model Description
+ 
+ PLDR-LLM-v51G-106M-2 is a large language model from power law decoder representations (PLDR-LLM) with KV-cache and G-cache support, a foundational language model architecture that uses power law graph attention to generate deductive and inductive outputs. The model has 106M parameters and corresponds to PLDRv51G-106M-2, whose architecture and training details are given in Table 1 of the research paper [PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference](https://arxiv.org/abs/2502.13502).
+ 
+ ## Training data
+ 
+ PLDR-LLM-v51G-106M-2 was pretrained on [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), a publicly available English web dataset with extensive filtering and deduplication.
+ 
+ ## Training procedure
+ 
+ This model was trained for ~8B tokens on RefinedWeb over 250k steps per rank. It was trained autoregressively with cross-entropy loss.
+ 
+ ## Intended Use and Limitations
+ 
+ This model is intended for research purposes. Given a text prompt, it carries out next-token prediction to generate continuation text. The context length for this model is 1024 tokens.
+ 
+ ### How to use
+ 
+ - The model checkpoint and tokenizer can be loaded into the PLDR-LLM framework to generate text as described in the code repository used to train this model: [PLDR-LLM-with-KVG-cache](https://github.com/burcgokden/PLDR-LLM-with-KVG-cache). A minimal loading sketch is shown below.
+
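+ A minimal sketch for unpacking the tokenizer archive and loading the raw checkpoint; this is not the framework's own generation API (which is documented in the repository above), and only the file names in this model repository are taken as given:
+ 
+ ```python
+ # Sketch: extract the tokenizer archive and load the checkpoint contents.
+ # Text generation itself should follow the PLDR-LLM-with-KVG-cache repository;
+ # its model and generation API are not reproduced here.
+ import tarfile
+ import torch
+ 
+ # The tokenizer ships as a tar.gz in this repository; inspect the extracted files.
+ with tarfile.open("refinedweb-tokenizer-pldrllm-kvg-cache-paper.tar.gz") as tar:
+     tar.extractall("tokenizer")
+ 
+ # The checkpoint is a regular PyTorch file stored via Git LFS. On newer PyTorch
+ # versions, weights_only=False may be needed if it stores more than plain tensors.
+ ckpt = torch.load("PLDRv51G-106M-2-model-checkpoint.pth", map_location="cpu")
+ print(type(ckpt))
+ ```
+ 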
+ ### LM Evaluation Harness Support
+ 
+ - The model can be evaluated with a fork of the LM Evaluation Harness suite that adds support for PLDR-LLM with KV-cache and G-cache: [lm-evaluation-harness-with-PLDR-LLM-kvg-cache](https://github.com/burcgokden/lm-evaluation-harness-with-PLDR-LLM-kvg-cache).
+ 
+ ### Limitations and Biases
+ 
+ Large language models may generate text that is profane, lewd, socially unacceptable, or offensive, depending on the contents of the dataset they were pretrained on. RefinedWeb is roughly as toxic and biased as the Pile; please see the papers for [RefinedWeb](https://arxiv.org/abs/2306.01116) and [the Pile](https://arxiv.org/pdf/2101.00027) for more information. Large language models are also susceptible to hallucinations and may generate text that is incorrect, irrelevant, or misleading. Since the contents of generated text are hard to anticipate ahead of time, model output needs to be heavily moderated and curated so that undesired content does not appear without warning.
+ 
+ ## Eval results
+ 
+ Zero-shot benchmark results for this model, and comparisons with LLMs of similar size reported in the literature, can be found in Tables 3-5 and 7 of the [research paper](https://arxiv.org/abs/2502.13502).
+ 
+ ### BibTeX entry and citation info
+ 
+ Please cite this model as:
+ 
+ ```bibtex
+ @misc{gokden2025pldrllmkvgcache,
+       title={PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference},
+       author={Burc Gokden},
+       year={2025},
+       eprint={2502.13502},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL},
+       url={https://arxiv.org/abs/2502.13502},
+ }
+ ```
predefined_G_LM_cache_list_IDENTITY_5layer_14head_64x64_paper.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4e8eb98538109cb263369958fa73fcaa30773435fc0a25d4f13d957942f772a8
+ size 1148520
refinedweb-tokenizer-pldrllm-kvg-cache-paper.tar.gz ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4485f0cb53da6c9a25d99153b70f52c2e643789b54786bed4168f11e83091818
+ size 616297