---
license: apache-2.0
datasets:
- agentlans/readability
language:
- en
base_model:
- microsoft/deberta-v3-xsmall
pipeline_tag: text-classification
---
# DeBERTa V3 Base and XSmall for Readability Assessment

This is one of two fine-tuned versions of DeBERTa V3 (Base and XSmall) for assessing text readability. Given an English text, it estimates the U.S. grade level required to comprehend it.

## Model Details

- **Architecture:** DeBERTa V3 (Base and XSmall variants)
- **Task:** Regression (Readability Assessment)
- **Training Data:** 105,000 paragraphs from diverse sources
- **Input:** Text
- **Output:** Estimated U.S. grade level required to comprehend the text
  - higher values indicate more complex text (see the sketch below)
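
To make the predicted grade easier to interpret at a glance, it can be bucketed into the usual U.S. schooling stages. The thresholds below are a rough, illustrative mapping and are not part of the model:

```python
def schooling_stage(grade: float) -> str:
    """Map a predicted U.S. grade level to an approximate schooling stage."""
    if grade < 6:
        return "elementary school"
    if grade < 9:
        return "middle school"
    if grade <= 12:
        return "high school"
    return "college and above"

print(schooling_stage(3.8))   # elementary school
print(schooling_stage(17.1))  # college and above
```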

## Performance

Root mean squared error (RMSE) on a 20% held-out validation set:

| Model | RMSE |
|-------|------|
| Base  | 0.5038 |
| XSmall| 0.6296 |
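
RMSE here is the standard root mean squared error between predicted and reference grade levels. A minimal sketch of the metric (the numbers are illustrative, not the actual validation data):

```python
import math

def rmse(predictions, references):
    """Root mean squared error between predicted and reference grade levels."""
    assert len(predictions) == len(references)
    return math.sqrt(
        sum((p - r) ** 2 for p, r in zip(predictions, references)) / len(predictions)
    )

# Hypothetical predictions vs. annotated grades
print(round(rmse([2.6, 10.1, 17.3], [2.0, 9.5, 16.8]), 4))
```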

## Training Data

The models were trained on a diverse dataset of 105,000 paragraphs with the following characteristics:

- Character length: 50 to 2,000
- Interquartile range (IQR) of readability grades < 1 (see the sketch below)

**Sources:**
- HuggingFace's Fineweb-Edu
- Ronen Eldan's TinyStories
- Wikipedia-2023-11-embed-multilingual-v3 (English only)
- ArXiv Abstracts-2021

For more details, please see [agentlans/readability](https://huggingface.co/datasets/agentlans/readability).
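
The length and IQR criteria above can be expressed roughly as the filter below; the function and its inputs are illustrative assumptions, not the actual dataset-construction code:

```python
from statistics import quantiles

def keep_paragraph(text: str, grade_estimates: list[float]) -> bool:
    """Keep paragraphs of 50-2,000 characters whose readability grade
    estimates agree closely (interquartile range < 1)."""
    if not (50 <= len(text) <= 2000):
        return False
    q1, _, q3 = quantiles(grade_estimates, n=4)  # quartile cut points
    return (q3 - q1) < 1
```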

## Usage

Example of how to use the model:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "agentlans/deberta-v3-xsmall-readability-v2"

# Load the tokenizer and model, then move the model to the GPU if available (otherwise CPU)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def readability(text):
    """Processes the text using the model and returns its logits.
    In this case, it's reading grade level in years of education
    (the higher the number, the harder it is to read the text)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze().cpu()
    return logits.tolist()

# Example usage
texts = [x.strip() for x in """
The cat sat on the mat.
I like to eat pizza and ice cream for dinner.
The quick brown fox jumps over the lazy dog.
Students must complete their homework before watching television.
The intricate ecosystem of the rainforest supports a diverse array of flora and fauna.
Quantum mechanics describes the behavior of matter and energy at the molecular, atomic, nuclear, and even smaller microscopic levels.
The socioeconomic ramifications of globalization have led to unprecedented levels of interconnectedness and cultural homogenization.
The ontological argument for the existence of God posits that the very concept of a maximally great being necessitates its existence in reality.
""".strip().split("\n")]

result = readability(texts)
for x, s in zip(texts, result):
    print(f"Text: {x}\nReadability grade: {round(s, 2)}\n")
```

Example output for the `xsmall` model:
```
Text: The cat sat on the mat.
Readability grade: 2.55

Text: I like to eat pizza and ice cream for dinner.
Readability grade: 3.79

Text: The quick brown fox jumps over the lazy dog.
Readability grade: 3.71

Text: Students must complete their homework before watching television.
Readability grade: 10.11

Text: The intricate ecosystem of the rainforest supports a diverse array of flora and fauna.
Readability grade: 9.76

Text: Quantum mechanics describes the behavior of matter and energy at the molecular, atomic, nuclear, and even smaller microscopic levels.
Readability grade: 17.09

Text: The socioeconomic ramifications of globalization have led to unprecedented levels of interconnectedness and cultural homogenization.
Readability grade: 18.56

Text: The ontological argument for the existence of God posits that the very concept of a maximally great being necessitates its existence in reality.
Readability grade: 17.31
```
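
To score many texts at once, the tokenizer, model, and device loaded in the example above can be reused with batched inputs. This is a minimal sketch; the batch size and maximum sequence length are illustrative choices rather than values from the model card:

```python
def readability_batched(texts, batch_size=32, max_length=512):
    """Score a list of texts in batches to keep memory use bounded."""
    scores = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(
            batch, return_tensors="pt", truncation=True,
            padding=True, max_length=max_length,
        ).to(device)
        with torch.no_grad():
            logits = model(**inputs).logits.squeeze(-1).cpu()
        scores.extend(logits.tolist())
    return scores
```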

## Limitations

- English language only
- Performance may vary for texts significantly different from the training data
- Output is based on statistical patterns and may not always align with human judgment
- Readability is assessed purely from textual features and does not account for factors such as subject familiarity or cultural context

## Ethical Considerations

- Should not be used as the sole determinant of text suitability for specific audiences
- Results may reflect biases present in the training data sources
- Care should be taken when using these models in educational or publishing contexts