---
pipeline_tag: text-generation
language:
- multilingual
inference: false
license: cc-by-nc-4.0
library_name: transformers
---
<br><br>
<p align="center">
<img src="https://huggingface.co/datasets/jinaai/documentation-images/resolve/main/logo.webp" alt="Jina AI: Your Search Foundation, Supercharged!" width="150px">
</p>
<p align="center">
<b>Trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>
[Blog](https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-markdown-and-json) | [Colab](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing)
# Intro
Jina `ReaderLM-v2` is the second generation of Jina ReaderLM, a **1.5B** parameter language model that converts raw HTML into beautifully formatted markdown or JSON with superior accuracy and improved longer context handling.
`ReaderLM-v2` features several significant improvements:
- **Better Markdown Generation**: `ReaderLM-v2` generates markdown with improved formatting and readability.
- **JSON Output**: `ReaderLM-v2` can output JSON, which is useful for downstream processing.
- **Longer Context Handling**: `ReaderLM-v2` handles up to 512K tokens of combined input and output.
- **Multilingual Support**: `ReaderLM-v2` supports 29 languages, including English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more.
# Get Started
## On Google Colab
The easiest way to try `ReaderLM-v2` is to run [our Colab notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing),
which demonstrates HTML-to-markdown conversion, JSON extraction, and instruction-following using the HackerNews frontpage as an example.
The notebook is optimized for Colab's free T4 GPU tier and requires `vllm` and `triton` to run with acceleration.
Feel free to test it with any website.
For HTML-to-markdown tasks, simply input the raw HTML without any prefix instructions.
However, JSON output and instruction-based extraction require specific prompt formatting as shown in the examples.
## Local
To use this model, you need to install `transformers`:
```bash
pip install transformers
```
### HTML to Markdown Conversion
Then, you can use the model to convert HTML to Markdown as follows:
```python
# pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import re
# Remove <script>...</script> blocks (tolerating extra spaces inside the tags)
SCRIPT_PATTERN = r'<[ ]*script.*?\/[ ]*script[ ]*>'  # .*? lazily matches any characters
# Remove <style>...</style> blocks
STYLE_PATTERN = r'<[ ]*style.*?\/[ ]*style[ ]*>'
# Remove <meta ...> tags
META_PATTERN = r'<[ ]*meta.*?>'
# Remove HTML comments <!-- ... -->
COMMENT_PATTERN = r'<[ ]*!--.*?--[ ]*>'
# Remove <link ...> tags
LINK_PATTERN = r'<[ ]*link.*?>'
# Match <img> tags with inline base64 data so they can be replaced
BASE64_IMG_PATTERN = r'<img[^>]+src="data:image/[^;]+;base64,[^"]+"[^>]*>'
# Match <svg>...</svg> so the inner content can be replaced
SVG_PATTERN = r'(<svg[^>]*>)(.*?)(<\/svg>)'
def replace_svg(html: str, new_content: str = "this is a placeholder") -> str:
    return re.sub(
        SVG_PATTERN,
        lambda match: f"{match.group(1)}{new_content}{match.group(3)}",
        html,
        flags=re.DOTALL,
    )

def replace_base64_images(html: str, new_image_src: str = "#") -> str:
    return re.sub(BASE64_IMG_PATTERN, f'<img src="{new_image_src}"/>', html)

def clean_html(html: str, clean_svg: bool = False, clean_base64: bool = False):
    html = re.sub(SCRIPT_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(STYLE_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(META_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(COMMENT_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(LINK_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    if clean_svg:
        html = replace_svg(html)
    if clean_base64:
        html = replace_base64_images(html)
    return html
device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
model = AutoModelForCausalLM.from_pretrained("jinaai/ReaderLM-v2").to(device)
def create_prompt(text: str, tokenizer = None, instruction: str = None, schema: str = None) -> str:
    """
    Create a prompt for the model with optional instruction and JSON schema.

    Args:
        text (str): The input HTML text
        tokenizer: The tokenizer to use
        instruction (str, optional): Custom instruction for the model
        schema (str, optional): JSON schema for structured extraction

    Returns:
        str: The formatted prompt
    """
    if not instruction:
        instruction = "Extract the main content from the given HTML and convert it to Markdown format."
    if schema:
        instruction = 'Extract the specified information from a list of news threads and present it in a structured JSON format.'
        prompt = f"{instruction}\n```html\n{text}\n```\nThe JSON schema is as follows:```json{schema}```"
    else:
        prompt = f"{instruction}\n```html\n{text}\n```"
    messages = [
        {
            "role": "user",
            "content": prompt,
        }
    ]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
# Example HTML content
html = "<html><body><h1>Hello, world!</h1></body></html>"
# Clean the HTML content: remove scripts, styles, comments, etc.
html = clean_html(html)
# Pass the tokenizer explicitly; create_prompt needs it to apply the chat template
input_prompt = create_prompt(html, tokenizer=tokenizer)
print(input_prompt)
inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
print(tokenizer.decode(outputs[0]))
```
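To see what the cleaning step strips before the HTML reaches the model, here is a small standalone sketch. It repeats three of the patterns from the snippet above so that it runs on its own; the `sample` page is a made-up input used only for illustration.

```python
import re

# Same patterns as in the snippet above, repeated so this example is self-contained.
SCRIPT_PATTERN = r'<[ ]*script.*?\/[ ]*script[ ]*>'
STYLE_PATTERN = r'<[ ]*style.*?\/[ ]*style[ ]*>'
COMMENT_PATTERN = r'<[ ]*!--.*?--[ ]*>'
FLAGS = re.IGNORECASE | re.MULTILINE | re.DOTALL

# Hypothetical page with a style block, a script block, and a comment.
sample = (
    "<html><head><style>h1 { color: red; }</style>"
    "<script>alert('hi');</script></head>"
    "<body><!-- tracking --><h1>Hello, world!</h1></body></html>"
)

cleaned = sample
for pattern in (SCRIPT_PATTERN, STYLE_PATTERN, COMMENT_PATTERN):
    cleaned = re.sub(pattern, '', cleaned, flags=FLAGS)

print(cleaned)
```

Only the structural markup and the visible content remain, which keeps the prompt shorter for pages with heavy inline scripts or styles.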
You can also specify the content you want to extract from the HTML by providing a custom instruction.
For example, if you want to extract the menu items from the HTML content, you can create a prompt like this:
```python
instruction = "Extract the menu items from the given HTML and convert it to Markdown format."
input_prompt = create_prompt(html, tokenizer=tokenizer, instruction=instruction)
inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
print(tokenizer.decode(outputs[0]))
```
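For decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones, so decoding `outputs[0]` prints the prompt as well. A minimal sketch of keeping only the generated part is shown below; `generated_only` is a hypothetical helper name, and the toy ids stand in for real token ids.

```python
def generated_only(output_ids, prompt_length):
    """Drop the first `prompt_length` ids, keeping only the newly generated ones."""
    return output_ids[prompt_length:]

# Toy ids standing in for real token ids: 3 prompt tokens, then 2 generated tokens.
toy_output = [101, 102, 103, 7, 8]
new_ids = generated_only(toy_output, 3)
print(new_ids)

# With the objects from the snippets above, usage would look like:
# new_ids = generated_only(outputs[0], inputs.shape[1])
# print(tokenizer.decode(new_ids, skip_special_tokens=True))
```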
### HTML to JSON Conversion
To extract structured information from HTML content and convert it to JSON, you can create a prompt with a JSON schema.
```python
schema = """
{
  "type": "object",
  "properties": {
    "title": {
      "type": "string"
    },
    "author": {
      "type": "string"
    },
    "date": {
      "type": "string"
    },
    "content": {
      "type": "string"
    }
  },
  "required": ["title", "author", "date", "content"]
}
"""
input_prompt = create_prompt(html, tokenizer=tokenizer, schema=schema)
inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
print(tokenizer.decode(outputs[0]))
```
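For downstream processing, the decoded text usually needs to be turned into a Python object. The sketch below is one possible approach, not part of the model's API: `parse_json_output` is a hypothetical helper, and it assumes the model may wrap its answer in a `json` fenced block. Because the prompt itself contains a fenced schema, the helper takes the last fenced block it finds; the example output string is invented for illustration.

```python
import json
import re

FENCE = "`" * 3  # built this way to avoid nesting a literal fence inside this README
JSON_BLOCK = re.compile(FENCE + r"json\s*(.*?)" + FENCE, flags=re.DOTALL)

def parse_json_output(decoded: str) -> dict:
    """Parse the JSON object contained in the model's decoded text."""
    # Prefer the last json fenced block (the prompt's schema block comes first).
    blocks = JSON_BLOCK.findall(decoded)
    candidate = blocks[-1] if blocks else decoded
    # Keep only the outermost braces to skip any surrounding text.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(candidate[start:end + 1])

# Hypothetical decoded output, used only to exercise the helper.
example_output = (
    FENCE + 'json\n{"title": "Hello, world!", "author": "", "date": "", "content": ""}\n' + FENCE
)
record = parse_json_output(example_output)

# Check that every field marked as required in the schema is present.
missing = [key for key in ("title", "author", "date", "content") if key not in record]
print(record["title"], missing)
```

`json.loads` raises `json.JSONDecodeError` on malformed output, so callers may want to catch it and retry or log the raw text.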
## AWS SageMaker & Azure Marketplace
TBD