pipeline_tag: text-generation
language:
  - multilingual
inference: false
license: cc-by-nc-4.0
library_name: transformers



Jina AI: Your Search Foundation, Supercharged!

Trained by Jina AI.

Blog | Colab

Intro

Jina ReaderLM-v2 is the second generation of Jina ReaderLM, a 1.5B-parameter language model that converts raw HTML into well-formatted Markdown or JSON, with higher accuracy and better long-context handling than its predecessor.

ReaderLM-v2 features several significant improvements:

  • Better Markdown Generation: ReaderLM-v2 generates Markdown with improved formatting and readability.
  • JSON Output: ReaderLM-v2 can output JSON directly, which is useful for downstream processing.
  • Longer Context Handling: ReaderLM-v2 can handle up to 512K tokens of combined input and output length.
  • Multilingual Support: ReaderLM-v2 supports 29 languages, including English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more.

Get Started

On Google Colab

The easiest way to try ReaderLM-v2 is to run our Colab notebook, which demonstrates HTML-to-Markdown conversion, JSON extraction, and instruction following, using the Hacker News front page as an example. The notebook is optimized for Colab's free T4 GPU tier and requires vllm and triton for accelerated inference. Feel free to test it with any website.

For HTML-to-Markdown tasks, simply pass in the raw HTML without any prefix instruction. JSON output and instruction-based extraction, however, require the specific prompt formatting shown in the examples.

Local

To use this model, you need to install transformers:

pip install transformers

HTML to Markdown Conversion

Then, you can use the model to convert HTML to Markdown as follows:

# pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import re

# Remove <script>...</script> blocks (tolerates extra spaces inside the tags)
SCRIPT_PATTERN = r'<[ ]*script.*?\/[ ]*script[ ]*>'  # lazily match everything up to the closing tag

# Remove <style>...</style> blocks (tolerates extra spaces inside the tags)
STYLE_PATTERN = r'<[ ]*style.*?\/[ ]*style[ ]*>'  # lazily match everything up to the closing tag

# Remove <meta ...> tags
META_PATTERN = r'<[ ]*meta.*?>'  # lazily match up to the first '>'

# Remove HTML comments <!-- ... -->
COMMENT_PATTERN = r'<[ ]*!--.*?--[ ]*>'  # lazily match everything up to the closing '-->'

# Remove <link ...> tags
LINK_PATTERN = r'<[ ]*link.*?>'  # lazily match up to the first '>'

# (REPLACE base64 images)
BASE64_IMG_PATTERN = r'<img[^>]+src="data:image/[^;]+;base64,[^"]+"[^>]*>'

# (REPLACE <svg> to </svg> and variations)
SVG_PATTERN = r'(<svg[^>]*>)(.*?)(<\/svg>)'

def replace_svg(html: str, new_content: str = "this is a placeholder") -> str:
    return re.sub(
        SVG_PATTERN,
        lambda match: f"{match.group(1)}{new_content}{match.group(3)}",
        html,
        flags=re.DOTALL,
    )

def replace_base64_images(html: str, new_image_src: str = "#") -> str:
    return re.sub(BASE64_IMG_PATTERN, f'<img src="{new_image_src}"/>', html)

def clean_html(html: str, clean_svg: bool = False, clean_base64: bool = False):
    html = re.sub(SCRIPT_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(STYLE_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(META_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(COMMENT_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(LINK_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

    if clean_svg:
        html = replace_svg(html)

    if clean_base64:
        html = replace_base64_images(html)

    return html


device = "cuda"  # use "cpu" if no GPU is available
tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
model = AutoModelForCausalLM.from_pretrained("jinaai/ReaderLM-v2").to(device)

def create_prompt(text: str, tokenizer = None, instruction: str = None, schema: str = None) -> str:
    """
    Create a prompt for the model with optional instruction and JSON schema.

    Args:
        text (str): The input HTML text
        tokenizer: The tokenizer to use
        instruction (str, optional): Custom instruction for the model
        schema (str, optional): JSON schema for structured extraction

    Returns:
        str: The formatted prompt
    """

    if schema:
        # Use the default JSON-extraction instruction only when none was given
        if not instruction:
            instruction = 'Extract the specified information from a list of news threads and present it in a structured JSON format.'
        prompt = f"{instruction}\n```html\n{text}\n```\nThe JSON schema is as follows:```json{schema}```"
    else:
        # Use the default Markdown-conversion instruction only when none was given
        if not instruction:
            instruction = "Extract the main content from the given HTML and convert it to Markdown format."
        prompt = f"{instruction}\n```html\n{text}\n```"

    messages = [
        {
            "role": "user",
            "content": prompt,
        }
    ]

    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

# example html content
html = "<html><body><h1>Hello, world!</h1></body></html>"

# clean the html content, remove scripts, styles, comments, etc.
html = clean_html(html)

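# pass the tokenizer explicitly; create_prompt's own parameter defaults to None
input_prompt = create_prompt(html, tokenizer=tokenizer)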

print(input_prompt)

inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
# do_sample=False already selects greedy decoding, so no temperature is needed
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, repetition_penalty=1.08)

print(tokenizer.decode(outputs[0]))
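
The cleaning step can be tried on its own, without loading the model. The following is a minimal sketch that restates the patterns from the listing above in condensed form and also exercises the optional `clean_svg` and `clean_base64` behaviour, which the main example leaves switched off. The name `clean_html_demo` and the sample HTML are hypothetical and exist only for illustration.

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE | re.DOTALL

# Same patterns as in the listing above, gathered into one list.
REMOVE_PATTERNS = [
    r'<[ ]*script.*?\/[ ]*script[ ]*>',
    r'<[ ]*style.*?\/[ ]*style[ ]*>',
    r'<[ ]*meta.*?>',
    r'<[ ]*!--.*?--[ ]*>',
    r'<[ ]*link.*?>',
]
BASE64_IMG_PATTERN = r'<img[^>]+src="data:image/[^;]+;base64,[^"]+"[^>]*>'
SVG_PATTERN = r'(<svg[^>]*>)(.*?)(<\/svg>)'


def clean_html_demo(html: str, clean_svg: bool = False, clean_base64: bool = False) -> str:
    """Condensed, illustrative equivalent of clean_html (hypothetical name)."""
    for pattern in REMOVE_PATTERNS:
        html = re.sub(pattern, '', html, flags=FLAGS)
    if clean_svg:
        # Keep the <svg> wrapper but swap its body for a short placeholder.
        html = re.sub(
            SVG_PATTERN,
            lambda m: f"{m.group(1)}this is a placeholder{m.group(3)}",
            html,
            flags=re.DOTALL,
        )
    if clean_base64:
        # Replace the whole inline-image tag; its other attributes are dropped.
        html = re.sub(BASE64_IMG_PATTERN, '<img src="#"/>', html)
    return html


# Made-up sample page containing every kind of element the patterns target.
sample = (
    '<html><head><meta charset="utf-8">'
    '<link rel="stylesheet" href="a.css">'
    '<style>h1 { color: red; }</style>'
    '<script>console.log("hi");</script></head>'
    '<body><!-- nav --><h1>Hello, world!</h1>'
    '<svg width="10"><circle r="5"/></svg>'
    '<img alt="dot" src="data:image/png;base64,iVBORw0KGgo=">'
    '</body></html>'
)

basic = clean_html_demo(sample)
full = clean_html_demo(sample, clean_svg=True, clean_base64=True)
print(basic)
print(full)
```

With the defaults, only scripts, styles, meta tags, comments, and link tags are removed. Enabling both flags additionally shortens the SVG body and the inline image, which can noticeably reduce the number of input tokens on pages that embed large graphics.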

You can also specify the content you want to extract from the HTML by providing a custom instruction. For example, if you want to extract the menu items from the HTML content, you can create a prompt like this:

instruction = "Extract the menu items from the given HTML and convert it to Markdown format."
input_prompt = create_prompt(html, tokenizer=tokenizer, instruction=instruction)

inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, repetition_penalty=1.08)

print(tokenizer.decode(outputs[0]))

HTML to JSON Conversion

To extract structured information from HTML content and convert it to JSON, you can create a prompt with a JSON schema.

schema = """
{
  "type": "object",
  "properties": {
    "title": {
      "type": "string"
    },
    "author": {
      "type": "string"
    },
    "date": {
      "type": "string"
    },
    "content": {
      "type": "string"
    }
  },
  "required": ["title", "author", "date", "content"]
}
"""

input_prompt = create_prompt(html, tokenizer=tokenizer, schema=schema)

inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, repetition_penalty=1.08)

print(tokenizer.decode(outputs[0]))
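
For downstream processing, the generated text usually has to be parsed into a Python object. The sketch below shows one possible way to do this, assuming the model wraps its answer in a `json` code fence and falling back to parsing the whole string otherwise. The helper `extract_json` and the `example_output` string are hypothetical; the latter is a hand-written stand-in, not real model output.

```python
import json
import re

# Matches the body of a json code fence (assumed output format, not guaranteed).
JSON_FENCE_RE = re.compile(r"```json\s*(.*?)```", re.DOTALL)


def extract_json(generated: str):
    """Parse JSON from generated text, with or without a surrounding json fence."""
    match = JSON_FENCE_RE.search(generated)
    candidate = match.group(1) if match else generated
    # json.loads raises json.JSONDecodeError if the text is not valid JSON.
    return json.loads(candidate.strip())


# Hand-written stand-in for the generated part of the model output.
example_output = '```json\n{"title": "Hello, world!", "author": "", "date": "", "content": ""}\n```'

record = extract_json(example_output)
print(record["title"])
```

Note that `tokenizer.decode(outputs[0])` returns the prompt as well as the completion, and the prompt itself contains a `json` fence around the schema. Pass only the newly generated part to such a helper, for example by decoding `outputs[0][inputs.shape[1]:]` with `skip_special_tokens=True`.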

AWS SageMaker & Azure Marketplace

TBD