---
pipeline_tag: text-generation
language:
- multilingual
inference: false
license: cc-by-nc-4.0
library_name: transformers
---

Jina AI: Your Search Foundation, Supercharged!

Trained by Jina AI.

[Blog](https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-markdown-and-json) | [Colab](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing)

# Intro

Jina `ReaderLM-v2` is the second generation of Jina ReaderLM, a **1.5B**-parameter language model that converts raw HTML into beautifully formatted Markdown or JSON with superior accuracy and improved long-context handling.

`ReaderLM-v2` features several significant improvements:

- **Better Markdown Generation**: `ReaderLM-v2` generates markdown with improved formatting and readability.
- **JSON Output**: `ReaderLM-v2` can output JSON format, which is useful for downstream processing.
- **Longer Context Handling**: `ReaderLM-v2` can handle up to 512K tokens of combined input and output length.
- **Multilingual Support**: `ReaderLM-v2` supports 29 languages, including English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more.

# Get Started

## On Google Colab

The easiest way to experience `ReaderLM-v2` is by running [our Colab notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing), which demonstrates HTML-to-Markdown conversion, JSON extraction, and instruction-following using the HackerNews frontpage as an example. The notebook is optimized for Colab's free T4 GPU tier and requires `vllm` and `triton` for acceleration. Feel free to test it with any website.

For HTML-to-Markdown tasks, simply input the raw HTML without any prefix instructions. However, JSON output and instruction-based extraction require specific prompt formatting, as shown in the examples below.
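If you want to reproduce the notebook setup outside Colab, the sketch below shows one way to run `ReaderLM-v2` with vLLM. It is an illustrative example rather than the notebook's exact code; the `max_model_len`, `dtype`, and sampling values are assumptions you should adjust to your hardware and inputs.

```python
# Illustrative vLLM setup; max_model_len, dtype, and sampling values are assumptions
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
llm = LLM(model="jinaai/ReaderLM-v2", dtype="float16", max_model_len=32768)
sampling_params = SamplingParams(temperature=0, repetition_penalty=1.08, max_tokens=4096)

html = "<html><body><h1>Hello, world!</h1></body></html>"
messages = [{
    "role": "user",
    "content": f"Extract the main content from the given HTML and convert it to Markdown format.\n```html\n{html}\n```",
}]
# Apply the model's chat template before handing the prompt to vLLM
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```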
## Local

To use this model, you need to install `transformers`:

```bash
pip install transformers
```

### HTML to Markdown Conversion

Then, you can use the model to convert HTML to Markdown as follows:

```python
# pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import re

# (REMOVE <script> to </script> and variations)
SCRIPT_PATTERN = r'<[ ]*script.*?\/[ ]*script[ ]*>'  # match any char zero or more times

# (REMOVE HTML <style> to </style> and variations)
STYLE_PATTERN = r'<[ ]*style.*?\/[ ]*style[ ]*>'  # match any char zero or more times

# (REMOVE HTML <meta> tags and variations)
META_PATTERN = r'<[ ]*meta.*?>'  # match any char zero or more times

# (REMOVE HTML COMMENTS <!-- to --> and variations)
COMMENT_PATTERN = r'<[ ]*!--.*?--[ ]*>'  # match any char zero or more times

# (REMOVE HTML <link> tags and variations)
LINK_PATTERN = r'<[ ]*link.*?>'  # match any char zero or more times

# (REPLACE base64 <img> sources)
BASE64_IMG_PATTERN = r'<img[^>]+src="data:image/[^;]+;base64,[^"]+"[^>]*>'

# (REPLACE <svg> to </svg> content with a placeholder)
SVG_PATTERN = r'(<svg[^>]*>)(.*?)(<\/svg>)'


def replace_svg(html: str, new_content: str = "this is a placeholder") -> str:
    return re.sub(
        SVG_PATTERN,
        lambda match: f"{match.group(1)}{new_content}{match.group(3)}",
        html,
        flags=re.DOTALL,
    )


def replace_base64_images(html: str, new_image_src: str = "#") -> str:
    return re.sub(BASE64_IMG_PATTERN, f'<img src="{new_image_src}"/>', html)


def clean_html(html: str, clean_svg: bool = False, clean_base64: bool = False):
    html = re.sub(SCRIPT_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(STYLE_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(META_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(COMMENT_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(LINK_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

    if clean_svg:
        html = replace_svg(html)
    if clean_base64:
        html = replace_base64_images(html)
    return html


device = "cuda"  # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
model = AutoModelForCausalLM.from_pretrained("jinaai/ReaderLM-v2").to(device)


def create_prompt(text: str, tokenizer=None, instruction: str = None, schema: str = None) -> str:
    """
    Create a prompt for the model with optional instruction and JSON schema.

    Args:
        text (str): The input HTML text
        tokenizer: The tokenizer to use
        instruction (str, optional): Custom instruction for the model
        schema (str, optional): JSON schema for structured extraction

    Returns:
        str: The formatted prompt
    """
    if not instruction:
        instruction = "Extract the main content from the given HTML and convert it to Markdown format."
    if schema:
        instruction = "Extract the specified information from a list of news threads and present it in a structured JSON format."
        prompt = f"{instruction}\n```html\n{text}\n```\nThe JSON schema is as follows:```json{schema}```"
    else:
        prompt = f"{instruction}\n```html\n{text}\n```"

    messages = [
        {
            "role": "user",
            "content": prompt,
        }
    ]

    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )


# example html content
html = "<html><body><h1>Hello, world!</h1></body></html>"

# clean the html content, remove scripts, styles, comments, etc.
html = clean_html(html)

input_prompt = create_prompt(html, tokenizer=tokenizer)
print(input_prompt)

inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
print(tokenizer.decode(outputs[0]))
```
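Note that `tokenizer.decode(outputs[0])` returns the whole sequence, prompt included. If you only want the generated Markdown, a small optional step (not part of the example above) is to slice off the prompt tokens before decoding:

```python
# Keep only the tokens generated after the prompt and drop special tokens
generated_tokens = outputs[0][inputs.shape[-1]:]
markdown = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(markdown)
```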

You can also specify the content you want to extract from the HTML by providing a custom instruction. For example, if you want to extract the menu items from the HTML content, you can create a prompt like this:

```python
instruction = "Extract the menu items from the given HTML and convert it to Markdown format."
input_prompt = create_prompt(html, tokenizer=tokenizer, instruction=instruction)

inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
print(tokenizer.decode(outputs[0]))
```

### HTML to JSON Conversion

To extract structured information from HTML content and convert it to JSON, you can create a prompt with a JSON schema.

```python
schema = """
{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "author": {"type": "string"},
    "date": {"type": "string"},
    "content": {"type": "string"}
  },
  "required": ["title", "author", "date", "content"]
}
"""

input_prompt = create_prompt(html, tokenizer=tokenizer, schema=schema)

inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
print(tokenizer.decode(outputs[0]))
```
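To consume the result programmatically, you can parse the decoded response with the standard library. The helper below is a sketch, not part of the official example; it assumes the model either wraps the object in a fenced `json` block or returns bare JSON.

```python
import json
import re

def extract_json(response: str) -> dict:
    """Return the object from a ```json ... ``` fence, or parse the raw string."""
    match = re.search(r"```json\s*(.*?)\s*```", response, flags=re.DOTALL)
    return json.loads(match.group(1) if match else response)

decoded = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
article = extract_json(decoded)
print(article["title"])
```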

" # clean the html content, remove scripts, styles, comments, etc. html = clean_html(html) input_prompt = create_prompt(html) print(input_prompt) inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device) outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08) print(tokenizer.decode(outputs[0])) ``` You can also specify the content you want to extract from the HTML by providing a custom instruction. For example, if you want to extract the menu items from the HTML content, you can create a prompt like this: ```python instruction = "Extract the menu items from the given HTML and convert it to Markdown format." input_prompt = create_prompt(html, instruction=instruction) inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device) outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08) print(tokenizer.decode(outputs[0])) ``` ### HTML to JSON Conversion To extract structured information from HTML content and convert it to JSON, you can create a prompt with a JSON schema. ```python schema = """ { "type": "object", "properties": { "title": { "type": "string" }, "author": { "type": "string" }, "date": { "type": "string" }, "content": { "type": "string" } }, "required": ["title", "author", "date", "content"] } """ input_prompt = create_prompt(html, schema=schema) inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device) outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08) print(tokenizer.decode(outputs[0])) ``` ## AWS Sagemaker & Azure Marketplace TBD