ReaderLM-v2 / README.md

	---
	pipeline_tag: text-generation
	language:
	- multilingual
	inference: false
	license: cc-by-nc-4.0
	library_name: transformers
	---

	<br><br>

	<p align="center">
	<img src="https://huggingface.co/datasets/jinaai/documentation-images/resolve/main/logo.webp" alt="Jina AI: Your Search Foundation, Supercharged!" width="150px">
	</p>

	<p align="center">
	<b>Trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
	</p>

	[Blog](https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-markdown-and-json) | [Colab](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing)

	# Intro

	Jina `ReaderLM-v2` is the second generation of Jina ReaderLM, a 1.5B-parameter language model that converts raw HTML into well-formatted Markdown or JSON, with higher accuracy and better long-context handling than the first generation.

	`ReaderLM-v2` features several significant improvements:

	- Better Markdown Generation: `ReaderLM-v2` generates Markdown with improved formatting and readability.
	- JSON Output: `ReaderLM-v2` can emit JSON, which is useful for downstream processing.
	- Longer Context Handling: `ReaderLM-v2` can handle up to 512K tokens of combined input and output length.
	- Multilingual Support: `ReaderLM-v2` supports 29 languages, including English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more.


	# Get Started

	## On Google Colab
	The easiest way to try `ReaderLM-v2` is to run [our Colab notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing),
	which demonstrates HTML-to-Markdown conversion, JSON extraction, and instruction following using the Hacker News front page as an example.
	The notebook is optimized for Colab's free T4 GPU tier and requires `vllm` and `triton` to run with acceleration.
	Feel free to test it with any website.
	For HTML-to-Markdown tasks, simply input the raw HTML without any prefix instruction.
	JSON output and instruction-based extraction, however, require specific prompt formatting, as shown in the examples.

	## Local

	To use this model, you need to install `transformers`:

	```bash
	pip install transformers
	```


	### HTML to Markdown Conversion

	Then, you can use the model to convert HTML to Markdown as follows:

	```python
	# pip install transformers
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import re

	# (REMOVE <SCRIPT> to </script> and variations)
	SCRIPT_PATTERN = r'<[ ]*script.*?\/[ ]*script[ ]*>'  # match any char zero or more times

	# (REMOVE HTML <STYLE> to </style> and variations)
	STYLE_PATTERN = r'<[ ]*style.*?\/[ ]*style[ ]*>'  # match any char zero or more times

	# (REMOVE HTML <meta ...> tags and variations)
	META_PATTERN = r'<[ ]*meta.*?>'  # match any char zero or more times

	# (REMOVE HTML COMMENTS <!-- to --> and variations)
	COMMENT_PATTERN = r'<[ ]*!--.*?--[ ]*>'  # match any char zero or more times

	# (REMOVE HTML <link ...> tags and variations)
	LINK_PATTERN = r'<[ ]*link.*?>'  # match any char zero or more times

	# (REPLACE base64 images)
	BASE64_IMG_PATTERN = r'<img[^>]+src="data:image/[^;]+;base64,[^"]+"[^>]*>'

	# (REPLACE <svg> to </svg> and variations)
	SVG_PATTERN = r'(<svg[^>]*>)(.*?)(<\/svg>)'

	def replace_svg(html: str, new_content: str = "this is a placeholder") -> str:
	    # Keep the opening and closing <svg> tags, swap the body for a placeholder.
	    return re.sub(
	        SVG_PATTERN,
	        lambda match: f"{match.group(1)}{new_content}{match.group(3)}",
	        html,
	        flags=re.DOTALL,
	    )

	def replace_base64_images(html: str, new_image_src: str = "#") -> str:
	    # Replace inline base64 images with a short stub to save tokens.
	    return re.sub(BASE64_IMG_PATTERN, f'<img src="{new_image_src}"/>', html)

	def clean_html(html: str, clean_svg: bool = False, clean_base64: bool = False):
	    html = re.sub(SCRIPT_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
	    html = re.sub(STYLE_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
	    html = re.sub(META_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
	    html = re.sub(COMMENT_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
	    html = re.sub(LINK_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

	    if clean_svg:
	        html = replace_svg(html)

	    if clean_base64:
	        html = replace_base64_images(html)

	    return html


	device = "cuda" # for GPU usage or "cpu" for CPU usage
	tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
	model = AutoModelForCausalLM.from_pretrained("jinaai/ReaderLM-v2").to(device)

	def create_prompt(text: str, tokenizer = None, instruction: str = None, schema: str = None) -> str:
	    """
	    Create a prompt for the model with optional instruction and JSON schema.

	    Args:
	        text (str): The input HTML text
	        tokenizer: The tokenizer to use
	        instruction (str, optional): Custom instruction for the model
	        schema (str, optional): JSON schema for structured extraction

	    Returns:
	        str: The formatted prompt
	    """

	    if not instruction:
	        instruction = "Extract the main content from the given HTML and convert it to Markdown format."

	    if schema:
	        instruction = 'Extract the specified information from a list of news threads and present it in a structured JSON format.'
	        prompt = f"{instruction}\n```html\n{text}\n```\nThe JSON schema is as follows:```json{schema}```"
	    else:
	        prompt = f"{instruction}\n```html\n{text}\n```"

	    messages = [
	        {
	            "role": "user",
	            "content": prompt,
	        }
	    ]

	    return tokenizer.apply_chat_template(
	        messages, tokenize=False, add_generation_prompt=True
	    )

	# example html content
	html = "<html><body><h1>Hello, world!</h1></body></html>"

	# clean the html content, remove scripts, styles, comments, etc.
	html = clean_html(html)

	# create_prompt needs the tokenizer to apply the chat template
	input_prompt = create_prompt(html, tokenizer=tokenizer)

	print(input_prompt)

	inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
	outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)

	print(tokenizer.decode(outputs[0]))
	```
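The cleaning step can be checked on its own before any model is loaded. The sketch below restates the script, style, and comment patterns so that it runs with the standard library only; the sample page is made up for illustration and is not taken from the notebook.

```python
import re

# Same flags as clean_html: case-insensitive, and "." also matches newlines.
FLAGS = re.IGNORECASE | re.MULTILINE | re.DOTALL

SCRIPT_PATTERN = r'<[ ]*script.*?\/[ ]*script[ ]*>'
STYLE_PATTERN = r'<[ ]*style.*?\/[ ]*style[ ]*>'
COMMENT_PATTERN = r'<[ ]*!--.*?--[ ]*>'

# Hypothetical sample page, invented for this illustration.
sample_html = """<html>
<head>
<STYLE>h1 { color: red; }</STYLE>
<script type="text/javascript">
  console.log("tracking");
</script>
</head>
<body><!-- navigation goes here --><h1>Hello, world!</h1></body>
</html>"""

cleaned = sample_html
for pattern in (SCRIPT_PATTERN, STYLE_PATTERN, COMMENT_PATTERN):
    # Each pass deletes every non-greedy match of one pattern.
    cleaned = re.sub(pattern, '', cleaned, flags=FLAGS)

print(cleaned)
```

The visible content (`<h1>Hello, world!</h1>`) survives, while the stylesheet, the script, and the comment are dropped, which reduces the number of tokens the model has to read.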

	You can also specify the content you want to extract from the HTML by providing a custom instruction.
	For example, if you want to extract the menu items from the HTML content, you can create a prompt like this:

	```python
	instruction = "Extract the menu items from the given HTML and convert it to Markdown format."
	input_prompt = create_prompt(html, tokenizer=tokenizer, instruction=instruction)

	inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
	outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)

	print(tokenizer.decode(outputs[0]))
	```
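Because `tokenizer.decode(outputs[0])` decodes the whole sequence, the printed text contains the prompt as well as the answer. A minimal sketch of one way to keep only the answer follows. It assumes the chat template delimits turns with ChatML-style `<|im_start|>` and `<|im_end|>` markers, which this README does not state; check the output of `print(input_prompt)` before relying on it. `extract_reply` and the sample transcript are hypothetical.

```python
def extract_reply(decoded: str) -> str:
    """Return the last assistant turn of a ChatML-style transcript."""
    start_marker = "<|im_start|>assistant\n"  # assumed turn opener
    end_marker = "<|im_end|>"                 # assumed turn terminator

    start = decoded.rfind(start_marker)
    if start == -1:
        # No marker found: fall back to the full text.
        return decoded.strip()

    reply = decoded[start + len(start_marker):]
    end = reply.find(end_marker)
    if end != -1:
        reply = reply[:end]
    return reply.strip()


# Hypothetical decoded output, invented for this illustration.
decoded = (
    "<|im_start|>user\nExtract the menu items from the given HTML.<|im_end|>\n"
    "<|im_start|>assistant\n- Home\n- About\n<|im_end|>"
)
print(extract_reply(decoded))
```

An alternative that avoids string matching is to slice off the prompt tokens before decoding, for example `tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)`.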

	### HTML to JSON Conversion

	To extract structured information from HTML content and convert it to JSON, you can create a prompt with a JSON schema.

	```python
	schema = """
	{
	  "type": "object",
	  "properties": {
	    "title": {
	      "type": "string"
	    },
	    "author": {
	      "type": "string"
	    },
	    "date": {
	      "type": "string"
	    },
	    "content": {
	      "type": "string"
	    }
	  },
	  "required": ["title", "author", "date", "content"]
	}
	"""

	input_prompt = create_prompt(html, tokenizer=tokenizer, schema=schema)

	inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
	outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)

	print(tokenizer.decode(outputs[0]))
	```
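The generated text still has to be turned into a Python object before downstream code can use it. The sketch below is one way to do that, assuming the reply is either bare JSON or JSON wrapped in a Markdown code fence; the README does not specify the exact output format, so treat both the helper `parse_json_reply` and the sample reply as hypothetical.

```python
import json
import re

# Triple backtick, built indirectly so this snippet is easy to embed in Markdown.
FENCE = "`" * 3


def parse_json_reply(reply: str) -> dict:
    """Parse a JSON object from model text, with or without a code fence."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    fenced = re.search(pattern, reply, flags=re.DOTALL)
    payload = fenced.group(1) if fenced else reply
    # json.loads raises json.JSONDecodeError if the payload is not valid JSON.
    return json.loads(payload.strip())


# Keys listed under "required" in the schema above.
REQUIRED = ["title", "author", "date", "content"]

# Hypothetical model reply, invented for this illustration.
reply = (
    FENCE + "json\n"
    '{"title": "Hello, world!", "author": "", "date": "", "content": "Hello, world!"}\n'
    + FENCE
)

record = parse_json_reply(reply)
missing = [key for key in REQUIRED if key not in record]
print(record["title"], missing)
```

Checking the required keys by hand keeps the example dependency-free; a full JSON Schema validator would be the stricter choice.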


	## AWS SageMaker & Azure Marketplace

	TBD