[Blog](https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-markdown-and-json) | [Colab](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing)
# Intro
Jina `ReaderLM-v2` is the second generation of Jina ReaderLM, a **1.5B**-parameter language model that converts raw HTML into cleanly formatted markdown or JSON, with higher accuracy and better long-context handling than its predecessor.
`ReaderLM-v2` features several significant improvements:
- **Better Markdown Generation**: `ReaderLM-v2` generates markdown with improved formatting and readability.
- **JSON Output**: `ReaderLM-v2` can generate JSON output, which is useful for downstream processing.
- **Longer Context Handling**: `ReaderLM-v2` handles up to 512K tokens of combined input and output.
- **Multilingual Support**: `ReaderLM-v2` supports 29 languages, including English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more.
# Get Started
## On Google Colab
The easiest way to try `ReaderLM-v2` is to run [our Colab notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing),
which demonstrates HTML-to-markdown conversion, JSON extraction, and instruction-following using the HackerNews frontpage as an example.
The notebook is optimized for Colab's free T4 GPU tier and requires `vllm` and `triton`, which it uses for accelerated inference.
Feel free to test it with any website.
For HTML-to-markdown tasks, simply input the raw HTML without any prefix instructions.
However, JSON output and instruction-based extraction require specific prompt formatting as shown in the examples.
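
To make the difference between the two modes concrete, here is a minimal sketch of how the user message might be assembled. The helper name `build_user_message`, the instruction wording, and the exact layout (instruction, fenced HTML, then fenced schema) are assumptions for illustration only; the format the model actually expects is the one shown in the notebook examples.

```python
import json
from typing import Optional

FENCE = "`" * 3  # built at runtime so this example nests cleanly in markdown


def build_user_message(html: str,
                       instruction: Optional[str] = None,
                       schema: Optional[dict] = None) -> str:
    """Hypothetical helper: assemble the user-message text for one request."""
    if instruction is None and schema is None:
        # HTML-to-markdown: send the raw HTML with no prefix instruction.
        return html

    # ASSUMED layout: instruction, fenced HTML, then an optional fenced schema.
    parts = []
    if instruction:
        parts.append(instruction.strip())
    parts.append(f"{FENCE}html\n{html}\n{FENCE}")
    if schema is not None:
        parts.append("The JSON schema is as follows:")
        parts.append(f"{FENCE}json\n{json.dumps(schema, indent=2)}\n{FENCE}")
    return "\n".join(parts)


html = "<ul><li><a href='https://example.com'>Example</a></li></ul>"

# Markdown mode: the message is the raw HTML itself.
markdown_message = build_user_message(html)

# JSON mode: an instruction and a schema wrap the HTML.
json_message = build_user_message(
    html,
    instruction="Extract every link as JSON.",
    schema={"type": "object", "properties": {"links": {"type": "array"}}},
)
```

Keeping prompt assembly in one function makes it easy to swap in the notebook's exact wording later without touching the rest of the pipeline.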
## Local
To use this model, you need to install `transformers`:
```bash
pip install transformers
```
### HTML to Markdown Conversion
Then, you can use the model to convert HTML to Markdown as follows:
```python
# pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import re
# (REMOVE <script> to </script> and variations)
SCRIPT_PATTERN = r'<[ ]*script.*?\/[ ]*script[ ]*>'  # match any char zero or more times (non-greedy)
# (REMOVE HTML <style> to </style> and variations)
STYLE_PATTERN = r'<[ ]*style.*?\/[ ]*style[ ]*>'  # match any char zero or more times (non-greedy)
# (REMOVE HTML <meta> tags and variations)
META_PATTERN = r'<[ ]*meta.*?>'  # match any char zero or more times (non-greedy)
# (REMOVE HTML COMMENTS <!-- to --> and variations)
COMMENT_PATTERN = r'<[ ]*!--.*?--[ ]*>'  # match any char zero or more times (non-greedy)
# (REMOVE HTML <link> tags and variations)
LINK_PATTERN = r'<[ ]*link.*?>'  # match any char zero or more times (non-greedy)
# (REPLACE base64 images)
BASE64_IMG_PATTERN = r'<img[^>]+src="data:image/[^;]+;base64,[^"]+"[^>]*>'
# (REPLACE <svg> to </svg> and variations)
SVG_PATTERN = r'(