|
--- |
|
license: apache-2.0 |
|
metrics: |
|
- accuracy |
|
base_model: |
|
- meta-llama/Llama-3.1-8B-Instruct |
|
datasets: |
|
- stanfordnlp/nnetnav-wa |
|
--- |
|
|
|
# Model Card for Llama8b-NNetNav-WA |
|
|
|
<!-- Provide a quick summary of what the model is/does. [Optional] --> |
|
LLama8b-NNetNav-WA is a [LLama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model that is instruct-tuned with [NNetNav-WA](https://huggingface.co/datasets/stanfordnlp/nnetnav-wa) data collected via unsupervised exploration on [WebArena](http://webarena.dev) websites, with a larger [LLama-3.1-70B](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) model. |
|
|
|
More details about this model can be found in our paper: [NNetNav: Unsupervised Learning of Browser Agents Through Environment Interaction in the Wild](https://arxiv.org/abs/2410.02907). |
|
|
|
|
|
## Table of Contents |
|
|
|
- [Model Card for Llama8b-NNetNav-WA](#model-card-for--model_id-) |
|
- [Table of Contents](#table-of-contents) |
|
- [Model Details](#model-details) |
|
- [Results on Web-Agent Benchmarks](#results-on-benchmarks) |
|
- [Bias, Risks, and Limitations](#bias-risks-and-limitations) |
|
- [Training Details](#training-details) |
|
- [Training Data](#training-data) |
|
- [Training Procedure](#training-procedure) |
|
- [Environmental Impact](#environmental-impact) |
|
- [Technical Specifications](#technical-specifications) |
|
- [Hardware](#hardware) |
|
- [Software](#software) |
|
- [Model Card Authors [optional]](#model-card-authors-optional) |
|
- [Model Card Contact](#model-card-contact) |
|
- [How to Get Started with the Model](#how-to-get-started-with-the-model) |
|
|
|
## Model Details |
|
This model is intended to be used as a **web-agent** i.e. given an instruction such as _Upvote the post by user smurty123 on subreddit r/LocalLLaMA_, and a web-url _reddit.com_, the model can perform the task by executing a sequence of actions. |
|
|
|
<!-- Provide a longer summary of what this model is/does. --> |
|
The action space of the model is as follows: |
|
```plaintext |
|
Page Operation Actions: |
|
`click [id]`: This action clicks on an element with a specific id on the webpage. |
|
`type [id] [content] [press_enter_after=0|1]`: Use this to type the content into the field with id. By default, the "Enter" key is pressed after typing unless press_enter_after is set to 0. |
|
`hover [id]`: Hover over an element with id. |
|
`press [key_comb]`: Simulates the pressing of a key combination on the keyboard (e.g., Ctrl+v). |
|
`scroll [down|up]`: Scroll the page up or down. |
|
|
|
Tab Management Actions: |
|
`new_tab`: Open a new, empty browser tab. |
|
`tab_focus [tab_index]`: Switch the browser's focus to a specific tab using its index. |
|
`close_tab`: Close the currently active tab. |
|
|
|
URL Navigation Actions: |
|
`goto [url]`: Navigate to a specific URL. |
|
`go_back`: Navigate to the previously viewed page. |
|
`go_forward`: Navigate to the next page (if a previous 'go_back' action was performed). |
|
|
|
Completion Action: |
|
`stop [answer]`: Issue this action when you believe the task is complete. If the objective is to find a text-based answer, provide the answer in the bracket. If you believe the task is impossible to complete, provide the answer as "N/A" in the bracket. |
|
``` |
|
|
|
## Results on Benchmarks |
|
|
|
This model gets the following results on WebArena and WebVoyager: |
|
|
|
| Model | WebArena (SR) | WebVoyager (SR) | |
|
|------------------------|--------------:|---------------:| |
|
| **GPT-4** | **14.1** | **33.5** | |
|
| **llama8b-nnetnav-wa** | **16.3** | **28.1** | |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
### **Bias** |
|
As with all ML models, **Llama8b-NNetNav-WA** inherits biases from its training data. Since the dataset is collected via unsupervised exploration on self-hosted WebArena websites, it will reflect biases present in website structures, navigation flows, and content representations. |
|
|
|
- **Selection Bias:** The model is trained on Self-hosted websites that mimic reddit, github, google maps, simple e-commerce websites and CMS websites. This model is likely to struggle with websites with modern layouts seen on live websites. |
|
- **Demographic Bias:** WebArena self-hosted websites over-represent Western English-speaking users, and the model may perform worse on non-English or culturally distinct websites. |
|
- Example: A model trained mostly on U.S. e-commerce sites may navigate amazon.com effectively but may struggle with Flipkart (India) or Rakuten (Japan). |
|
|
|
If you are interested in training a NNetNav based agent for your own domain, please check out our [codebase](https://github.com/MurtyShikhar/NNetnav). Or if you're interested in a model that has been shown to work well on a variety of live websites, please check out [LLama8b-NNetNav-Live](https://huggingface.co/stanfordnlp/llama8b-nnetnav-live) |
|
|
|
### **Risks** |
|
#### **1. Unintended Actions** |
|
The model operates by executing web actions based on textual observation spaces, which may lead to unintended consequences when dealing with ambiguous or poorly structured websites. |
|
|
|
- If instructed to "delete all spam messages in my inbox," but the website has unusual button placement in the AXTree, the model might mistakenly delete important emails instead. |
|
- If asked to "buy the cheapest laptop on Amazon," the model might select an accessory instead of an actual laptop if the AXTree of the listing page has misleading layout |
|
|
|
#### **2. Security & Privacy Risks** |
|
Since the model interacts with external web content, there are significant risks related to unintentional data exposure, credential leaks, and interaction with harmful content. |
|
|
|
- If asked to "log into my Gmail and check unread emails," the model may type and submit credentials without realizing it, potentially exposing passwords. |
|
- A user asking the model to "search for free software downloads" might inadvertently lead to interactions with phishing or malware-hosting sites. |
|
|
|
#### **3. Adversarial Manipulation** |
|
Malicious websites can deceive the model by using **dark patterns**—UI/UX tricks that mislead users (or bots). |
|
|
|
- A fraudulent website may create **fake "Close" buttons** in the AXTree that actually trigger **downloads or pop-ups**. The model, thinking it's closing a window, may instead **click a malicious link**. |
|
- If asked to "unsubscribe from a newsletter," but the page uses **misleading button labels** in the AXTree (e.g., "Unsubscribe" actually means "Resubscribe"), the model could perform the opposite action. |
|
|
|
#### **4. Legal & Ethical Considerations** |
|
Web navigation often involves handling user-generated content, news, and e-commerce transactions, all of which pose ethical and legal challenges. |
|
|
|
- If instructed to "find the latest election results," the model might click on a misleading news source, potentially spreading misinformation. |
|
- If asked to "find the cheapest flight ticket," it could unintentionally violate terms of service by scraping restricted airline data. |
|
|
|
### **Limitations** |
|
#### **1. Generalization to Unseen Websites** |
|
This model is trained via interaction on 5 self-hosted WebArena websites, and is known to struggle on real, live websites. Please check out [LLama8b-NNetNav-Live](https://huggingface.co/stanfordnlp/llama8b-nnetnav-live) for a model that performs better on live websites. |
|
|
|
#### **2. Instruction Sensitivity** |
|
Vague instructions can lead to unintended actions. |
|
|
|
- "Find me the best laptop for gaming" is **subjective**, and the model might select a **random option** instead of following some criteria (e.g., GPU, refresh rate). |
|
|
|
#### **3. Performance on Long-Horizon Tasks** |
|
The model may struggle when tasks require **deep memory retention, complex multi-step planning, or backtracking**. |
|
|
|
- *Example:* When booking a hotel on a travel website, the model might navigate **through multiple filters and options** but forget previous selections when reaching the checkout page. |
|
|
|
#### **4. Token Limitations** |
|
The model's **maximum sequence length of 20k tokens** limits its ability to handle long, continuous web interactions. |
|
|
|
- *Example:* When filling a very long multi-step form, the model might forget earlier responses, leading to errors. |
|
|
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
TODO |
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
This model was trained with SFT on the [NNetnav-WA](https://huggingface.co/datasets/stanfordnlp/nnetnav-wa) dataset, which is comprised of synthetic demonstrations entirely from self-hosted websites. |
|
|
|
### Training Procedure |
|
|
|
This model was trained for 2 epochs (roughly 4k gradient steps) with a batch size of 128, and a maximum sequence length of 20000. |
|
|
|
## Environmental Impact |
|
|
|
- **Hardware Type:** 4 H100 GPUs (80G) |
|
- **Hours used:** Roughly 2 days. |
|
- **Cloud Provider:** Stanford compute. |
|
- **Compute Region:** Stanford energy grid. |
|
|
|
## Technical Specifications |
|
|
|
### Hardware |
|
|
|
This model was trained on 4 H100s. |
|
|
|
### Software |
|
|
|
This model was fine-tuned with [Open-Instruct](https://github.com/allenai/open-instruct/tree/main) |
|
|
|
|
|
## Model Card Authors |
|
|
|
Shikhar Murty |
|
|
|
## Model Card Contact |
|
|
|
[email protected] |
|
|