---
license: cc-by-nc-4.0
pipeline_tag: fill-mask
widget:
- text: >-
    The most trusted online bulk seller in the world
    -Consistent 90%+ purity
    -All shipments straight off the brick. 250-500g orders received a portion of a stamped brick. At 1000g, full stamped bricks are shipped.
    -We utilize the best packaging equipment available for the highest level of stealth and security.
extra_gated_prompt: >-
  DarkBERT is available for access upon request.
  Users may submit their request using the form below, which includes the **name of the user**, the **user’s institution**, the **user’s email address that matches the institution** *(we especially emphasize this part; any non-academic addresses such as gmail, tutanota, protonmail, etc. are automatically rejected as it makes it difficult for us to verify your affiliation to the institution)*, and the **purpose of usage** *(in as much detail as possible)*.
  By requesting and downloading DarkBERT, the user agrees to the following: the user acknowledges that the use of this model is restricted to research and/or academic purposes only.
  Access to the model will be granted after the request is manually reviewed.
  A request may be declined if it does not sufficiently describe research purposes that follow the ACM Code of Ethics (https://www.acm.org/code-of-ethics).
  The information provided by the requesting user will not be used in any way except for sending the dataset to the user and keeping track of request history for DarkBERT.
  By requesting for the model, the user agrees to our collection of the provided information.
  This model shall only be used for non-profit research purposes and in a manner consistent with fair practice.
  Do not redistribute this dataset to others.
  The user should indicate the source of this model (found at the bottom of the page) when using or citing the model in their research or article.
extra_gated_fields:
  Full Name: text
  Affiliated Institution / Organization / University: text
  E-mail (must match affiliation, generic domains such as gmail not allowed): text
  Position (ex doctoral student, professor, researcher, etc): text
  Purpose of Usage (Please describe the purpose of usage in as much detail as possible): text
  Country: text
  I have read the conditions and agree to use this model for ethical, non-commercial use ONLY: checkbox
  A request cannot be modified once submitted; I understand that requests with incomplete, insufficient, or inaccurate information will be rejected: checkbox
language:
- en
---

# DarkBERT
A BERT-like model pretrained on a Dark Web corpus, as described in "DarkBERT: A Language Model for the Dark Side of the Internet" (ACL 2023).

# Conditions
DarkBERT is available for access upon request. Users may submit their request using the form below, which includes the **name of the user**, the **user’s institution**, the **user’s email address that matches the institution** (we especially emphasize this part; non-academic addresses such as gmail, tutanota, protonmail, etc. are automatically rejected because they make it difficult for us to verify your affiliation with the institution), and the **purpose of usage**.

By requesting and downloading DarkBERT, the user agrees to the following: the user acknowledges that the use of this model is restricted to research and/or academic purposes only. Access to the model will be granted after the request is manually reviewed. A request may be declined if it does not sufficiently describe research purposes that follow the ACM Code of Ethics (https://www.acm.org/code-of-ethics). The information provided by the requesting user will not be used in any way except for sending the dataset to the user and keeping track of request history for DarkBERT. By requesting the model, the user agrees to our collection of the provided information.
This model shall only be used for non-profit research purposes and in a manner consistent with fair practice. Do not redistribute this dataset to others. The user should indicate the source of this model (found at the bottom of the page) when using or citing the model in their research or article.

## What is included?
The preprocessed version of DarkBERT. Benchmark datasets in the `benchmark-dataset` directory.

## Sample Usage
```python
>>> from transformers import pipeline
>>> # Local directory containing the downloaded DarkBERT files
>>> folder_dir = "DarkBERT"
>>> unmasker = pipeline('fill-mask', model=folder_dir)
>>> # The input must contain the tokenizer's mask token
>>> unmasker("RagnarLocker, LockBit, and REvil are types of <mask>.")
[{'score': 0.4952353239059448, 'token': 25346, 'token_str': ' ransomware', 'sequence': 'RagnarLocker, LockBit, and REvil are types of ransomware.'},
 {'score': 0.04661545157432556, 'token': 16886, 'token_str': ' malware', 'sequence': 'RagnarLocker, LockBit, and REvil are types of malware.'},
 {'score': 0.04217657446861267, 'token': 28811, 'token_str': ' wallets', 'sequence': 'RagnarLocker, LockBit, and REvil are types of wallets.'},
 {'score': 0.028982503339648247, 'token': 2196, 'token_str': ' drugs', 'sequence': 'RagnarLocker, LockBit, and REvil are types of drugs.'},
 {'score': 0.020001502707600594, 'token': 11344, 'token_str': ' hackers', 'sequence': 'RagnarLocker, LockBit, and REvil are types of hackers.'}]

>>> # Load the bare encoder to extract contextual token embeddings
>>> from transformers import AutoModel, AutoTokenizer
>>> model = AutoModel.from_pretrained(folder_dir)
>>> tokenizer = AutoTokenizer.from_pretrained(folder_dir)
>>> text = "Recent research has suggested that there are clear differences in the language used in the Dark Web compared to that of the Surface Web."
>>> encoded = tokenizer(text, return_tensors="pt")
>>> output = model(**encoded)
>>> # Last hidden state: (batch size, sequence length, hidden size)
>>> output[0].shape
torch.Size([1, 27, 768])
```

## Citation
If you are using the DarkBERT model, please cite the following paper accordingly:

```
Youngjin Jin, Eugene Jang, Jian Cui, Jin-Woo Chung, Yongjae Lee, and Seungwon Shin. 2023. DarkBERT: A Language Model for the Dark Side of the Internet. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7515–7533, Toronto, Canada. Association for Computational Linguistics.
```
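
## Sample Usage: Filtering Predictions
The fill-mask call under Sample Usage returns a list of dicts with `score`, `token`, `token_str`, and `sequence` keys, and each `token_str` carries a leading space. The snippet below is a minimal sketch of how such a list might be post-processed. The helper name `confident_tokens` and the `min_score` threshold are hypothetical and not part of DarkBERT or `transformers`. The input is two entries copied from the output shown above, so the snippet runs without access to the model.

```python
from typing import Dict, List, Tuple

# Two entries copied from the fill-mask output shown under Sample Usage.
sample_results: List[Dict] = [
    {'score': 0.4952353239059448, 'token': 25346, 'token_str': ' ransomware',
     'sequence': 'RagnarLocker, LockBit, and REvil are types of ransomware.'},
    {'score': 0.04661545157432556, 'token': 16886, 'token_str': ' malware',
     'sequence': 'RagnarLocker, LockBit, and REvil are types of malware.'},
]


def confident_tokens(results: List[Dict], min_score: float = 0.1) -> List[Tuple[str, float]]:
    """Hypothetical helper: keep predictions scoring at least min_score.

    Returns (token, score) pairs with the leading space removed from each token.
    """
    return [
        (r['token_str'].strip(), r['score'])
        for r in results
        if r['score'] >= min_score
    ]


print(confident_tokens(sample_results))
# [('ransomware', 0.4952353239059448)]
```

With a real pipeline, the list returned by `unmasker(...)` could be passed in place of `sample_results`. The 0.1 threshold is arbitrary and would need tuning for any actual use.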