---
language: en
tags:
- license
- sentence-classification
- scancode
- license-compliance
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
- scancode-rules
version: 1.0
---

# `lic-class-scancode-bert-base-cased-L32-1`

## Intended Use

This model is intended for sentence classification, which supports results
analysis in [`scancode-results-analyzer`](https://github.com/nexB/scancode-results-analyzer).

`scancode-results-analyzer` helps detect faulty scans in [`scancode-toolkit`](https://github.com/nexB/scancode-toolkit) by using statistics and NLP modeling, among other tools,
to make ScanCode better.

## How to Use

Refer to the [quickstart](https://github.com/nexB/scancode-results-analyzer#quickstart---local-machine) section of the `scancode-results-analyzer` documentation for installation and getting started.

- [Link to Code](https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py)

In the `NLPModelsPredict` class, the function `predict_basic_lic_class` uses this classifier to
assign each sentence to one of the four license classes (text, notice, tag, or reference).

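A classifier of this kind typically returns one raw score per class, and the caller turns those scores into a label. The following is a minimal sketch of that post-processing step, not the code used in `scancode-results-analyzer`. The label order in `LABELS`, the helper names, and the example scores are all assumptions made for illustration.

```python
import math

# Assumed label order; the real index-to-label mapping is defined by the
# training code and may differ.
LABELS = ["License Text", "License Notice", "License Tag", "License Reference"]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_label(scores, labels=LABELS):
    """Return the most probable label together with its probability."""
    if len(scores) != len(labels):
        raise ValueError("expected one score per label")
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical scores for a single sentence, for illustration only.
label, prob = pick_label([0.2, 0.1, 3.5, 0.4])
print(label, round(prob, 3))
```

Keeping the probability alongside the label lets downstream analysis treat low-confidence predictions differently from confident ones.
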
## Limitations and Bias

As this model is a fine-tuned version of the [`bert-base-cased`](https://huggingface.co/bert-base-cased) model,
it inherits that model's biases. However, the task it is fine-tuned for is very specific
(classifying license texts, notices, tags, and references) and unrelated to the contexts in which those biases arise,
so they are unlikely to matter here.

## Training and Fine-Tuning Data

The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books, and English Wikipedia (excluding lists, tables and headers).

This `bert-base-cased` model was then fine-tuned on ScanCode rule texts for
sentence classification, with the following four classes:

- License Text
- License Notice
- License Tag
- License Reference

## Training Procedure

For the fine-tuning procedure and training, refer to the `scancode-results-analyzer` code.

- [Link to Code](https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py)

In the `NLPModelsTrain` class, the function `prepare_input_data_false_positive` prepares the
training data.

In the `NLPModelsTrain` class, the function `train_basic_false_positive_classifier` fine-tunes
this classifier.

1. Model - [BertBaseCased](https://huggingface.co/bert-base-cased) (Weights 0.5 GB)
2. Sentence Length - 32
3. Labels - 4 (License Text/Notice/Tag/Reference)
4. Fine-Tuning - 4 epochs with learning rate 2e-5 (60 secs each on an RTX 2060)

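A fixed sentence length of 32 is usually enforced by truncating longer inputs and padding shorter ones. The sketch below shows one way to do that, assuming the length is measured in tokens. The helper name, the padding id of 0, and the example token ids are assumptions for illustration. A real tokenizer also handles special tokens such as `[CLS]` and `[SEP]`, which this sketch ignores.

```python
MAX_LEN = 32  # the sentence length used for fine-tuning
PAD_ID = 0    # assumed padding id; check the tokenizer's vocabulary


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return (ids, mask), each exactly max_len items long.

    The mask holds 1 for real tokens and 0 for padding, so the model can
    ignore the padded positions.
    """
    ids = list(token_ids)[:max_len]   # drop anything past max_len
    mask = [1] * len(ids)
    padding = max_len - len(ids)      # 0 when the input was truncated
    return ids + [pad_id] * padding, mask + [0] * padding


# Hypothetical token ids, for illustration only.
short_ids, short_mask = pad_or_truncate([7, 8, 9])
long_ids, long_mask = pad_or_truncate(range(1, 51))
print(len(short_ids), sum(short_mask), len(long_ids), sum(long_mask))
```

Truncation silently discards everything past the limit, which is one reason a short fixed length suits sentence-sized inputs better than whole license texts.
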
Note: The classes aren't balanced.

## Eval Results

- Accuracy on the training data (90%): 0.98 (+/- 0.01)
- Accuracy on the validation data (10%): 0.84 (+/- 0.01)

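The figures above come from a 90/10 split and a plain accuracy score. As a rough sketch of how such numbers can be computed (not the evaluation code from `scancode-results-analyzer`), the helpers below split a dataset and score predictions. Because the classes are not balanced, the sketch also reports per-class recall, which overall accuracy can hide. All function names and the toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict


def split_90_10(examples, seed=0):
    """Shuffle the examples and return (train, validation) as a 90/10 split."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 9 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true label, the fraction of its examples predicted correctly."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy data, for illustration only.
train, val = split_90_10(range(100))
y_true = ["text", "text", "text", "tag"]
y_pred = ["text", "text", "text", "text"]
print(len(train), len(val), accuracy(y_true, y_pred))
print(per_class_recall(y_true, y_pred))
```

In the toy data the overall accuracy looks acceptable even though the minority class is never predicted correctly, which is the pattern to watch for with unbalanced classes.
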
## Further Work

1. Applying Splitting/Aggregation Strategies
2. Data Augmentation according to Validation Errors
3. Bigger/Better Suited Models