language: en
tags:
- license
- sentence-classification
- scancode
- license-compliance
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
- scancode-rules
version: 1
lic-class-scancode-bert-base-cased-L32-1
Intended Use
This model is intended for sentence classification, which is used for results
analysis in scancode-results-analyzer.
scancode-results-analyzer helps detect faulty scans in scancode-toolkit
by using statistics and NLP modeling, among other tools,
to make Scancode better.
How to Use
Refer to the quickstart section of the scancode-results-analyzer
documentation for installation and getting started.
Then, in the NLPModelsPredict class, the function predict_basic_lic_class
uses this classifier to assign each sentence one of the four license classes
(license text, notice, tag or reference).
Limitations and Bias
As this model is a fine-tuned version of the bert-base-cased model,
it inherits that model's biases. However, the task it is fine-tuned for is very specific
(license text/notice/tag/reference) and unrelated to the contexts in which those biases
appear, so it is reasonable to assume they do not apply here.
Training and Fine-Tuning Data
The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books, and on English Wikipedia (excluding lists, tables and headers).
This bert-base-cased model was then fine-tuned on Scancode Rule texts,
specifically for sentence classification, where the four classes are:
- License Text
- License Notice
- License Tag
- License Reference
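As an illustration of how a four-way classifier's raw output maps onto the classes above, here is a minimal sketch in plain Python. The label order in LABELS is an assumption made for this example; the actual index-to-label mapping is defined by the fine-tuned model and may differ. The logits passed in at the end are made-up values.

```python
import math

# Hypothetical label order, chosen for illustration only. The real mapping
# comes from the fine-tuned model and may be different.
LABELS = ["License Text", "License Notice", "License Tag", "License Reference"]


def softmax(logits):
    """Convert raw scores to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return the most probable label and its probability."""
    if len(logits) != len(LABELS):
        raise ValueError("expected one logit per class")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Made-up logits: the second score is the largest, so the second label wins.
label, confidence = predict_label([0.2, 3.1, -1.0, 0.5])
print(label, round(confidence, 3))
```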
Training Procedure
For the fine-tuning and training procedure, refer to the scancode-results-analyzer
code.
In the NLPModelsTrain class, the function prepare_input_data_false_positive
prepares the training data.
In the NLPModelsTrain class, the function train_basic_false_positive_classifier
fine-tunes this classifier.
- Model - BertBaseCased (Weights 0.5 GB)
- Sentence Length - 32
- Labels - 4 (License Text/Notice/Tag/Reference)
- Fine-Tuning - 4 epochs with learning rate 2e-5 (60 seconds each on an RTX 2060)
Note: The classes aren't balanced.
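The sentence length of 32 listed above means every input is brought to exactly 32 token ids before it reaches the model. The sketch below shows the idea in plain Python. It is a simplification, not the project's code: PAD_ID = 0 is an assumption to verify against the tokenizer in use, the example ids are placeholders, and a real BERT tokenizer also keeps the closing special token when it truncates.

```python
MAX_LEN = 32  # sentence length from the list above
PAD_ID = 0    # assumed padding id; verify against the tokenizer actually used


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), each exactly max_len items long.

    Simplified: truncation just cuts the tail, whereas a real tokenizer
    would preserve the final special token.
    """
    kept = list(token_ids[:max_len])
    padding = max_len - len(kept)
    input_ids = kept + [pad_id] * padding
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(kept) + [0] * padding
    return input_ids, attention_mask


# Placeholder token ids, for illustration only.
ids, mask = pad_or_truncate([11, 22, 33, 44])
print(len(ids), sum(mask))
```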
Eval Results
- Accuracy on the training data (90%): 0.98 (± 0.01)
- Accuracy on the validation data (10%): 0.84 (± 0.01)
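For readers who want to reproduce this kind of measurement, the following sketch shows a 90/10 split and an accuracy computation in plain Python. The function names, the seed and the shuffling strategy are assumptions made for this example and are not taken from scancode-results-analyzer.

```python
import random


def split_90_10(examples, seed=0):
    """Shuffle a copy of the examples and split it 90% / 10%."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 9 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, expected):
    """Fraction of positions where the predicted label equals the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


train, validation = split_90_10(range(100))
print(len(train), len(validation))
```

Because the classes are not balanced, overall accuracy can hide weak performance on the rarer classes, so a per-class breakdown is worth checking alongside it.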
Further Work
- Applying Splitting/Aggregation Strategies
- Data Augmentation according to Validation Errors
- Bigger/Better Suited Models
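One possible reading of the splitting/aggregation item, sketched in plain Python: split a text longer than the 32-token limit into chunks, classify each chunk, and combine the chunk labels by majority vote. This is an illustrative assumption about what such a strategy could look like, not something the project is stated to implement, and both function names are hypothetical.

```python
from collections import Counter


def split_into_chunks(tokens, size=32):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def aggregate_by_majority(chunk_labels):
    """Return the most frequent label among the per-chunk predictions."""
    if not chunk_labels:
        raise ValueError("need at least one chunk label")
    return Counter(chunk_labels).most_common(1)[0][0]


# 70 placeholder tokens become chunks of 32, 32 and 6 tokens.
chunks = split_into_chunks(list(range(70)))
print([len(chunk) for chunk in chunks])

# Stand-in per-chunk predictions; a real pipeline would get these from the model.
print(aggregate_by_majority(["License Notice", "License Tag", "License Notice"]))
```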