---
license: cc-by-nc-4.0
datasets:
- ai4privacy/pii-masking-400k
metrics:
- accuracy
- f1
- precision
- recall
base_model:
- microsoft/mdeberta-v3-base
---

# Model Card for ai4privacy-mdeberta-v3-base-general-preprocessed

This model detects PII (Personally Identifiable Information). It was trained by the "The Last Ones" team at the [NeuralWave](https://neuralwave.ch/#/) Hackathon.

## Model Details

This model was fine-tuned from microsoft/mdeberta-v3-base on the ai4privacy/pii-masking-400k dataset.

We used the following training arguments for fine-tuning:

- learning_rate=3e-5
- per_device_train_batch_size=58
- per_device_eval_batch_size=58
- num_train_epochs=3
- weight_decay=0.01
- bf16=True
- seed=42

All other hyperparameters keep the TrainingArguments defaults.

## Training Data

[ai4privacy/pii-masking-400k](https://huggingface.co/datasets/ai4privacy/pii-masking-400k)

## Preprocessing

```python
import re

# `label_set` (the set of PII label names) must be defined before calling this function.

def generate_sequence_labels(text, privacy_mask):
    # sort privacy mask by start position, last span first,
    # so that replacing a span does not shift the offsets of the spans before it
    privacy_mask = sorted(privacy_mask, key=lambda x: x['start'], reverse=True)

    # replace sensitive pieces of text with labels
    for item in privacy_mask:
        label = item['label']
        start = item['start']
        end = item['end']
        value = item['value']
        # count the number of words in the value
        word_count = len(value.split())

        # replace the sensitive information with the appropriate number of [label] placeholders
        replacement = " ".join([f"{label}" for _ in range(word_count)])
        text = text[:start] + replacement + text[end:]

    words = text.split()
    # assign labels to each word
    labels = []
    for word in words:
        match = re.search(r"(\w+)", word)  # first run of word characters, ignoring punctuation
        if match:
            label = match.group(1)
            if label in label_set:
                labels.append(label)
            else:
                # any other word is labeled as "O"
                labels.append("O")
        else:
            labels.append("O")
    return labels
```

```python
# `tokenizer` and `label2id` must be defined before calling this function.

k = 0  # number of examples whose labels could not be aligned

def tokenize_and_align_labels(examples):
    global k

    words = [t.split() for t in examples["source_text"]]
    tokenized_inputs = tokenizer(words, truncation=True, is_split_into_words=True, max_length=512)
    source_labels = [
        generate_sequence_labels(text, mask)
        for text, mask in zip(examples["source_text"], examples["privacy_mask"])
    ]

    labels = []
    for i, label in enumerate(source_labels):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # map tokens to their respective word.
        previous_label = None
        label_ids = [-100]  # leading special token is ignored by the loss
        try:
            for word_idx in word_ids:
                if word_idx is None:
                    # special tokens have no word; their -100 labels are added outside the loop
                    continue
                elif label[word_idx] == "O":
                    # note: previous_label is not reset here, so the same label
                    # reappearing after an "O" gap is tagged I- rather than B-
                    label_ids.append(label2id["O"])
                    continue
                elif previous_label == label[word_idx]:
                    label_ids.append(label2id[f"I-{label[word_idx]}"])
                else:
                    label_ids.append(label2id[f"B-{label[word_idx]}"])
                previous_label = label[word_idx]
            # trailing special token is ignored by the loss
            label_ids = label_ids[:511] + [-100]
            labels.append(label_ids)
            # print(word_ids)
            # print(label_ids)
        except Exception:
            # alignment failed (e.g. word and label counts differ): mask the whole example
            k += 1
            # print(f"{word_idx = }")
            # print(f"{len(label) = }")
            labels.append([-100] * len(tokenized_inputs["input_ids"][i]))

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
```

We use these two functions to generate word-level labels for the source text and then align them with the tokens to obtain token-level labels, so that you can train any kind of model and tokenizer on [ai4privacy/pii-masking-400k](https://huggingface.co/datasets/ai4privacy/pii-masking-400k).

## Evaluation

![image/png](https://cdn-uploads.huggingface.co/production/uploads/671e31b377035878c5f4082a/kzlMRqXBz80y63CmqDWDx.png)

Evaluation results for this model on the validation set (model 2) are shown in the table above.

## Disclaimer

Comment of Non-Affiliation: The publisher of this repository is not affiliated with Ai4Privacy or Ai Suisse SA.

@NeuralWave 2024 - *The Last Ones* Team.
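
## Appendix: Preprocessing Example

To make the labeling step in the Preprocessing section easier to follow, here is a minimal, self-contained sketch of the same idea on one toy sentence. It is an illustration under stated assumptions, not the exact training code: the sample sentence, the three-entry `label_set`, and the helpers `span`, `word_labels`, and `to_bio` are all hypothetical, and the BIO tags are produced per word rather than per subword token, so no tokenizer is needed.

```python
import re

# Hypothetical label set for illustration; the real one comes from the dataset.
label_set = {"GIVENNAME", "SURNAME", "CITY"}


def span(text, value, label):
    """Hypothetical helper: build a privacy_mask entry by locating `value` in `text`."""
    start = text.index(value)
    return {"label": label, "start": start, "end": start + len(value), "value": value}


def word_labels(text, privacy_mask):
    """Replace each masked span with one label placeholder per word, then label every word."""
    # Work from the last span to the first so earlier character offsets stay valid.
    for item in sorted(privacy_mask, key=lambda x: x["start"], reverse=True):
        n_words = len(item["value"].split())
        placeholder = " ".join([item["label"]] * n_words)
        text = text[: item["start"]] + placeholder + text[item["end"] :]

    labels = []
    for word in text.split():
        match = re.search(r"(\w+)", word)  # strips punctuation such as a trailing "."
        if match and match.group(1) in label_set:
            labels.append(match.group(1))
        else:
            labels.append("O")
    return labels


def to_bio(labels):
    """Word-level BIO tags, mirroring the card's rule: I- when the label
    equals the previous non-"O" label, B- otherwise."""
    tags, previous = [], None
    for label in labels:
        if label == "O":
            tags.append("O")
            continue  # like the card's code, "O" does not reset `previous`
        tags.append(f"I-{label}" if label == previous else f"B-{label}")
        previous = label
    return tags


text = "My name is Anna Maria Rossi and I live in Zurich."
mask = [
    span(text, "Anna Maria", "GIVENNAME"),
    span(text, "Rossi", "SURNAME"),
    span(text, "Zurich", "CITY"),
]

tags = to_bio(word_labels(text, mask))
for word, tag in zip(text.split(), tags):
    print(f"{word:10s} {tag}")
```

Because each masked value is replaced by as many placeholders as it has words, the label list lines up one-to-one with `text.split()`. That alignment is what lets `word_ids()` map the word-level labels onto subword tokens in `tokenize_and_align_labels`. If a span boundary falls inside a word, the two lists can differ in length, which is presumably the case the `try`/`except` there guards against.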