OfficeTitleDis (Classical Chinese Office Title Disambiguation/Similarity)

This model has been fine-tuned using methodologies from the paper "LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models" by Abhishek Arora and Melissa Dell from Harvard University.

Model Description

This model finds the top N most similar Classical Chinese office titles in a given DataFrame. Given an input DataFrame containing K office titles, it returns, for every office title, the N most similar office titles found in that same DataFrame.
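To make the input and output shape concrete, here is a minimal sketch of top-N retrieval over a small list of titles. It is an illustration only: the similarity function is a stand-in (character overlap from Python's standard-library difflib), not the fine-tuned model's embedding similarity, and the helper name top_n_similar and the sample titles are hypothetical.

```python
from difflib import SequenceMatcher


def top_n_similar(titles, n=2):
    """For each title, return the n other titles with the highest score.

    NOTE: SequenceMatcher is a stand-in similarity used only for this sketch;
    the actual model compares transformer embeddings instead.
    """
    results = {}
    for query in titles:
        # Score the query against every other title in the same collection
        scored = [
            (other, SequenceMatcher(None, query, other).ratio())
            for other in titles
            if other != query
        ]
        # Highest similarity first, then keep only the top n
        scored.sort(key=lambda pair: pair[1], reverse=True)
        results[query] = scored[:n]
    return results


# Hypothetical sample titles, chosen only for illustration
titles = ["知州", "知府", "知縣", "通判", "同知"]
ranking = top_n_similar(titles, n=2)
for title, matches in ranking.items():
    print(title, matches)
```

With K input titles the result has K entries, each holding N (title, score) pairs sorted from most to least similar. Swapping the stand-in score for the model's similarity would leave this structure unchanged.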

Fine-tuning Data

The data used for fine-tuning this model is supported by the China Biographical Database (CBDB) at Harvard University. All office titles in the training data date from the Song, Ming, and Qing dynasties.

Usage

The following section demonstrates how to load the OfficeTitleDis model directly.

Please ensure that the necessary libraries are installed in your Python environment and that the model has been downloaded. If not, you can download the model with git and install the libraries with pip:

git lfs install
git clone https://huggingface.co/cbdb/OfficeTitleDis
pip install linktransformer
pip install hanziconv

Now, let's load our model and make some predictions:

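# Import the required libraries
import pandas as pd
import linktransformer as lt

# Illustrative input only: both DataFrames need an "office_name" column
df1 = pd.DataFrame({"office_name": ["知州", "通判"]})
df2 = pd.DataFrame({"office_name": ["知府", "知縣", "同知"]})

# Match each title in df1 to similar titles in df2 (adjust the model path to where you cloned the repository)
df_lm_matched = lt.merge(df1, df2, merge_type='1:m', on="office_name", model="/content/OfficeTitleDis/model")
print(df_lm_matched.head())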

Authors

Queenie Luo (queenieluo[at]g.harvard.edu)
Hongsu Wang
Peter Bol
CBDB Group

License

Except where otherwise noted, content on this repository is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.