---
language:
- en
widget:
- text: theft 3
- text: forgery
- text: unlawful possession short-barreled shotgun
- text: criminal trespass 2nd degree
- text: eluding a police vehicle
- text: upcs synthetic narcotic
license: apache-2.0
---

# ROTA

## Rapid Offense Text Autocoder

[![HuggingFace Models](https://img.shields.io/badge/%F0%9F%A4%97%20models-2021.05.18.15-blue)](https://huggingface.co/rti-international/rota)
[![HuggingFace Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20spaces-2021.05.18.15-blue)](https://huggingface.co/spaces/rti-international/rota-app)
[![GitHub Model Release](https://img.shields.io/github/v/release/RTIInternational/rota?logo=github)](https://github.com/RTIInternational/rota)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4770492.svg)](https://doi.org/10.5281/zenodo.4770492)

ROTA Application hosted on Hugging Face Spaces: https://huggingface.co/spaces/rti-international/rota-app

Criminal justice research often requires converting free-text offense descriptions into overall charge categories to aid analysis. For example, the free-text offense "eluding a police vehicle" would be coded to the charge category "Obstruction - Law Enforcement". Because free-text offense descriptions aren't standardized and often need to be categorized in large volumes, coding them by hand is a slow, labor-intensive process for researchers. ROTA is a machine learning model that converts offense text into offense codes.

Currently, ROTA predicts the *charge category* of a given offense text. A *charge category* is one of the headings for offense codes in the [2009 NCRP Codebook: Appendix F](https://www.icpsr.umich.edu/web/NACJD/studies/30799/datadocumentation#).

The model was trained on [publicly available data](https://web.archive.org/web/20201021001250/https://www.icpsr.umich.edu/web/pages/NACJD/guides/ncrp.html) from a crosswalk containing offenses from all 50 states, combined with three additional hand-labeled offense text datasets.

<details>
<summary>Charge Category Example</summary>
<img src="https://i.ibb.co/xLsrzmV/charge-category-example.png" width="500">
</details>

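
As a rough illustration of how a charge category might be predicted for a batch of offense texts, the sketch below assumes that the published checkpoint (`rti-international/rota`) loads as a standard `text-classification` pipeline in the `transformers` library. The helper names `load_rota_classifier`, `code_offenses`, and `demo` are hypothetical and introduced only for this example.

```python
from typing import Callable, Dict, List


def load_rota_classifier():
    """Load the model as a text-classification pipeline.

    Assumption: the checkpoint at `rti-international/rota` is compatible with
    the generic `transformers` pipeline API. The import is done lazily so the
    rest of this sketch can be used without `transformers` installed.
    """
    from transformers import pipeline

    return pipeline("text-classification", model="rti-international/rota")


def code_offenses(classifier: Callable, texts: List[str]) -> List[Dict]:
    """Pair each offense text with its predicted label and score.

    `classifier` is expected to return one {"label": ..., "score": ...}
    dict per input text, as a text-classification pipeline does.
    """
    results = classifier(texts)
    return [
        {"text": text, "charge_category": r["label"], "score": r["score"]}
        for text, r in zip(texts, results)
    ]


def demo():
    """Download the model and code two example offenses (needs network access)."""
    classifier = load_rota_classifier()
    for row in code_offenses(classifier, ["eluding a police vehicle", "theft 3"]):
        print(row)
```

Calling `demo()` downloads the model on first use. Keeping `code_offenses` independent of how the classifier is loaded makes it easy to swap in a locally cached model or a stub for testing.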
### Data Preprocessing

The input text is standardized through a series of preprocessing steps. The text is first passed through a sequence of 500+ case-insensitive regular expressions that identify common misspellings and abbreviations and expand them into fuller, correctly spelled English. Some data-specific prefixes and suffixes are then removed from the text (for example, some states included a statute as part of the text). Finally, punctuation (excluding dollar signs) is removed from the input, repeated spaces between words are collapsed, and the text is lowercased.

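
The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not ROTA's actual implementation: the three expansion patterns and the statute-suffix pattern are hypothetical stand-ins for the 500+ real expressions, and replacing punctuation with a space (rather than deleting it outright) is an assumption.

```python
import re

# Hypothetical examples only; the real list contains 500+ patterns.
EXPANSIONS = [
    (re.compile(r"\bposs\b", re.IGNORECASE), "possession"),
    (re.compile(r"\bveh\b", re.IGNORECASE), "vehicle"),
    (re.compile(r"\bdeg\b", re.IGNORECASE), "degree"),
]

# Hypothetical data-specific suffix: a trailing statute number in parentheses.
STATUTE_SUFFIX = re.compile(r"\s*\(\s*\d+(?:[.\-]\d+)*\s*\)\s*$")


def preprocess(text: str) -> str:
    """Standardize an offense description before classification."""
    # 1. Expand common abbreviations and misspellings (case-insensitive).
    for pattern, replacement in EXPANSIONS:
        text = pattern.sub(replacement, text)
    # 2. Strip data-specific suffixes such as an embedded statute number.
    text = STATUTE_SUFFIX.sub("", text)
    # 3. Replace punctuation with a space, keeping dollar signs.
    text = re.sub(r"[^A-Za-z0-9\s$]", " ", text)
    # 4. Collapse repeated whitespace and lowercase.
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


print(preprocess("POSS  of Stolen Veh. (16-8-7)"))  # possession of stolen vehicle
print(preprocess("Theft > $200!!"))                 # theft $200
```

Ordering matters in this sketch: the expansions run before punctuation is stripped so that patterns can still rely on characters such as periods or slashes if they need to.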
## Cross-Validation Performance

This model was evaluated using 3-fold cross-validation. Except where noted, the numbers presented below are the mean value across the 3 folds.

The model in this repository was trained on all available data, whereas each cross-validation model saw only two thirds of it. You can therefore typically expect production performance to be somewhat better than the numbers presented below, by an amount that cannot be measured.

### Overall Metrics

| Metric | Value |
| -------- | ----- |
| Accuracy | 0.934 |
| MCC | 0.931 |

| Metric | precision | recall | f1-score |
| --------- | --------- | ------ | -------- |
| macro avg | 0.811 | 0.786 | 0.794 |

*Note*: These values are averaged across folds, so *macro avg* is the mean of the per-fold macro averages over all categories.

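
To make the averaging order in the note concrete, the sketch below computes a macro average within each fold first and then averages those values across folds. The function names and the per-category scores are illustrative placeholders, not ROTA's actual results.

```python
from statistics import mean
from typing import Dict, List


def macro_average(per_category_scores: Dict[str, float]) -> float:
    """Unweighted mean of a metric over all categories in one fold."""
    return mean(per_category_scores.values())


def mean_of_fold_macro_averages(folds: List[Dict[str, float]]) -> float:
    """Average the per-fold macro averages across folds."""
    return mean(macro_average(fold) for fold in folds)


# Illustrative f1-scores for two categories in three folds (made-up numbers).
folds = [
    {"BURGLARY": 1.0, "ARSON": 0.5},
    {"BURGLARY": 0.5, "ARSON": 0.5},
    {"BURGLARY": 0.0, "ARSON": 1.0},
]
print(mean_of_fold_macro_averages(folds))
```

Each category counts equally within a fold regardless of its support, which is why rare categories with low scores pull the macro average well below the overall accuracy.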
### Per-Category Metrics

| Category | precision | recall | f1-score | support |
| ------------------------------------------------------ | --------- | ------ | -------- | ------- |
| AGGRAVATED ASSAULT | 0.954 | 0.954 | 0.954 | 4085 |
| ARMED ROBBERY | 0.961 | 0.955 | 0.958 | 1021 |
| ARSON | 0.946 | 0.954 | 0.95 | 344 |
| ASSAULTING PUBLIC OFFICER | 0.914 | 0.905 | 0.909 | 588 |
| AUTO THEFT | 0.962 | 0.962 | 0.962 | 1660 |
| BLACKMAIL/EXTORTION/INTIMIDATION | 0.872 | 0.871 | 0.872 | 627 |
| BRIBERY AND CONFLICT OF INTEREST | 0.784 | 0.796 | 0.79 | 216 |
| BURGLARY | 0.979 | 0.981 | 0.98 | 2214 |
| CHILD ABUSE | 0.805 | 0.78 | 0.792 | 139 |
| COCAINE OR CRACK VIOLATION OFFENSE UNSPECIFIED | 0.827 | 0.815 | 0.821 | 47 |
| COMMERCIALIZED VICE | 0.818 | 0.788 | 0.802 | 666 |
| CONTEMPT OF COURT | 0.982 | 0.987 | 0.984 | 2952 |
| CONTRIBUTING TO DELINQUENCY OF A MINOR | 0.544 | 0.333 | 0.392 | 50 |
| CONTROLLED SUBSTANCE - OFFENSE UNSPECIFIED | 0.864 | 0.791 | 0.826 | 280 |
| COUNTERFEITING (FEDERAL ONLY) | 0 | 0 | 0 | 2 |
| DESTRUCTION OF PROPERTY | 0.97 | 0.968 | 0.969 | 2560 |
| DRIVING UNDER INFLUENCE - DRUGS | 0.567 | 0.603 | 0.581 | 34 |
| DRIVING UNDER THE INFLUENCE | 0.951 | 0.946 | 0.949 | 2195 |
| DRIVING WHILE INTOXICATED | 0.986 | 0.981 | 0.984 | 2391 |
| DRUG OFFENSES - VIOLATION/DRUG UNSPECIFIED | 0.903 | 0.911 | 0.907 | 3100 |
| DRUNKENNESS/VAGRANCY/DISORDERLY CONDUCT | 0.856 | 0.861 | 0.858 | 380 |
| EMBEZZLEMENT | 0.865 | 0.759 | 0.809 | 100 |
| EMBEZZLEMENT (FEDERAL ONLY) | 0 | 0 | 0 | 1 |
| ESCAPE FROM CUSTODY | 0.988 | 0.991 | 0.989 | 4035 |
| FAMILY RELATED OFFENSES | 0.739 | 0.773 | 0.755 | 442 |
| FELONY - UNSPECIFIED | 0.692 | 0.735 | 0.712 | 122 |
| FLIGHT TO AVOID PROSECUTION | 0.46 | 0.407 | 0.425 | 38 |
| FORCIBLE SODOMY | 0.82 | 0.8 | 0.809 | 76 |
| FORGERY (FEDERAL ONLY) | 0 | 0 | 0 | 2 |
| FORGERY/FRAUD | 0.911 | 0.928 | 0.919 | 4687 |
| FRAUD (FEDERAL ONLY) | 0 | 0 | 0 | 2 |
| GRAND LARCENY - THEFT OVER $200 | 0.957 | 0.973 | 0.965 | 2412 |
| HABITUAL OFFENDER | 0.742 | 0.627 | 0.679 | 53 |
| HEROIN VIOLATION - OFFENSE UNSPECIFIED | 0.879 | 0.811 | 0.843 | 24 |
| HIT AND RUN DRIVING | 0.922 | 0.94 | 0.931 | 303 |
| HIT/RUN DRIVING - PROPERTY DAMAGE | 0.929 | 0.918 | 0.923 | 362 |
| IMMIGRATION VIOLATIONS | 0.84 | 0.609 | 0.697 | 19 |
| INVASION OF PRIVACY | 0.927 | 0.923 | 0.925 | 1235 |
| JUVENILE OFFENSES | 0.928 | 0.866 | 0.895 | 144 |
| KIDNAPPING | 0.937 | 0.93 | 0.933 | 553 |
| LARCENY/THEFT - VALUE UNKNOWN | 0.955 | 0.945 | 0.95 | 3175 |
| LEWD ACT WITH CHILDREN | 0.775 | 0.85 | 0.811 | 596 |
| LIQUOR LAW VIOLATIONS | 0.741 | 0.768 | 0.755 | 214 |
| MANSLAUGHTER - NON-VEHICULAR | 0.626 | 0.802 | 0.701 | 139 |
| MANSLAUGHTER - VEHICULAR | 0.79 | 0.853 | 0.819 | 117 |
| MARIJUANA/HASHISH VIOLATION - OFFENSE UNSPECIFIED | 0.741 | 0.662 | 0.699 | 62 |
| MISDEMEANOR UNSPECIFIED | 0.63 | 0.243 | 0.347 | 57 |
| MORALS/DECENCY - OFFENSE | 0.774 | 0.764 | 0.769 | 412 |
| MURDER | 0.965 | 0.915 | 0.939 | 621 |
| OBSTRUCTION - LAW ENFORCEMENT | 0.939 | 0.947 | 0.943 | 4220 |
| OFFENSES AGAINST COURTS, LEGISLATURES, AND COMMISSIONS | 0.881 | 0.895 | 0.888 | 1965 |
| PAROLE VIOLATION | 0.97 | 0.953 | 0.962 | 946 |
| PETTY LARCENY - THEFT UNDER $200 | 0.965 | 0.761 | 0.85 | 139 |
| POSSESSION/USE - COCAINE OR CRACK | 0.893 | 0.928 | 0.908 | 68 |
| POSSESSION/USE - DRUG UNSPECIFIED | 0.624 | 0.535 | 0.572 | 189 |
| POSSESSION/USE - HEROIN | 0.884 | 0.852 | 0.866 | 25 |
| POSSESSION/USE - MARIJUANA/HASHISH | 0.977 | 0.97 | 0.973 | 556 |
| POSSESSION/USE - OTHER CONTROLLED SUBSTANCES | 0.975 | 0.965 | 0.97 | 3271 |
| PROBATION VIOLATION | 0.963 | 0.953 | 0.958 | 1158 |
| PROPERTY OFFENSES - OTHER | 0.901 | 0.87 | 0.885 | 446 |
| PUBLIC ORDER OFFENSES - OTHER | 0.7 | 0.721 | 0.71 | 1871 |
| RACKETEERING/EXTORTION (FEDERAL ONLY) | 0 | 0 | 0 | 2 |
| RAPE - FORCE | 0.842 | 0.873 | 0.857 | 641 |
| RAPE - STATUTORY - NO FORCE | 0.707 | 0.55 | 0.611 | 140 |
| REGULATORY OFFENSES (FEDERAL ONLY) | 0.847 | 0.567 | 0.674 | 70 |
| RIOTING | 0.784 | 0.605 | 0.68 | 119 |
| SEXUAL ASSAULT - OTHER | 0.836 | 0.836 | 0.836 | 971 |
| SIMPLE ASSAULT | 0.976 | 0.967 | 0.972 | 4577 |
| STOLEN PROPERTY - RECEIVING | 0.959 | 0.957 | 0.958 | 1193 |
| STOLEN PROPERTY - TRAFFICKING | 0.902 | 0.888 | 0.895 | 491 |
| TAX LAW (FEDERAL ONLY) | 0.373 | 0.233 | 0.286 | 30 |
| TRAFFIC OFFENSES - MINOR | 0.974 | 0.977 | 0.976 | 8699 |
| TRAFFICKING - COCAINE OR CRACK | 0.896 | 0.951 | 0.922 | 185 |
| TRAFFICKING - DRUG UNSPECIFIED | 0.709 | 0.795 | 0.749 | 516 |
| TRAFFICKING - HEROIN | 0.871 | 0.92 | 0.894 | 54 |
| TRAFFICKING - OTHER CONTROLLED SUBSTANCES | 0.963 | 0.954 | 0.959 | 2832 |
| TRAFFICKING MARIJUANA/HASHISH | 0.921 | 0.943 | 0.932 | 255 |
| TRESPASSING | 0.974 | 0.98 | 0.977 | 1916 |
| UNARMED ROBBERY | 0.941 | 0.939 | 0.94 | 377 |
| UNAUTHORIZED USE OF VEHICLE | 0.94 | 0.908 | 0.924 | 304 |
| UNSPECIFIED HOMICIDE | 0.61 | 0.554 | 0.577 | 60 |
| VIOLENT OFFENSES - OTHER | 0.827 | 0.817 | 0.822 | 606 |
| VOLUNTARY/NONNEGLIGENT MANSLAUGHTER | 0.619 | 0.513 | 0.542 | 54 |
| WEAPON OFFENSE | 0.943 | 0.949 | 0.946 | 2466 |

*Note: `support` is the average number of observations per fold on which predictions were made, so the total number of observations per class is roughly 3x `support`.*

### Using Confidence Scores

If we interpret the classification probability as a confidence score, we can use it to filter out predictions that the model is less confident about. We applied this process in 3-fold cross-validation. The numbers presented below indicate how much of the prediction data is retained given a confidence score cutoff of `p`. We present the overall accuracy and MCC metrics as if the model were evaluated only on this subset of confident predictions.

| cutoff | percent retained | mcc | acc |
| ------ | ---------------- | ----- | ----- |
| 0.85 | 0.952 | 0.96 | 0.961 |
| 0.9 | 0.943 | 0.964 | 0.965 |
| 0.95 | 0.928 | 0.97 | 0.971 |
| 0.975 | 0.912 | 0.975 | 0.976 |
| 0.99 | 0.886 | 0.982 | 0.983 |
| 0.999 | 0.733 | 0.995 | 0.996 |

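
The filtering described in this section can be sketched as below. This is an illustration under stated assumptions, not the evaluation code behind the table: the function name `evaluate_at_cutoff` is hypothetical, predictions are assumed to be `(predicted, actual, confidence)` tuples, and a prediction is kept when its confidence is greater than or equal to the cutoff. MCC is omitted for brevity; it could be computed on the same retained subset, for example with scikit-learn's `matthews_corrcoef`.

```python
from typing import List, Tuple

Prediction = Tuple[str, str, float]  # (predicted label, true label, confidence)


def evaluate_at_cutoff(predictions: List[Prediction], cutoff: float) -> Tuple[float, float]:
    """Return (fraction retained, accuracy on the retained subset)."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    # Keep only predictions whose confidence meets the cutoff (assumed >=).
    kept = [(pred, true) for pred, true, conf in predictions if conf >= cutoff]
    retained = len(kept) / len(predictions)
    # Accuracy is undefined when nothing is retained.
    accuracy = sum(pred == true for pred, true in kept) / len(kept) if kept else float("nan")
    return retained, accuracy


# Made-up predictions for illustration only.
example = [
    ("BURGLARY", "BURGLARY", 0.99),
    ("ARSON", "BURGLARY", 0.60),
    ("ARSON", "ARSON", 0.95),
    ("BURGLARY", "ARSON", 0.97),
]
retained, accuracy = evaluate_at_cutoff(example, cutoff=0.95)
print(f"retained={retained:.2f} accuracy={accuracy:.3f}")
```

In practice, predictions that fall below the chosen cutoff would typically be routed to manual review rather than discarded, so the cutoff trades automation coverage against accuracy.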