🆕 Check out our JAMIA paper, which analyzes cross-entropy as an audit log metric in depth. Updated code is available here.
This repo contains code for training and evaluating transformer-based tabular language models for Epic EHR audit logs. You can use several of our pretrained models for entropy estimation or tabular generation, or train your own model from scratch.
Want to see audit log generation and cross-entropy calculation in action? Try out our audit-icu-gpt2-25.3M model on Hugging Face!
Use `pip install -r requirements.txt` to install the required packages. If the dependencies change, run `pipreqs . --savepath requirements.txt --ignore Sophia` to regenerate the requirements file. Use `git submodule update --init --recursive` to pull in Sophia for training.
This project uses pre-commit hooks for `black` if you would like to contribute. To install them, run `pre-commit install`.
Our pretrained models are available on Hugging Face and are mostly compatible with the `transformers` library. Here's a full list of the available models:
| Architecture | # Params | Repository Name |
|---|---|---|
| GPT2 | 25.3M | audit-icu-gpt2-25_3M |
| GPT2 | 46.5M | audit-icu-gpt2-46_5M |
| GPT2 | 89.0M | audit-icu-gpt2-89_0M |
| GPT2 | 131.6M | audit-icu-gpt2-131_6M |
| RWKV | 65.7M | audit-icu-rwkv-65_7M |
| RWKV | 127.2M | audit-icu-rwkv-127_2M |
| LLaMA | 58.1M | audit-icu-llama-58_1M |
| LLaMA | 112.0M | audit-icu-llama-112_0M |
| LLaMA | 219.8M | audit-icu-llama-219_8M |
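As a quick orientation, loading one of these checkpoints follows the usual `transformers` pattern. The sketch below uses a placeholder Hugging Face repository path; substitute the actual namespace under which the model is published. Since these are tabular models, encoding audit-log fields into token ids is handled by the code in this repo rather than a generic text tokenizer (see `entropy.py`).

```python
# Minimal sketch of loading a pretrained checkpoint with transformers.
# "your-hf-namespace/audit-icu-gpt2-25_3M" is a placeholder path -- substitute
# the actual Hugging Face repository for the model you want to use.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-hf-namespace/audit-icu-gpt2-25_3M")
model.eval()
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```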
To use our models for cross-entropy loss, see `entropy.py` for a broad overview of the setup needed. Since they're built with `transformers`, you can also use these models for generative tasks in nearly the same way as any other language model; see `gen.py` for an example of how to do this. A hedged sketch of the general pattern follows below.
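The snippet below is a minimal sketch of how a causal language model in `transformers` is typically scored and sampled; it is not the repo's actual pipeline. The repository path is a placeholder and `input_ids` stands in for an encoded audit-log sequence, so refer to `entropy.py` and `gen.py` for the real setup.

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder repository path -- substitute the actual Hugging Face repo name.
model = AutoModelForCausalLM.from_pretrained("your-hf-namespace/audit-icu-gpt2-25_3M")
model.eval()

# Placeholder input: the real tabular encoding of audit-log rows is handled
# by this repo's code (see entropy.py), not a generic text tokenizer.
input_ids = torch.tensor([[10, 42, 7, 3, 88]])

with torch.no_grad():
    # Passing the inputs as labels makes the model return the mean
    # next-token cross-entropy (in nats) over the shifted sequence.
    outputs = model(input_ids=input_ids, labels=input_ids)
print(f"cross-entropy: {outputs.loss.item():.4f} nats/token")

# Generation works like any other causal LM (see gen.py for the repo's version).
generated = model.generate(input_ids, max_new_tokens=16, do_sample=True)
print(generated[0].tolist())
```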
Please cite our paper if you use this code in your own work:
    @misc{warner2023autoregressive,
      title={Autoregressive Language Models For Estimating the Entropy of Epic EHR Audit Logs},
      author={Benjamin C. Warner and Thomas Kannampallil and Seunghwan Kim},
      year={2023},
      eprint={2311.06401},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }