LLM x DB

This repository is continuously updated with work on (1) Large Language Models for Databases (LLM4DB) and (2) Databases for Large Language Models (DB4LLM), based on our past tutorials.

Kindly let us know if we have missed any great papers. Thank you!

0. System & Review

NeurDB: On the Design and Implementation of an AI-powered Autonomous Database

Zhanhao Zhao, Shaofeng Cai, Haotian Gao, Hexiang Pan, Siqi Xiang, Naili Xing, Gang Chen, Beng Chin Ooi, Yanyan Shen, Yuncheng Wu, Meihui Zhang. CIDR 2025. [pdf]

How Large Language Models Will Disrupt Data Management

Raul Castro Fernandez, Aaron J. Elmore, Michael J. Franklin, Sanjay Krishnan, Chenhao Tan. VLDB 2023. [pdf]

From Large Language Models to Databases and Back: A Discussion on Research and Education

Sihem Amer-Yahia, Angela Bonifati, Lei Chen, Guoliang Li, Kyuseok Shim, Jianliang Xu, Xiaochun Yang. SIGMOD Record. [pdf]

Applications and Challenges for Large Language Models: From Data Management Perspective

Meihui Zhang, Zhaoxuan Ji, Zhaojing Luo, Yuncheng Wu, Chengliang Chai. ICDE 2024. [pdf]

Demystifying Data Management for Large Language Models

Xupeng Miao, Zhihao Jia, and Bin Cui. SIGMOD 2024. [pdf]

DB-GPT: Large Language Model Meets Database

Xuanhe Zhou, Zhaoyan Sun, Guoliang Li. Data Science and Engineering 2023. [pdf]

LLM-Enhanced Data Management

Xuanhe Zhou, Xinyang Zhao, Guoliang Li. arxiv 2024. [pdf]

Can Foundation Models Wrangle Your Data?

Avanika Narayan, Ines Chami, Laurel J. Orr, Christopher Ré. VLDB 2022. [pdf]

Data Management For Training Large Language Models: A Survey

Zige Wang, Wanjun Zhong, Yufei Wang, Qi Zhu, Fei Mi, Baojun Wang, Lifeng Shang, Xin Jiang, Qun Liu. arxiv 2024. [pdf]

When Large Language Models Meet Vector Databases: A Survey

Zhi Jing, Yongye Su, Yikun Han, Bo Yuan, Haiyun Xu, Chunjiang Liu, Kehai Chen, Min Zhang. arxiv 2024. [pdf]

From BERT to GPT-3 Codex: Harnessing the Potential of Very Large Language Models for Data Management

Immanuel Trummer. VLDB 2022. [pdf]

1. LLM for Data Processing

There are many relevant works in this area; we prioritize papers from the database field.

1.1 Data Cleaning

Jellyfish: A Large Language Model for Data Preprocessing

Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada. arxiv 2024. [pdf]

LLMs with User-defined Prompts as Generic Data Operators for Reliable Data Processing

Luyi Ma, Nikhil Thakurdesai, Jiao Chen, Jianpeng Xu, Evren Körpeoglu, Sushant Kumar, Kannan Achan. IEEE Big Data 2023. [pdf]

CleanAgent: Automating Data Standardization with LLM-based Agents

Danrui Qi, Jiannan Wang. arxiv 2024. [pdf]

LLMClean: Context-Aware Tabular Data Cleaning via LLM-Generated OFDs

Fabian Biester, Mohamed Abdelaal, Daniel Del Gaudio. arxiv 2024. [pdf]

SEED: Domain-Specific Data Curation With Large Language Models

Zui Chen, Lei Cao, Sam Madden, Tim Kraska, Zeyuan Shang, Ju Fan, Nan Tang, Zihui Gu, Chunwei Liu, Michael Cafarella. arxiv 2023. [pdf]

Large Language Models as Data Preprocessors

Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada. arxiv 2023. [pdf]

Data Cleaning Using Large Language Models

Shuo Zhang, Zezhou Huang, Eugene Wu. arxiv 2024. [pdf]

1.2 Entity Matching

Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration

Meihao Fan, Xiaoyue Han, Ju Fan, Chengliang Chai, Nan Tang, Guoliang Li, Xiaoyong Du. ICDE 2024. [pdf]

In Situ Neural Relational Schema Matcher

Xingyu Du, Gongsheng Yuan, Sai Wu, Gang Chen, and Peng Lu. ICDE 2024. [pdf]

Match, Compare, or Select? An Investigation of Large Language Models for Entity Matching

Tianshu Wang, Hongyu Lin, Xiaoyang Chen, Xianpei Han, Hao Wang, Zhenyu Zeng, Le Sun. arxiv 2024. [pdf]

KcMF: A Knowledge-compliant Framework for Schema and Entity Matching with Fine-tuning-free LLMs

Yongqin Xu, Huan Li, Ke Chen, Lidan Shou. arxiv 2024. [pdf]

Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration

Jianhong Tu, Ju Fan, Nan Tang, Peng Wang, Guoliang Li, Xiaoyong Du. SIGMOD 2023. [pdf]

Entity Matching using Large Language Models

Ralph Peeters, Christian Bizer. arxiv 2023. [pdf]

Deep Entity Matching with Pre-Trained Language Models

Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, Wang-Chiew Tan. VLDB 2021. [pdf]

Dual-Objective Fine-Tuning of BERT for Entity Matching

Ralph Peeters, Christian Bizer. VLDB 2021. [pdf]

1.3 Schema Matching

Schema Matching with Large Language Models: an Experimental Study

Marcel Parciak, Brecht Vandevoort, Frank Neven, Liesbet M. Peeters, Stijn Vansummeren. arxiv 2024. [pdf]

Schema Matching using Pre-Trained Language Models

Yunjia Zhang, Avrilia Floratou, Joyce Cahoon, Subru Krishnan, Andreas C. Müller, Dalitso Banda, Fotis Psallidas, Jignesh M. Patel. ICDE 2023. [pdf]

KcMF: A Knowledge-compliant Framework for Schema and Entity Matching with Fine-tuning-free LLMs

Yongqin Xu, Huan Li, Ke Chen, Lidan Shou. arxiv 2024. [pdf]

1.4 Data Discovery

CHORUS: Foundation Models for Unified Data Discovery and Exploration

Moe Kayali, Anton Lykov, Ilias Fountalis, Nikolaos Vasiloglou, Dan Olteanu, Dan Suciu. VLDB 2024. [pdf]

Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes

Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, Christopher Ré. VLDB 2024. [pdf]

DeepJoin: Joinable Table Discovery with Pre-trained Language Models

Yuyang Dong, Chuan Xiao, Takuma Nozawa, Masafumi Enomoto, Masafumi Oyamada. VLDB 2023. [pdf]

2. LLM for Database Optimization

2.1 Knob Tuning

λ-Tune: Harnessing Large Language Models for Automated Database System Tuning

Victor Giannakouris, Immanuel Trummer. SIGMOD 2025. [pdf]

LATuner: An LLM-Enhanced Database Tuning System Based on Adaptive Surrogate Model

Fan C, Pan Z, Sun W, et al. Joint European Conference on Machine Learning and Knowledge Discovery in Databases 2024. [pdf]

LLMTune: Accelerate Database Knob Tuning with Large Language Models

Huang X, Li H, Zhang J, et al. arXiv 2024. [pdf]

Is Large Language Model Good at Database Knob Tuning? A Comprehensive Experimental Evaluation

Yiyan Li, Haoyang Li, Zhao Pu, Jing Zhang, Xinyi Zhang, Tao Ji, Luming Sun, Cuiping Li, Hong Chen. arXiv 2024. [pdf]

GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization

Jiale Lao, Yibo Wang, Yufei Li, Jianping Wang, Yunjia Zhang, Zhiyuan Cheng, Wanghu Chen, Mingjie Tang, Jianguo Wang. VLDB 2024. [pdf]

DB-BERT: a Database Tuning Tool that “Reads the Manual”

Immanuel Trummer. SIGMOD 2022. [pdf]

2.2 Query Optimization

LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency

Zhaodonghui Li, Haitao Yuan, Huiming Wang, Gao Cong, Lidong Bing. VLDB 2024. [pdf]

The Unreasonable Effectiveness of LLMs for Query Optimization

Peter Akioyamen, Zixuan Yi, Ryan Marcus. NeurIPS 2024 (Workshop). [pdf]

2.3 Database Diagnosis

Panda: Performance Debugging for Databases using LLM Agents

Vikramank Singh, Kapil Eknath Vaidya, ..., Tim Kraska. CIDR 2024. [pdf]

LLM As DBA

Xuanhe Zhou, Guoliang Li, Zhiyuan Liu. arXiv 2023. [pdf]

D-Bot: Database Diagnosis System using Large Language Models

Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, Zhiyuan Liu, Weize Chen, et al. VLDB 2024. [pdf] [code]

3. LLM for Data Analysis

3.1 NL2SQL

Text2SQL is Not Enough: Unifying AI and Databases with TAG

Asim Biswal, Siddharth Jha, Carlos Guestrin, Matei Zaharia, Joseph E Gonzalez, Amog Kamsetty, Shu Liu, Liana Patel. CIDR 2025. [pdf]

The Dawn of Natural Language to SQL: Are We Fully Ready?

Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, Nan Tang. VLDB 2024. [pdf]

PURPLE: Making a Large Language Model a Better SQL Writer

Tonghui Ren, Yuankai Fan, Zhenying He, Ren Huang, Jiaqi Dai, Can Huang, Yinan Jing, Kai Zhang, Yifan Yang, X. Sean Wang. ICDE 2024. [pdf]

SM3-Text-to-Query: Synthetic Multi-Model Medical Text-to-Query Benchmark

Sithursan Sivasubramaniam, Cedric Osei-Akoto, Yi Zhang, Kurt Stockinger, Jonathan Fuerst. NeurIPS 2024. [pdf]

Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows

Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, Tao Yu. arxiv 2024. [pdf]

SiriusBI: Building End-to-End Business Intelligence Enhanced by Large Language Models

Jie Jiang, Haining Xie, Yu Shen, Zihan Zhang, Meng Lei, Yifeng Zheng, Yide Fang, Chunyou Li, Danqing Huang, Wentao Zhang, Yang Li, Xiaofeng Yang, Bin Cui, Peng Chen. arxiv 2024. [pdf]

Grounding Natural Language to SQL Translation with Data-Based Self-Explanations

Yuankai Fan, Tonghui Ren, Can Huang, Zhenying He, X. Sean Wang. arxiv 2024. [pdf]

LR-SQL: A Supervised Fine-Tuning Method for Text2SQL Tasks under Low-Resource Scenarios

Wen Wuzhenghong, Zhang Yongpan, Pan Su, Sun Yuwei, Lu Pengwei, Ding Cheng. arxiv 2024. [pdf]

CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL

Mohammadreza Pourreza, Hailong Li, Ruoxi Sun, Yeounoh Chung, Shayan Talaei, Gaurav Tarlok Kakkar, Yu Gan, Amin Saberi, Fatma Ozcan, Sercan O. Arik. arxiv 2024. [pdf]

MoMQ: Mixture-of-Experts Enhances Multi-Dialect Query Generation across Relational and Non-Relational Databases

Zhisheng Lin, Yifu Liu, Zhiling Luo, Jinyang Gao, Yu Li. arxiv 2024. [pdf]

From BERT to GPT-3 Codex: Harnessing the Potential of Very Large Language Models for Data Management

Immanuel Trummer. VLDB 2022. [pdf]

Few-shot Text-to-SQL Translation using Structure and Content Prompt Learning

Zihui Gu, Ju Fan, Nan Tang, et al. SIGMOD 2023. [pdf]

3.2 Data Exploration

DB-GPT: Empowering Database Interactions with Private Large Language Models

Siqiao Xue, Caigao Jiang, Wenhui Shi, Fangyin Cheng, Keting Chen, Hongjun Yang, Zhiping Zhang, Jianshan He, Hongyang Zhang, Ganglin Wei, Wang Zhao, Fan Zhou, Danrui Qi, Hong Yi, Shaodong Liu, Faqiang Chen. arxiv 2023. [pdf]

3.3 Data Visualization

LLM4Vis: Explainable Visualization Recommendation using ChatGPT

Lei Wang, Songheng Zhang, Yun Wang, Ee-Peng Lim, Yong Wang. EMNLP 2023. [pdf]

4. Data Management for LLM

Data-Juicer: A One-Stop Data Processing System for Large Language Models

Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, Yaliang Li, Bolin Ding, Jingren Zhou. SIGMOD 2024. [pdf]

CoachLM: Automatic Instruction Revisions Improve the Data Quality in LLM Instruction Tuning

Yilun Liu, Shimin Tao, Xiaofeng Zhao, Ming Zhu, Wenbing Ma, Junhao Zhu, Chang Su, et al. ICDE 2024. [pdf]

Relational Database Augmented Large Language Model

Zongyue Qin, Chen Luo, Zhengyang Wang, Haoming Jiang, Yizhou Sun. arxiv 2024. [pdf]

Survey of Vector Database Management Systems

James Jie Pan, Jianguo Wang, Guoliang Li. arxiv 2023. [pdf]
