Skip to content

Update expression database loader script to do staging + upsert for all tables except production/degradation rates#1297

Merged
dondi merged 5 commits intobetafrom
maika-1270
Dec 19, 2025
Merged

Update expression database loader script to do staging + upsert for all tables except production/degradation rates#1297
dondi merged 5 commits intobetafrom
maika-1270

Conversation

@ntran18
Copy link
Collaborator

@ntran18 ntran18 commented Dec 15, 2025

No description provided.

@ntran18 ntran18 changed the base branch from main to beta December 15, 2025 08:09
@ntran18 ntran18 requested review from Copilot and dondi December 15, 2025 08:09
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the expression database loading scripts to use a staging table + upsert pattern for most tables (refs, genes, expression metadata, and expression data), while keeping the existing TRUNCATE+COPY approach for production and degradation rate tables. The changes introduce a reusable copy_to_staging_then_upsert function and reorganize the execution order to ensure dependencies are loaded in the correct sequence.

Key changes:

  • Introduces a new copy_to_staging_then_upsert helper function in loader.py that creates temporary staging tables, copies data, and performs upserts with conflict resolution
  • Refactors preprocessing.py to streamline the execution flow and remove the --genes command-line flag
  • Reorders data loading to ensure refs and genes are loaded before expression data that depends on them

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
database/expression-database/scripts/preprocessing.py Refactored main execution flow to remove --genes flag, reorder processing steps, and add return statements to degradation/production rate functions
database/expression-database/scripts/loader.py Introduced staging+upsert pattern through new helper function; refactored LOAD_REFS, LOAD_GENES, LOAD_EXPRESSION_METADATA, and LOAD_EXPRESSION_DATA to use this pattern; reordered load actions dictionary

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

if not genes:
genes = process_expression_data(source_dir)
write_genes(genes) No newline at end of file
write_genes(genes) No newline at end of file
Copy link

Copilot AI Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 'genes' variable is not initialized when only --refs or --meta flags are used without --expr, --prod, or --deg. This will cause a NameError when write_genes(genes) is called on line 215. The variable needs to be initialized before the conditional blocks, or write_genes should only be called when genes is defined.

Suggested change
write_genes(genes)
write_genes(genes)

Copilot uses AI. Check for mistakes.
conflict_cols: list[str] | None,
update_cols: list[str] | None,
select_exprs: list[str] | None = None,
):
Copy link

Copilot AI Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copy_to_staging_then_upsert function lacks a docstring explaining its purpose, parameters, and behavior. This is particularly important for a reusable utility function that handles complex upsert logic. Consider adding a docstring that explains the staging table approach, when to use conflict_cols vs update_cols, and what select_exprs is used for.

Suggested change
):
):
"""
Loads data from a source file into a temporary staging table, then inserts or upserts the data into a target table.
This function is intended for bulk loading data into a database using SQL statements. It first creates a temporary
staging table with the specified columns, loads data from the given source file (tab-delimited), and then inserts
the data into the target table. If conflict columns are specified, it performs an upsert (insert or update on conflict).
Parameters:
staging_table (str): Name of the temporary staging table to create.
target_table (str): Name of the target table to insert/upsert data into.
columns (list[str]): List of column names for both the staging and target tables.
source_path (str): Path to the source file containing data to load (tab-delimited, with header).
conflict_cols (list[str] | None): List of columns to check for conflicts (used in ON CONFLICT clause).
If None, a simple insert is performed with no conflict handling.
update_cols (list[str] | None): List of columns to update if a conflict occurs. If provided, an upsert is performed
(ON CONFLICT DO UPDATE SET ...). If None, conflicts are ignored (ON CONFLICT DO NOTHING).
select_exprs (list[str] | None): Optional list of SQL expressions to use in the SELECT statement when inserting
from the staging table. If None, all columns are selected as-is.
Behavior:
- Creates a temporary staging table with all columns as text.
- Loads data from the source file into the staging table using COPY.
- If conflict_cols is None, performs a simple INSERT INTO target_table SELECT ... FROM staging_table.
- If conflict_cols is provided and update_cols is provided, performs an upsert (ON CONFLICT DO UPDATE SET ...).
- If conflict_cols is provided and update_cols is None, performs an insert with ON CONFLICT DO NOTHING.
- select_exprs can be used to transform or cast columns during the insert/upsert.
"""

Copilot uses AI. Check for mistakes.
Comment on lines +258 to +260
# Always make the genes load first if asking for expression, production, or degradation
if args.all or args.expr or args.prod or args.deg:
args.genes = True
Copy link

Copilot AI Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic to automatically enable genes loading (lines 258-260) may cause LOAD_GENES to be called even when only --prod or --deg flags are specified. However, the production and degradation rate tables likely have foreign key constraints referencing the gene table. If the genes.csv file is not generated (which happens in preprocessing.py only when --expr, --prod, or --deg are run), this will fail with a file not found error when trying to load from '../script-results/processed-expression/genes.csv'.

Copilot uses AI. Check for mistakes.
@coveralls
Copy link

coveralls commented Dec 15, 2025

Coverage Status

coverage: 80.06%. remained the same
when pulling d1469aa on maika-1270
into beade2c on beta.

…lder' instead of 'Expression 2020' and add more description on how to update the database and a diagram of the overview of this folder
Copy link
Owner

@dondi dondi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a code style standpoint, LGTM

Presumably you have verified functionality 😊

@dondi dondi merged commit a21d51a into beta Dec 19, 2025
5 checks passed
@dondi dondi deleted the maika-1270 branch December 19, 2025 07:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants