Update expression database loader script to do staging + upsert for all tables except production/degradation rates by ntran18 · Pull Request #1297 · dondi/GRNsight

ntran18 · 2025-12-15T08:09:37Z

No description provided.

…radation rates to avoid duplicate conflicts

Copilot

Pull request overview

This PR refactors the expression database loading scripts to use a staging table + upsert pattern for most tables (refs, genes, expression metadata, and expression data), while keeping the existing TRUNCATE+COPY approach for production and degradation rate tables. The changes introduce a reusable copy_to_staging_then_upsert function and reorganize the execution order to ensure dependencies are loaded in the correct sequence.

Key changes:

Introduces a new copy_to_staging_then_upsert helper function in loader.py that creates temporary staging tables, copies data, and performs upserts with conflict resolution
Refactors preprocessing.py to streamline the execution flow and remove the --genes command-line flag
Reorders data loading to ensure refs and genes are loaded before expression data that depends on them

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
database/expression-database/scripts/preprocessing.py	Refactored main execution flow to remove --genes flag, reorder processing steps, and add return statements to degradation/production rate functions
database/expression-database/scripts/loader.py	Introduced staging+upsert pattern through new helper function; refactored LOAD_REFS, LOAD_GENES, LOAD_EXPRESSION_METADATA, and LOAD_EXPRESSION_DATA to use this pattern; reordered load actions dictionary

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-12-15T08:14:10Z

database/expression-database/scripts/preprocessing.py

-        if not genes:
-            genes = process_expression_data(source_dir)
-        write_genes(genes)
+    write_genes(genes)


The 'genes' variable is not initialized when only --refs or --meta flags are used without --expr, --prod, or --deg. This will cause a NameError when write_genes(genes) is called on line 215. The variable needs to be initialized before the conditional blocks, or write_genes should only be called when genes is defined.

Suggested change

write_genes(genes)

write_genes(genes)

Copilot · 2025-12-15T08:14:10Z

database/expression-database/scripts/loader.py

+    conflict_cols: list[str] | None,
+    update_cols: list[str] | None,
+    select_exprs: list[str] | None = None,
+):


The copy_to_staging_then_upsert function lacks a docstring explaining its purpose, parameters, and behavior. This is particularly important for a reusable utility function that handles complex upsert logic. Consider adding a docstring that explains the staging table approach, when to use conflict_cols vs update_cols, and what select_exprs is used for.

Suggested change

):

):

"""

Loads data from a source file into a temporary staging table, then inserts or upserts the data into a target table.

This function is intended for bulk loading data into a database using SQL statements. It first creates a temporary

staging table with the specified columns, loads data from the given source file (tab-delimited), and then inserts

the data into the target table. If conflict columns are specified, it performs an upsert (insert or update on conflict).

Parameters:

staging_table (str): Name of the temporary staging table to create.

target_table (str): Name of the target table to insert/upsert data into.

columns (list[str]): List of column names for both the staging and target tables.

source_path (str): Path to the source file containing data to load (tab-delimited, with header).

conflict_cols (list[str] | None): List of columns to check for conflicts (used in ON CONFLICT clause).

If None, a simple insert is performed with no conflict handling.

update_cols (list[str] | None): List of columns to update if a conflict occurs. If provided, an upsert is performed

(ON CONFLICT DO UPDATE SET ...). If None, conflicts are ignored (ON CONFLICT DO NOTHING).

select_exprs (list[str] | None): Optional list of SQL expressions to use in the SELECT statement when inserting

from the staging table. If None, all columns are selected as-is.

Behavior:

- Creates a temporary staging table with all columns as text.

- Loads data from the source file into the staging table using COPY.

- If conflict_cols is None, performs a simple INSERT INTO target_table SELECT ... FROM staging_table.

- If conflict_cols is provided and update_cols is provided, performs an upsert (ON CONFLICT DO UPDATE SET ...).

- If conflict_cols is provided and update_cols is None, performs an insert with ON CONFLICT DO NOTHING.

- select_exprs can be used to transform or cast columns during the insert/upsert.

"""

Copilot · 2025-12-15T08:14:11Z

database/expression-database/scripts/loader.py

+    # Always make the genes load first if asking for expression, production, or degradation
+    if args.all or args.expr or args.prod or args.deg:
+        args.genes = True


The logic to automatically enable genes loading (lines 258-260) may cause LOAD_GENES to be called even when only --prod or --deg flags are specified. However, the production and degradation rate tables likely have foreign key constraints referencing the gene table. If the genes.csv file is not generated (which happens in preprocessing.py only when --expr, --prod, or --deg are run), this will fail with a file not found error when trying to load from '../script-results/processed-expression/genes.csv'.

coveralls · 2025-12-15T08:26:21Z

coverage: 80.06%. remained the same
when pulling d1469aa on maika-1270
into beade2c on beta.

…lder' instead of 'Expression 2020' and add more description on how to update the database and a diagram of the overview of this folder

dondi

From a code style standpoint, LGTM

Presumably you have verified functionality 😊

ntran18 added 2 commits December 15, 2025 14:02

Fixing the correct order to populate data into database

c872f29

Adding staging + upsert function for all tables except production/deg…

a2cdd4f

…radation rates to avoid duplicate conflicts

ntran18 changed the base branch from main to beta December 15, 2025 08:09

ntran18 requested review from Copilot and dondi December 15, 2025 08:09

Copilot started reviewing on behalf of ntran18 December 15, 2025 08:10 View session

Copilot AI reviewed Dec 15, 2025

View reviewed changes

Adding docstring that Copilot suggested

18d226a

ntran18 added 2 commits December 15, 2025 16:00

Update README.md to add the final source folder in Box to 'Current Fo…

18c9a76

…lder' instead of 'Expression 2020' and add more description on how to update the database and a diagram of the overview of this folder

Indent 'write_genes' function correctly

d1469aa

ntran18 mentioned this pull request Dec 15, 2025

Extend the degradation rates and production rates data from Neymotin to all genes, not just regulatory transcription factors #1270

Closed

dondi approved these changes Dec 19, 2025

View reviewed changes

dondi merged commit a21d51a into beta Dec 19, 2025
5 checks passed

dondi deleted the maika-1270 branch December 19, 2025 07:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update expression database loader script to do staging + upsert for all tables except production/degradation rates#1297

Update expression database loader script to do staging + upsert for all tables except production/degradation rates#1297
dondi merged 5 commits intobetafrom
maika-1270

ntran18 commented Dec 15, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Dec 15, 2025

Uh oh!

Copilot AI Dec 15, 2025

Uh oh!

Copilot AI Dec 15, 2025

Uh oh!

coveralls commented Dec 15, 2025 •

edited

Loading

Uh oh!

dondi left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

-):
+):
+    """
+    Loads data from a source file into a temporary staging table, then inserts or upserts the data into a target table.
+    This function is intended for bulk loading data into a database using SQL statements. It first creates a temporary
+    staging table with the specified columns, loads data from the given source file (tab-delimited), and then inserts
+    the data into the target table. If conflict columns are specified, it performs an upsert (insert or update on conflict).
+    Parameters:
+        staging_table (str): Name of the temporary staging table to create.
+        target_table (str): Name of the target table to insert/upsert data into.
+        columns (list[str]): List of column names for both the staging and target tables.
+        source_path (str): Path to the source file containing data to load (tab-delimited, with header).
+        conflict_cols (list[str] | None): List of columns to check for conflicts (used in ON CONFLICT clause).
+            If None, a simple insert is performed with no conflict handling.
+        update_cols (list[str] | None): List of columns to update if a conflict occurs. If provided, an upsert is performed
+            (ON CONFLICT DO UPDATE SET ...). If None, conflicts are ignored (ON CONFLICT DO NOTHING).
+        select_exprs (list[str] | None): Optional list of SQL expressions to use in the SELECT statement when inserting
+            from the staging table. If None, all columns are selected as-is.
+    Behavior:
+        - Creates a temporary staging table with all columns as text.
+        - Loads data from the source file into the staging table using COPY.
+        - If conflict_cols is None, performs a simple INSERT INTO target_table SELECT ... FROM staging_table.
+        - If conflict_cols is provided and update_cols is provided, performs an upsert (ON CONFLICT DO UPDATE SET ...).
+        - If conflict_cols is provided and update_cols is None, performs an insert with ON CONFLICT DO NOTHING.
+        - select_exprs can be used to transform or cast columns during the insert/upsert.
+    """

Conversation

ntran18 commented Dec 15, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Dec 15, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 15, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 15, 2025

Choose a reason for hiding this comment

Uh oh!

coveralls commented Dec 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dondi left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

coveralls commented Dec 15, 2025 •

edited

Loading