Conversation
…radation rates to avoid duplicate conflicts
There was a problem hiding this comment.
Pull request overview
This PR refactors the expression database loading scripts to use a staging table + upsert pattern for most tables (refs, genes, expression metadata, and expression data), while keeping the existing TRUNCATE+COPY approach for production and degradation rate tables. The changes introduce a reusable copy_to_staging_then_upsert function and reorganize the execution order to ensure dependencies are loaded in the correct sequence.
Key changes:
- Introduces a new
copy_to_staging_then_upserthelper function in loader.py that creates temporary staging tables, copies data, and performs upserts with conflict resolution - Refactors preprocessing.py to streamline the execution flow and remove the
--genescommand-line flag - Reorders data loading to ensure refs and genes are loaded before expression data that depends on them
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| database/expression-database/scripts/preprocessing.py | Refactored main execution flow to remove --genes flag, reorder processing steps, and add return statements to degradation/production rate functions |
| database/expression-database/scripts/loader.py | Introduced staging+upsert pattern through new helper function; refactored LOAD_REFS, LOAD_GENES, LOAD_EXPRESSION_METADATA, and LOAD_EXPRESSION_DATA to use this pattern; reordered load actions dictionary |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if not genes: | ||
| genes = process_expression_data(source_dir) | ||
| write_genes(genes) No newline at end of file | ||
| write_genes(genes) No newline at end of file |
There was a problem hiding this comment.
The 'genes' variable is not initialized when only --refs or --meta flags are used without --expr, --prod, or --deg. This will cause a NameError when write_genes(genes) is called on line 215. The variable needs to be initialized before the conditional blocks, or write_genes should only be called when genes is defined.
| write_genes(genes) | |
| write_genes(genes) |
| conflict_cols: list[str] | None, | ||
| update_cols: list[str] | None, | ||
| select_exprs: list[str] | None = None, | ||
| ): |
There was a problem hiding this comment.
The copy_to_staging_then_upsert function lacks a docstring explaining its purpose, parameters, and behavior. This is particularly important for a reusable utility function that handles complex upsert logic. Consider adding a docstring that explains the staging table approach, when to use conflict_cols vs update_cols, and what select_exprs is used for.
| ): | |
| ): | |
| """ | |
| Loads data from a source file into a temporary staging table, then inserts or upserts the data into a target table. | |
| This function is intended for bulk loading data into a database using SQL statements. It first creates a temporary | |
| staging table with the specified columns, loads data from the given source file (tab-delimited), and then inserts | |
| the data into the target table. If conflict columns are specified, it performs an upsert (insert or update on conflict). | |
| Parameters: | |
| staging_table (str): Name of the temporary staging table to create. | |
| target_table (str): Name of the target table to insert/upsert data into. | |
| columns (list[str]): List of column names for both the staging and target tables. | |
| source_path (str): Path to the source file containing data to load (tab-delimited, with header). | |
| conflict_cols (list[str] | None): List of columns to check for conflicts (used in ON CONFLICT clause). | |
| If None, a simple insert is performed with no conflict handling. | |
| update_cols (list[str] | None): List of columns to update if a conflict occurs. If provided, an upsert is performed | |
| (ON CONFLICT DO UPDATE SET ...). If None, conflicts are ignored (ON CONFLICT DO NOTHING). | |
| select_exprs (list[str] | None): Optional list of SQL expressions to use in the SELECT statement when inserting | |
| from the staging table. If None, all columns are selected as-is. | |
| Behavior: | |
| - Creates a temporary staging table with all columns as text. | |
| - Loads data from the source file into the staging table using COPY. | |
| - If conflict_cols is None, performs a simple INSERT INTO target_table SELECT ... FROM staging_table. | |
| - If conflict_cols is provided and update_cols is provided, performs an upsert (ON CONFLICT DO UPDATE SET ...). | |
| - If conflict_cols is provided and update_cols is None, performs an insert with ON CONFLICT DO NOTHING. | |
| - select_exprs can be used to transform or cast columns during the insert/upsert. | |
| """ |
| # Always make the genes load first if asking for expression, production, or degradation | ||
| if args.all or args.expr or args.prod or args.deg: | ||
| args.genes = True |
There was a problem hiding this comment.
The logic to automatically enable genes loading (lines 258-260) may cause LOAD_GENES to be called even when only --prod or --deg flags are specified. However, the production and degradation rate tables likely have foreign key constraints referencing the gene table. If the genes.csv file is not generated (which happens in preprocessing.py only when --expr, --prod, or --deg are run), this will fail with a file not found error when trying to load from '../script-results/processed-expression/genes.csv'.
…lder' instead of 'Expression 2020' and add more description on how to update the database and a diagram of the overview of this folder
dondi
left a comment
There was a problem hiding this comment.
From a code style standpoint, LGTM
Presumably you have verified functionality 😊
No description provided.