1 stable release: 1.0.0 (Jan 24, 2026)
hashfasta
hash fasta (faster?)
Very quickly compute hashes for FASTA/FASTQ files considering only the sequence content.
Hashfasta produces a hash for an input file that is stable and dependent only on the sequence content. It ignores:
- FASTA/FASTQ headers
- Quality scores
- Read order
It can optionally consider only the canonical sequence (the lexicographically smaller of the forward and reverse complement) and/or normalise sequences before hashing. See Sequence handling options.
It supports FASTA and FASTQ files (optionally gzip-compressed) as input, and can read over HTTP/HTTPS, over SSH, and from stdin.
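The properties above (no dependence on headers, quality scores, or read order) can be illustrated with a minimal sketch. This is not hashfasta's actual implementation: the choice of BLAKE2b and of wrapping addition as the combining step are assumptions made for illustration, and `record_hash` and `aggregate_hash` are hypothetical names.

```python
import hashlib

MASK_64 = 0xFFFFFFFFFFFFFFFF

def record_hash(sequence: str) -> int:
    """Hash one sequence to a 64-bit integer (BLAKE2b is an assumption)."""
    digest = hashlib.blake2b(sequence.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def aggregate_hash(sequences) -> str:
    """Combine per-record hashes with wrapping 64-bit addition.

    Addition is commutative, so the result does not depend on record order.
    """
    total = 0
    for sequence in sequences:
        total = (total + record_hash(sequence)) & MASK_64
    return f"{total:016x}"

# Only sequences are passed in, so headers and quality scores cannot
# influence the result.
a = aggregate_hash(["ACGT", "GGCC", "TTAA"])
b = aggregate_hash(["TTAA", "ACGT", "GGCC"])
print(a == b)  # True
```

Any commutative combining step would give the same order independence; addition is used here only because it is easy to read.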
Installation
Binaries:
Precompiled binaries for Linux, macOS and Windows are attached to the latest release.
Cargo:
Requires cargo
cargo install hashfasta
Build from source:
Install the Rust toolchain:
To install, please refer to the Rust documentation.
Clone the repository:
git clone https://github.com/Sam-Sims/hashfasta
Build and add to path:
cd hashfasta
cargo build --release
export PATH=$PATH:$(pwd)/target/release
All executables will be in the directory hashfasta/target/release.
Usage
Very quickly compute hashes for FASTX files considering **only** the sequence content.
Usage: hashfasta <COMMAND>
Commands:
hash Hash every record in the input and output a final aggregate_hash representing the sequence content of the entire file.
unique Output only the records whose sequences are unique within the input.
duplicate Output only the records whose sequences are duplicates of earlier records.
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
-V, --version Print version
Subcommands
Each subcommand takes a single positional argument <FASTX> which is the path/URL to the input FASTA/FASTQ file (or - for stdin).
By default, output is written to stdout in tab-separated format. Use --json to output in JSON format. See Output Format.
hash
Hash every record in the input and output a final "aggregate_hash" representing the sequence content of the entire file. Suppress per-record output with -q/--quiet.
hashfasta hash [OPTIONS] <FASTX>
unique
Output only the records whose sequences are unique within the input. Useful for deduplicating records.
hashfasta unique [OPTIONS] <FASTX>
duplicate
Output only the records whose sequences are duplicates of earlier records.
hashfasta duplicate [OPTIONS] <FASTX>
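One way to picture how unique and duplicate relate is the sketch below. It assumes that unique emits the first record seen for each distinct sequence and that duplicate emits every later record with an already-seen sequence; the README does not spell this out, so treat it as an interpretation. `split_unique_duplicate` is a hypothetical helper, not part of hashfasta.

```python
def split_unique_duplicate(records):
    """Partition (id, sequence) pairs by whether the sequence was seen before.

    Assumption: the first occurrence counts as "unique", later ones as
    "duplicate". Comparison is on sequence content only, never on the id.
    """
    seen = set()
    unique, duplicate = [], []
    for record_id, sequence in records:
        if sequence in seen:
            duplicate.append(record_id)
        else:
            seen.add(sequence)
            unique.append(record_id)
    return unique, duplicate

records = [("r1", "ACGT"), ("r2", "GGCC"), ("r3", "ACGT")]
print(split_unique_duplicate(records))  # (['r1', 'r2'], ['r3'])
```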
Options
All subcommands share the following options:
- -c, --canonical: Use the canonical sequence. See Canonicalisation.
- -n, --normalise: Normalise sequences before hashing. See Normalisation.
- -s, --strict: Fail if non-ACGTUN- bases are encountered.
- -t, --threads: Number of threads [default: 1].
- -j, --json: Output records as JSON instead of TSV.
- -q, --quiet: Only print the aggregate hash (suppress record-level output).
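The --strict check can be pictured as a simple alphabet test. The sketch below is illustrative only: `check_strict` is a hypothetical name, and accepting lowercase bases by uppercasing them first is an assumption, since the option description does not say how case is handled.

```python
ALLOWED_BASES = frozenset("ACGTUN-")

def check_strict(record_id: str, sequence: str) -> None:
    """Raise ValueError at the first base outside ACGTUN-.

    Assumption: lowercase letters are compared after uppercasing.
    """
    for position, base in enumerate(sequence):
        if base.upper() not in ALLOWED_BASES:
            raise ValueError(
                f"invalid base {base!r} at position {position} in record {record_id}"
            )

check_strict("seq1", "ACGTN-U")  # passes silently
try:
    check_strict("seq2", "ACGX")
except ValueError as error:
    print(error)  # invalid base 'X' at position 3 in record seq2
```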
Sequence handling options
Normalisation (-n / --normalise)
Normalisation ensures that sequences are treated consistently regardless of case or RNA/DNA differences. It performs the following:
- Converts all characters to uppercase.
- Converts Uracil (U) to Thymine (T).
- Masks any non-standard nucleotides (i.e. characters other than ACGTU-) as N.
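The three normalisation steps above can be sketched as follows. This is a minimal illustration written from the description, not the tool's own code, and `normalise` is a hypothetical function name.

```python
def normalise(sequence: str) -> str:
    """Uppercase, convert U to T, and mask anything outside ACGTU- as N."""
    out = []
    for base in sequence.upper():
        if base == "U":
            out.append("T")   # RNA to DNA
        elif base in "ACGT-":
            out.append(base)  # standard bases and gaps pass through
        else:
            out.append("N")   # everything else is masked
    return "".join(out)

print(normalise("acgu-RY"))  # ACGT-NN
```

After this step, "acgu" and "ACGT" are the same string and so produce the same hash.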
Canonicalisation (-c / --canonical)
Canonicalisation does the following:
- Generates the reverse complement of the sequence.
- Compares the original sequence with its reverse complement lexicographically.
- Hashes the "smaller" of the two sequences.
This ensures that a sequence and its reverse complement will always produce the same hash.
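The steps above can be sketched as below, for uppercase DNA only. Leaving characters other than A, C, G and T unchanged when complementing is an assumption; the README does not say how the tool treats them. `reverse_complement` and `canonical` are hypothetical names.

```python
# Complement table for uppercase DNA; other characters are left unchanged
# (an assumption for this sketch).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(sequence: str) -> str:
    """Complement every base, then reverse the string."""
    return sequence.translate(_COMPLEMENT)[::-1]

def canonical(sequence: str) -> str:
    """Return the lexicographically smaller of a sequence and its reverse complement."""
    return min(sequence, reverse_complement(sequence))

print(canonical("TTGA"))  # TCAA
print(canonical("TCAA"))  # TCAA
```

Both strands map to the same string, so hashing the canonical form gives the same value for either orientation.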
Examples
# Generate hashes for a single fasta file
hashfasta hash sequences.fasta
# Hash a compressed file normalising sequences, and consider canonical sequences
hashfasta hash -cn reads.fq.gz
# Fail early if invalid bases are encountered
hashfasta hash --strict input.fasta
# Output as JSON, instead of TSV
hashfasta hash --json input.fasta
# Read from stdin
tar -xOf collection.tar.gz | hashfasta hash -
# Read from HTTP
hashfasta hash https://example.com/sequences.fasta
# Read from SSH
hashfasta hash ssh://user@host/path/to/sequences.fasta
Output Format
Default (TSV):
id hash
seq1 537edb87f29c16e5
seq2 2cbf6181d9bdb039
aggregate_hash 02402baf8722f975
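For downstream scripting, the TSV output could be read with a few lines of Python. The sketch assumes the layout shown above: a header row, one tab-separated id/hash pair per record, and a final aggregate_hash row. `parse_tsv` is a hypothetical helper.

```python
import csv
import io

def parse_tsv(text: str):
    """Return ({record_id: hash}, aggregate_hash) from hashfasta-style TSV text."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    records = {}
    aggregate = None
    for row in rows[1:]:  # skip the "id<TAB>hash" header row
        if len(row) != 2:
            continue
        key, value = row
        if key == "aggregate_hash":
            aggregate = value
        else:
            records[key] = value
    return records, aggregate

sample = (
    "id\thash\n"
    "seq1\t537edb87f29c16e5\n"
    "seq2\t2cbf6181d9bdb039\n"
    "aggregate_hash\t02402baf8722f975\n"
)
records, aggregate = parse_tsv(sample)
print(aggregate)  # 02402baf8722f975
```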
JSON (--json):
{
  "records": [
    {
      "id": "seq1",
      "hash": "537edb87f29c16e5"
    },
    {
      "id": "seq2",
      "hash": "2cbf6181d9bdb039"
    }
  ],
  "aggregate_hash": "02402baf8722f975"
}
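A typical use of the JSON output is to check whether two files hold the same sequence content by comparing their aggregate hashes. The sketch below assumes the JSON shape shown above; `aggregate_from_json` and `same_sequence_content` are hypothetical helpers.

```python
import json

def aggregate_from_json(text: str) -> str:
    """Extract the aggregate hash from hashfasta-style JSON output."""
    return json.loads(text)["aggregate_hash"]

def same_sequence_content(json_a: str, json_b: str) -> bool:
    """True if two JSON outputs report the same aggregate hash."""
    return aggregate_from_json(json_a) == aggregate_from_json(json_b)

run_a = '{"records": [{"id": "seq1", "hash": "537edb87f29c16e5"}], "aggregate_hash": "02402baf8722f975"}'
run_b = '{"records": [{"id": "renamed", "hash": "537edb87f29c16e5"}], "aggregate_hash": "02402baf8722f975"}'
print(same_sequence_content(run_a, run_b))  # True
```

The comparison is only meaningful when both runs used the same -c and -n settings, since those options change what is hashed.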