License: BSD-3-Clause

HyperGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors

HyperGen is a Rust library for sketching genome files and rapidly approximating Average Nucleotide Identity (ANI). It combines two algorithms: (1) FracMinHash and (2) hyperdimensional computing (HDC) with random indexing, as illustrated in the overview figure in the repository.

HyperGen first samples the k-mer set using FracMinHash. The sampled k-mer hashes are then encoded into hyperdimensional vectors (HVs) using HDC encoding, which gives a better trade-off among ANI estimation quality, sketch size, and computation speed. Sketches generated by HyperGen are 1.8× to 2.7× smaller than those of Mash and Dashing 2. ANI estimation in HyperGen reduces to highly vectorized vector multiplication, and its database search on large-scale datasets is up to 4.3× faster than Dashing 2.
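The two stages described above can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not HyperGen's actual code: the function names (`mix64`, `hash_kmer`, `frac_minhash`, `encode_hv`, `estimate_intersection`), the FNV/SplitMix-style hash, and the choice of ±1 pseudo-random vectors seeded by each hash are all hypothetical stand-ins. Canonical (strand-independent) k-mers are omitted for brevity.

```rust
// Illustrative sketch only: names, hash function, and PRNG are assumptions
// and are not taken from the HyperGen source code.

/// SplitMix64-style mixer, used here as a stand-in hash and PRNG step.
fn mix64(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hash one k-mer: FNV-1a fold over the bytes, then the mixer.
fn hash_kmer(kmer: &[u8]) -> u64 {
    let mut h: u64 = 0xCBF2_9CE4_8422_2325;
    for &b in kmer {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x0000_0100_0000_01B3);
    }
    mix64(h)
}

/// FracMinHash: keep every k-mer hash at or below `u64::MAX / scaled`,
/// i.e. roughly a 1/scaled fraction of the distinct k-mers.
fn frac_minhash(seq: &[u8], k: usize, scaled: u64) -> Vec<u64> {
    assert!(k > 0 && scaled > 0);
    let threshold = u64::MAX / scaled;
    let mut kept: Vec<u64> = seq
        .windows(k)
        .map(|w| hash_kmer(w))
        .filter(|&h| h <= threshold)
        .collect();
    kept.sort_unstable();
    kept.dedup();
    kept
}

/// HDC encoding: each hash seeds a pseudo-random +1/-1 vector of length
/// `dim`; the sketch is the element-wise sum of those vectors.
fn encode_hv(hashes: &[u64], dim: usize) -> Vec<i32> {
    let mut hv = vec![0i32; dim];
    for &h in hashes {
        let mut state = h;
        let mut bits = 0u64;
        for (i, slot) in hv.iter_mut().enumerate() {
            if i % 64 == 0 {
                // Draw 64 fresh pseudo-random bits.
                state = mix64(state);
                bits = state;
            }
            *slot += if bits & 1 == 1 { 1 } else { -1 };
            bits >>= 1;
        }
    }
    hv
}

/// Dot product of two sketches, accumulated in i64.
fn dot(a: &[i32], b: &[i32]) -> i64 {
    a.iter().zip(b).map(|(&x, &y)| i64::from(x) * i64::from(y)).sum()
}

/// Because independent random +1/-1 vectors are nearly orthogonal, the dot
/// product divided by `dim` approximates the number of shared hashes.
fn estimate_intersection(a: &[i32], b: &[i32]) -> f64 {
    dot(a, b) as f64 / a.len() as f64
}
```

Calling `encode_hv(&frac_minhash(seq, 21, 1500), 4096)` mirrors the default `--ksize`, `--scaled`, and `--hv_d` values listed under Usage. The sketch has a fixed length of `dim` regardless of how many hashes FracMinHash keeps, which is the property that makes the comparison a plain vector multiplication.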

Quickstart

Installation

HyperGen requires the Rust toolchain and Cargo to be installed. We recommend installing HyperGen with the following commands:

git clone https://github.com/wh-xu/Hyper-Gen.git
cd Hyper-Gen

cargo install --path . --root ~/.cargo

Usage

The current version supports the following functions:

1. Genome sketching

Example:
hyper-gen-rust sketch -p ./data/*.fna -o ./fna.sketch

Options:
-p, --path <PATH>               Input folder path to sketch
-o, --out <OUT>                 Output path 
-t, --thread <THREAD>           Threads used for computation [default: 16]
-k, --ksize <KSIZE>             k-mer size for sketching [default: 21]
-s, --scaled <SCALED>           Scaled factor for FracMinHash [default: 1500]
-d, --hv_d <HD_D>               Dimension for hypervector [default: 4096]

2. ANI estimation and database search

Example:
hyper-gen-rust dist -r fna1.sketch -q fna2.sketch -o output.ani

Options:
-r, --path_r <PATH_R>           Path to ref sketch file
-q, --path_q <PATH_Q>           Path to query sketch file
-o, --out <OUT>                 Output path 
-t, --thread <THREAD>           Threads used for computation [default: 16]
--ani_th <ANI_TH>               ANI threshold [default: 85.0]
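To make the ANI step concrete, the following sketch shows one plausible way to turn two sketch hypervectors into an ANI percentage and apply a threshold such as the default `--ani_th` of 85.0. It rests on two assumptions that are not confirmed by this README: that the Jaccard index is approximated from dot products (shared hashes over the union), and that ANI follows the Mash-style relation ANI = 1 + (1/k) ln(2J / (1 + J)). All function names are hypothetical.

```rust
// Illustrative sketch only: the formulas and names below are assumptions,
// not HyperGen's confirmed implementation.

/// Dot product of two sketch hypervectors, accumulated in i64.
fn dot(a: &[i32], b: &[i32]) -> i64 {
    a.iter().zip(b).map(|(&x, &y)| i64::from(x) * i64::from(y)).sum()
}

/// Approximate Jaccard index from hypervectors. <a,b> approximates the
/// intersection size and <a,a>, <b,b> the set sizes (all scaled by the
/// dimension, which cancels in the ratio).
fn jaccard_from_hv(a: &[i32], b: &[i32]) -> f64 {
    let ab = dot(a, b) as f64;
    let union = dot(a, a) as f64 + dot(b, b) as f64 - ab;
    if union <= 0.0 {
        0.0
    } else {
        (ab / union).clamp(0.0, 1.0)
    }
}

/// Mash-style conversion from a Jaccard index to ANI in percent,
/// for k-mer size `k`. Returns 0.0 when no similarity is detected.
fn ani_percent(jaccard: f64, k: usize) -> f64 {
    if jaccard <= 0.0 {
        return 0.0;
    }
    let distance = -(1.0 / k as f64) * (2.0 * jaccard / (1.0 + jaccard)).ln();
    (100.0 * (1.0 - distance)).clamp(0.0, 100.0)
}

/// Keep only reference indices whose estimated ANI against the query
/// meets the threshold (in percent).
fn search(query: &[i32], refs: &[Vec<i32>], k: usize, ani_th: f64) -> Vec<(usize, f64)> {
    refs.iter()
        .enumerate()
        .map(|(i, r)| (i, ani_percent(jaccard_from_hv(query, r), k)))
        .filter(|&(_, ani)| ani >= ani_th)
        .collect()
}
```

With k = 21, a Jaccard index of 0.5 maps to an ANI of roughly 98.07% under this formula, which illustrates why even modest k-mer overlap corresponds to high nucleotide identity.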

Differences between Mash and HyperGen

  • (a) Mash uses MinHash to sample the k-mer hash set and stores the discrete hash values as the genome sketch.
  • (b) HyperGen uses FracMinHash to sample the k-mer hash set and encodes the discrete hash values into a continuous sketch hypervector.
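The difference between the two sampling schemes can be illustrated with a short Rust sketch. It assumes the common bottom-s formulation of MinHash (keep the s smallest hashes) and the usual FracMinHash rule (keep every hash at or below a fixed threshold). The function names are hypothetical and are not taken from Mash or HyperGen.

```rust
// Illustrative sketch only: names and formulations are assumptions.

/// Bottom-s MinHash: keep the `s` smallest distinct hashes, so the sketch
/// has at most `s` entries no matter how large the genome is.
fn bottom_s_minhash(hashes: &[u64], s: usize) -> Vec<u64> {
    let mut v = hashes.to_vec();
    v.sort_unstable();
    v.dedup();
    v.truncate(s);
    v
}

/// FracMinHash: keep every distinct hash at or below `u64::MAX / scaled`,
/// so the number of kept hashes grows with the genome (about 1/scaled of
/// the distinct k-mers). `scaled` must be at least 1.
fn frac_minhash_filter(hashes: &[u64], scaled: u64) -> Vec<u64> {
    assert!(scaled > 0);
    let threshold = u64::MAX / scaled;
    let mut v: Vec<u64> = hashes.iter().copied().filter(|&h| h <= threshold).collect();
    v.sort_unstable();
    v.dedup();
    v
}
```

In this formulation the MinHash output is itself the stored sketch, while the variable-length FracMinHash output is an intermediate that is subsequently folded into a fixed-dimension hypervector.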

Publication

  1. Weihong Xu, Po-kai Hsu, Niema Moshiri, Shimeng Yu, and Tajana Rosing. "HyperGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors." Under review.

Contact

For more information, post an issue or send an email to [email protected].
