7 releases
Uses new Rust 2024
| Version | Released |
|---|---|
| 0.1.7 | Jan 15, 2026 |
| 0.1.6 | Nov 28, 2025 |
| 0.1.0 | Oct 29, 2025 |
bamslice
Extract specific byte ranges from BAM/CRAM files and convert to interleaved FASTQ format. Designed for parallel processing across compute nodes without requiring pre-indexing.
Features
- No pre-indexing required - accepts approximate byte offsets
- Auto-aligns to block boundaries - finds the next valid BGZF block at or after the start offset
- Byte-range based - process arbitrary byte ranges for easy parallelization
- No overlap - using contiguous byte ranges guarantees no duplicate reads
- Interleaved FASTQ output - same format as samtools fastq
- Parallel-ready - designed for distributed processing
Installation
cargo build --release
Binary: target/release/bamslice
Usage
bamslice \
--input input.bam \
--start-offset 0 \
--end-offset 10000000 \
--output output.fastq
Arguments
- --input, -i: Input BAM file
- --start-offset, -s: Starting byte offset (will find next BGZF block at or after this offset)
- --end-offset, -e: Ending byte offset (will stop when reaching a block at or after this offset)
- --output, -o: Output FASTQ file (default: stdout)
Examples
Extract first half of file:
FILE_SIZE=$(stat -f%z input.bam) # macOS
# FILE_SIZE=$(stat -c%s input.bam) # Linux
HALF=$((FILE_SIZE / 2))
bamslice -i input.bam -s 0 -e $HALF -o first_half.fastq
Extract second half (no overlap!):
bamslice -i input.bam -s $HALF -e $FILE_SIZE -o second_half.fastq
Output to stdout:
bamslice -i input.bam -s 0 -e 1000000 | head -n 4
Parallel Processing
The tool works on byte ranges, so jobs can be parallelized without any coordination between them.
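As a rough sketch of how a driver program might pick ranges, the file size can be cut into contiguous chunks so that each end offset is the next start offset. The helpers `chunk_ranges` and `command_for` and the `part_N.fastq` output names are hypothetical and not part of bamslice; only the `-i`, `-s`, `-e` and `-o` flags come from the Usage section.

```rust
/// Split `file_size` bytes into contiguous `(start, end)` ranges of at most
/// `chunk_size` bytes. Hypothetical helper for illustration only.
fn chunk_ranges(file_size: u64, chunk_size: u64) -> Vec<(u64, u64)> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    let mut ranges = Vec::new();
    let mut start = 0u64;
    while start < file_size {
        // The last chunk is clamped to the file size; saturating_add
        // avoids overflow for very large inputs.
        let end = start.saturating_add(chunk_size).min(file_size);
        ranges.push((start, end));
        start = end; // the next range begins exactly where this one ends
    }
    ranges
}

/// Render the bamslice command line for one range.
/// The output file name pattern is an assumption.
fn command_for(input: &str, index: usize, (start, end): (u64, u64)) -> String {
    format!("bamslice -i {input} -s {start} -e {end} -o part_{index}.fastq")
}
```

Each string returned by `command_for` can be handed to a separate process or compute node. The ranges share boundaries but never overlap, so no job needs to know about any other.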
Nextflow Example
See example.nf for a pipeline that pipes bamslice output through fastp for QC/filtering.
nextflow run example.nf --bam input.bam --chunk_size 104857600
How It Works
- BGZF Structure: BAM files use BGZF (Blocked GZIP) - a series of independent compressed blocks
- Block Discovery: Given a start offset, scans forward to find the next valid BGZF block (magic: 0x1f 0x8b 0x08)
- Range Processing: Processes all reads from blocks starting before end_offset
- No Overlap: Each block is processed by exactly one job when using contiguous byte ranges
- FASTQ Output: Converts BAM records to interleaved FASTQ format
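The block discovery step can be sketched as below. This is an illustration based on the BGZF header layout in the SAM/BAM specification, not bamslice's actual code. It assumes the "BC" extra subfield is the first one in the gzip extra field, which is the usual layout. The function names are hypothetical.

```rust
/// First three bytes of a BGZF block: gzip ID1, ID2 and the deflate method.
const MAGIC: [u8; 3] = [0x1f, 0x8b, 0x08];

/// If `data[pos..]` starts with a plausible BGZF header, return the total
/// block size in bytes. Assumes the "BC" subfield comes first in the extra
/// field (an assumption of this sketch).
fn bgzf_block_size(data: &[u8], pos: usize) -> Option<usize> {
    // The fixed part of the header inspected here is 18 bytes long.
    let h = data.get(pos..pos.checked_add(18)?)?;
    let is_gzip = h[0] == MAGIC[0] && h[1] == MAGIC[1] && h[2] == MAGIC[2];
    let has_extra = (h[3] & 0x04) != 0; // FLG.FEXTRA must be set
    let has_bc = h[12] == b'B' && h[13] == b'C' && h[14] == 2 && h[15] == 0;
    if is_gzip && has_extra && has_bc {
        // BSIZE is stored little-endian as (total block size - 1).
        Some(u16::from_le_bytes([h[16], h[17]]) as usize + 1)
    } else {
        None
    }
}

/// Scan forward from `start` and return the first offset whose bytes look
/// like a BGZF block header.
fn find_next_block(data: &[u8], start: usize) -> Option<usize> {
    (start..data.len()).find(|&pos| bgzf_block_size(data, pos).is_some())
}
```

Checking more than the three magic bytes matters because those bytes can also occur inside compressed data. A stricter scanner could additionally try to decompress the candidate block before accepting it.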
Why Byte Ranges?
- No indexing overhead: Don't need to scan the entire file first
- Trivial parallelization: Just choose your start/end offsets (see the Nextflow example)
- No coordination: Each process works independently
- Guaranteed coverage: Contiguous ranges ensure no reads are skipped
- No duplication: Block alignment ensures no reads are processed twice
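The coverage and no-duplication points follow from treating each range as half-open. Going by the argument descriptions, a job takes a block when the block's start offset is at or after the job's start offset and before its end offset. The sketch below assumes that reading and uses hypothetical helper names. It counts how many jobs claim each block.

```rust
/// A block belongs to a job when its start offset lies in `[start, end)`.
/// This ownership rule is an assumption drawn from the argument descriptions.
fn owns_block(block_start: u64, (start, end): (u64, u64)) -> bool {
    start <= block_start && block_start < end
}

/// For each block start offset, count how many of the ranges claim it.
fn owners_per_block(block_starts: &[u64], ranges: &[(u64, u64)]) -> Vec<usize> {
    block_starts
        .iter()
        .map(|&b| ranges.iter().filter(|&&r| owns_block(b, r)).count())
        .collect()
}
```

With contiguous ranges that span the whole file, every count is exactly 1. A block that starts precisely on a boundary goes to the range that begins there, not the one that ends there.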
Testing
Run the test suite to verify correctness:
cargo test
Lint the codebase:
make lint
Development Commands
Run a coverage analysis:
make coverage && open target/coverage/html/index.html
Profile the application with samply (requires cargo install samply):
make perf
Run the performance benchmark:
make bench && open target/criterion/report/index.html
Release a new version:
# First edit Cargo.toml to set the new version, then commit the change
git commit -am 'update package version to vX.Y.Z'
git tag -m 'tag for release' vX.Y.Z
git push --follow-tags
cargo publish
License
AGPLv3 - See LICENSE file for details