To parquet or not to parquet?

As a rule of thumb, a Parquet dataset is sufficient for training sets of around 1 million datapoints. As the dataset grows (e.g., 5 million datapoints or more), switching to a WebDataset can provide up to a 5x speed improvement during training.

We envision OpenBioScience as the foundation for training the next generation of both specialized (e.g., pathology, radiology) and general biomedical models. Training a specialized model requires filtering OpenBioScience down to the relevant subset, a task well suited to Parquet datasets.
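For example, building a pathology-only subset can be expressed as a Parquet filter that is pushed down to the file scan. A minimal sketch using pandas; the shard path and the `domain` metadata column are assumptions, not the actual OpenBioScience schema:

```python
import pandas as pd

# Hypothetical shard path and column names -- adjust to the actual schema.
df = pd.read_parquet(
    "openbioscience/part-0000.parquet",
    engine="pyarrow",
    columns=["image_id", "caption", "domain"],  # read only the needed columns
    filters=[("domain", "==", "pathology")],    # pushed down to the scan
)
print(len(df), "pathology samples")
```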

Therefore, we offer our dataset in both formats:

  • WebDataset – optimized for large-scale model training, especially when streaming data.
  • Parquet Dataset – ideal for performing fast, structured filtering and querying of the dataset.

Both formats support streaming!
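As a sketch of streaming the Parquet side without downloading anything up front, the Hugging Face datasets library can iterate over a remote Parquet dataset; the repository id below is a placeholder, not the dataset's actual location:

```python
from datasets import load_dataset

# Hypothetical repo id -- substitute the actual OpenBioScience dataset.
ds = load_dataset("openbioscience/openbioscience", split="train", streaming=True)
for sample in ds.take(3):  # streams only the first 3 samples over the network
    print(sample.keys())
```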

If you're interested in the tradeoffs between Parquet and WebDataset, see the Serialization section below.

How do we serialize OpenBioScience?

Serialization

WebDataset and Parquet Dataset are two different formats used for storing and handling large datasets. Each has its own advantages and use cases. Let's break down their differences:

1. Data Storage and Format

  • WebDataset: This format stores data as sequences of files, typically packed into .tar archives. Each sample is a group of files sharing a common key, one file per modality, with the extension indicating the content type (e.g., .jpg for an image, .txt for its caption or label). A minimal writing sketch follows this list.

  • Parquet Dataset: This is a columnar storage format designed for high-performance analytical queries. Parquet compresses and encodes the data efficiently.
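To make the file-based layout concrete, here is the promised writing sketch using the webdataset library's ShardWriter; the keys, file names, and extensions are illustrative, not OpenBioScience's actual schema:

```python
import webdataset as wds

# Each write() call emits one sample: files sharing the key "sample000042"
# inside the current .tar shard, one file per modality.
with wds.ShardWriter("shard-%06d.tar", maxcount=10_000) as sink:
    sink.write({
        "__key__": "sample000042",
        "jpg": open("cell.jpg", "rb").read(),  # image bytes -> sample000042.jpg
        "txt": "H&E-stained tissue section",   # caption     -> sample000042.txt
    })
```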

2. Use Cases

  • WebDataset: Ideal for large-scale image datasets or multimodal data stored as individual files. It's particularly useful for streaming data (e.g., during training) without loading the entire dataset into memory, which makes it effective for distributed training scenarios where data is read dynamically from remote locations; a streaming sketch follows this list.

  • Parquet Dataset: Primarily used for storing structured, relational, or tabular data. Parquet is efficient when running analytical queries over large datasets due to its columnar layout. It excels at compression and fast retrieval, making it suitable for databases or batch processing.
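As a sketch of the training-time use case, the webdataset library streams shards (local paths or remote URLs) straight into a PyTorch DataLoader; the shard URL pattern and field names here are assumptions:

```python
import webdataset as wds
from torch.utils.data import DataLoader

# Brace expansion enumerates shards; an s3/https URL works the same way.
url = "shard-{000000..000099}.tar"  # hypothetical shard names
dataset = (
    wds.WebDataset(url)
    .decode("pil")            # decode .jpg entries into PIL images
    .to_tuple("jpg", "txt")   # pick the image and caption fields
)
loader = DataLoader(dataset, batch_size=None, num_workers=4)
for image, caption in loader:
    break  # one streamed sample per iteration when batch_size=None
```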

3. Scalability

  • WebDataset: Scales well in distributed and cloud environments since data can be stored in object storage (e.g., S3, Google Cloud Storage) and streamed without needing to download the entire dataset. It’s designed to handle large, unstructured datasets (e.g., images, audio).

  • Parquet Dataset: Scales well for structured data and is highly performant in big data frameworks. It works well for datasets that need fast columnar access and are analyzed using database-like operations.
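For the structured side, pyarrow.dataset can scan a whole directory of Parquet files in bounded-memory batches, pushing column projection and filters down to the file scan; the directory path and the `domain` column below are assumptions:

```python
import pyarrow.dataset as ds

# Hypothetical directory of Parquet shards.
dataset = ds.dataset("openbioscience_parquet/", format="parquet")
scanner = dataset.scanner(
    columns=["caption"],                       # columnar: read one column only
    filter=ds.field("domain") == "radiology",  # hypothetical metadata column
)
for batch in scanner.to_batches():             # stream record batches lazily
    print(batch.num_rows)
```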

Summary Table:

| Aspect | WebDataset | Parquet Dataset |
| --- | --- | --- |
| Format Type | File-based (e.g., .tar, .jpg, .png) | Columnar (e.g., .parquet) |
| Best Use Case | Image/audio/text streaming | Tabular, structured data |
| Data Access | Streaming, file-by-file | Columnar, selective column retrieval |
| Performance | Efficient for I/O-heavy, unstructured data | Efficient for batch processing, analytics |
| Integration | PyTorch, TensorFlow, ML pipelines | Spark, Hadoop, pandas, Arrow |
| File Size | Many small files or archives | Large, highly compressed files |
| Scalability | Good for distributed/cloud environments | Good for structured, big data ecosystems |

The choice between WebDataset and Parquet depends largely on the type of data you're working with and the specific performance characteristics required by your workflow.