Leading healthcare organizations are turning to generative AI to build applications that can deliver life-saving impact. Among them is the Indian Institute of Technology Madras (IIT Madras) Brain Centre, which is advancing neuroscience research by using AI to generate cellular-level analyses of whole human brains across diverse demographics.
The Centre has developed a unique knowledge exploration framework that combines visual question-answering (VQA) models with large language models (LLMs) to make brain imaging data more accessible to the neuroscience community. This post showcases a proof of concept for how AI can push the limits of neuroscience research. By blending VQA models with LLMs in a multimodal framework, the team has found a way to make brain imaging data easier to understand, helping researchers uncover new insights into brain structure and function and setting the stage for breakthroughs that could lead to life-saving discoveries.
Neuroscience knowledge exploration framework
The knowledge exploration framework leverages neuroscience publications to help researchers link brain imaging data with the latest neuroscience research. With this tool, researchers can explore recent advancements related to brain images and discoveries in particular brain regions, such as the causes of specific conditions seen in the imaging data. They can also track the current state of any neuroscience research area and find answers to related queries.
The framework’s processing pipeline consists of two parts:
- Ingestion: Indexes the latest neuroscience publications into the knowledge base.
- Q&A: Enables users to interact with the knowledge base using queries.

The latest neuroscience publications are downloaded from a publicly available database and processed in the ingestion pipeline. The text is extracted paragraph by paragraph, and a domain-specific, fine-tuned embedding model generates an embedding for each paragraph. These embeddings are then indexed into a vector database.
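The ingestion flow can be sketched as follows. This is a minimal, illustrative version: the `embed` function below is a toy stand-in for the fine-tuned embedding model, and a plain Python list stands in for the production vector database.

```python
import math

def embed(paragraph: str) -> list[float]:
    # Toy stand-in for the domain-specific, fine-tuned embedding model:
    # a normalized bag-of-characters vector. The real pipeline would call
    # the fine-tuned model here instead.
    vec = [0.0] * 26
    for ch in paragraph.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def split_paragraphs(text: str) -> list[str]:
    # Extract the text paragraph by paragraph, as in the ingestion pipeline.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def ingest(publication_text: str,
           index: list[tuple[list[float], str]]) -> None:
    # Embed each paragraph and add it to the (in-memory) vector index.
    for para in split_paragraphs(publication_text):
        index.append((embed(para), para))
```

In production, the `index.append` call would be replaced by an upsert into the vector database.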
The Q&A section is a multimodal retrieval-augmented generation (RAG) pipeline that enables users to interact with both text and images. This section filters user inputs to remove any irrelevant or toxic content from the supplied text. The relevant passages are then retrieved using a hybrid similarity matching approach that combines semantic and keyword similarity. The retrieved passages are subsequently ranked using a reranker model. Finally, the top two paragraphs are passed to a language model for answer generation.
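The hybrid similarity matching step can be sketched as a weighted blend of a semantic score (cosine similarity over embeddings) and a keyword score. The weighting and the Jaccard keyword score below are illustrative assumptions; production systems often use BM25 for the keyword side.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Semantic similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (na * nb)

def keyword_score(query: str, passage: str) -> float:
    # Simple keyword overlap (Jaccard) as an illustrative keyword signal.
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q | p) if q | p else 0.0

def hybrid_retrieve(query: str, query_vec: list[float],
                    index: list[tuple[list[float], str]],
                    alpha: float = 0.5, k: int = 2) -> list[str]:
    # index: list of (embedding, passage). Blend semantic and keyword
    # scores, then keep the top-k passages for the reranker stage.
    scored = [
        (alpha * cosine(query_vec, vec)
         + (1 - alpha) * keyword_score(query, passage), passage)
        for vec, passage in index
    ]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [passage for _, passage in scored[:k]]
```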
Visual question answering and multimodal retrieval
Users can interact with the framework using images of brain regions and ask questions about the displayed image. The framework employs the latest VQA models for biomedical domains, such as Llava-Med, to provide answers. Additionally, the framework enables the retrieval of similar images based on a given image or text. This part of the pipeline is still under development and requires further refinement.
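As a rough illustration of the VQA interaction, models served behind OpenAI-compatible endpoints commonly accept a chat request with the image inlined as a base64 data URI. The helper below builds such a payload; the request shape and the `llava-med` model name are assumptions, and a given Llava-Med deployment may expect a different API.

```python
import base64

def build_vqa_request(image_bytes: bytes, question: str,
                      model: str = "llava-med") -> dict:
    # Build an OpenAI-style multimodal chat request pairing an image with
    # a question. Illustrative only: the exact payload expected by a
    # particular VQA deployment may differ.
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url",
                     "image_url": {
                         "url": f"data:image/png;base64,{image_b64}"}},
                ],
            }
        ],
    }
```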
Using NVIDIA technology to overcome research challenges
The NVIDIA technology stack powers the processing pipeline of the knowledge base framework. Various NVIDIA tools and frameworks have been used to ensure the robustness and performance of this pipeline. Developing the multiple parts of the pipeline posed several challenges, all successfully addressed with the help of NVIDIA technologies.
Improving retrieval accuracy
The framework includes a specialized knowledge base centered on neuroscience publications. Since generic embedding models weren’t originally trained on this kind of data, fine-tuning was needed to improve retrieval accuracy. Manually creating a fine-tuning dataset at scale is challenging and requires input from neuroscience experts, so a synthetic dataset was generated with an LLM. To support large-scale dataset development, fast LLM inference is essential; the Mixtral 8x7B NVIDIA NIM microservice was used to boost inference speed. Fine-tuning the embedding model improved the retrieval accuracy of the top two results by 15.25%.
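Improvements like the top-two gain above are typically measured with a top-k retrieval accuracy metric: the fraction of queries whose relevant passage appears among the first k results. A minimal sketch, assuming each query has a single gold passage id:

```python
def top_k_accuracy(results: dict[str, list[str]],
                   gold: dict[str, str], k: int = 2) -> float:
    # results: query id -> ranked list of retrieved passage ids
    # gold: query id -> the single relevant passage id
    hits = sum(1 for q, ranked in results.items() if gold[q] in ranked[:k])
    return hits / len(results) if results else 0.0
```

Running this metric over the evaluation set before and after fine-tuning yields the percentage-point improvement reported.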
Retrieval accuracy was further enhanced with NVIDIA NeMo Retriever, a set of NIM microservices for information retrieval. The nv-rerank-qa-mistral-4b_v2 NIM microservice was used to rerank retrieved paragraphs, boosting top-2 retrieval accuracy by another 15.27%.
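The rerank step amounts to a second-pass scorer over the retrieved candidates. In the sketch below, a hypothetical `rerank_score` callable stands in for the call to the nv-rerank-qa-mistral-4b NIM microservice, whose actual API is not shown here.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           rerank_score: Callable[[str, str], float],
           top_n: int = 2) -> list[str]:
    # Rescore each candidate passage against the query with the (assumed)
    # reranker model, then keep the top_n for answer generation.
    scored = sorted(candidates,
                    key=lambda c: rerank_score(query, c), reverse=True)
    return scored[:top_n]
```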
User input filtering
To ensure that only relevant queries reach the system, researchers at IIT Madras used NVIDIA NeMo Guardrails for input filtering. They implemented a user input guardrail using the Llama Guard 2 8B language model with a custom prompt tailored to neuroscience. The prompt was tested against a public toxic chat database to assess its ability to block irrelevant and toxic questions, and against neuroscience-specific questions to confirm it accepted relevant ones. Results showed:
- 38% of toxic content was blocked by the default prompt
- 68% of toxic content was blocked by the custom prompt
- 98% of neuroscience-specific questions were accepted by the custom prompt (based on a custom dataset)
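An evaluation like the one above can be expressed as two rates over labeled prompt sets. In this sketch, `is_blocked` is a hypothetical stand-in for the guardrail's verdict (in the actual system, the Llama Guard 2 8B rail in NeMo Guardrails).

```python
from typing import Callable

def evaluate_guardrail(is_blocked: Callable[[str], bool],
                       toxic_prompts: list[str],
                       domain_prompts: list[str]) -> dict[str, float]:
    # toxic_block_rate: fraction of toxic prompts correctly blocked.
    # domain_accept_rate: fraction of in-domain questions that pass through.
    block_rate = sum(map(is_blocked, toxic_prompts)) / len(toxic_prompts)
    accept_rate = sum(not is_blocked(p)
                      for p in domain_prompts) / len(domain_prompts)
    return {"toxic_block_rate": block_rate,
            "domain_accept_rate": accept_rate}
```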
Inferencing speed for answer generation
With multiple users accessing the system simultaneously, generating answers within a reasonable time was challenging. This challenge was overcome using the Llama 3.1 70B NIM microservice running on an NVIDIA DGX A100 server, which delivered inferencing 4x faster than the custom-developed inferencing code.
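NIM microservices expose an OpenAI-compatible API, so the answer-generation step reduces to packing the top retrieved passages into a chat request. The helper below builds such a request; the model name, endpoint, and prompt wording are assumptions to adapt for a given deployment.

```python
def build_generation_request(question: str, passages: list[str],
                             model: str = "meta/llama-3.1-70b-instruct") -> dict:
    # Assemble the answer-generation call: the top-two reranked passages
    # become the grounding context. The returned dict can be passed to an
    # OpenAI-compatible client pointed at the NIM endpoint, e.g.:
    #   client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
    #   client.chat.completions.create(**request)
    context = "\n\n".join(passages)
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Answer using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        "temperature": 0.2,
    }
```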
NVIDIA AI Blueprint for multimodal PDF extraction
The recently launched NVIDIA AI Blueprint for multimodal PDF data extraction can be used to accurately extract relevant information from neuroscience publications, and can potentially be connected to the retrieval pipeline described above.
This workflow is designed to help organizations accurately extract knowledge contained within PDF documents, a common format for storing publications and research information. NVIDIA has created a RAG workflow that uses NeMo Retriever NIM microservices to parse PDFs that include text, images, charts, tables, plots, and other diagrams.
NVIDIA Blueprints provide sample apps built with NVIDIA NIM and partner services, along with reference code, customization guides, and a Helm chart for deployment. These blueprints give scientific developers a jump start on building their own applications using powerful NVIDIA AI tools. They’re flexible and can be adapted and expanded to suit research projects, enabling developers to create sophisticated workflows for complex biomedical literature review tasks.
Examples
This section presents examples of visual question answering and image-to-image retrieval, showcasing how these multimodal AI techniques can be applied in neuroscience research. The first example illustrates how visual question answering can identify specific brain regions from input images. Additional examples show how image-to-image retrieval can be achieved through comparisons of microscopic images of tissue sections, highlighting the ability to retrieve similar samples based on visual characteristics.
Visual question answering example #1
Figure 3 is the input image for an example of visual question answering.
Question: What is the brain region in the image?
Answer: The brain region in the image is the cerebellum, which is a part of the brain that plays an important role in motor control, coordination, and balance.
Visual question answering example #2
Figure 4 is the input image for a second example of visual question answering.
Question: Does this image show the frontal cortex?
Answer: Yes, the image shows the frontal cortex, which is a region of the brain.
Image-to-image retrieval example #1
Figure 5 shows two microscopic images of tissue sections, side-by-side. The left image shows an input tissue sample with a purple stain and some structural detail, while the right image shows a retrieved, similar-looking tissue sample with comparable staining and shape, demonstrating image-to-image retrieval.
Image-to-image retrieval example #2
Figure 6 shows two microscopic images of brain tissue sections, side-by-side. The left image shows an input sample stained in purple, with distinct areas of light and dark textures and several elongated, lighter streaks. The right image shows a retrieved tissue sample with a similar overall shape, staining, and texture patterns, used to demonstrate image-to-image retrieval.
Summary
NVIDIA accelerated computing and AI technologies, including NVIDIA NeMo, NVIDIA NIM, NVIDIA AI Blueprints, and NVIDIA DGX, are helping the IIT Madras Brain Centre advance neuroscience research, opening new avenues for understanding brain structure and function and accelerating research that could lead to life-saving discoveries.