In Computational Pathology (CPath), multi-modal datasets are becoming increasingly important. Today, we have compiled a comprehensive summary of multi-modal CPath datasets. Whether you work on pathology AI research or model training, or simply want to learn about the latest data resources, you can find what you need here!
This article groups the datasets into Image-Text Pair Datasets and Multi-Modal Instruction Datasets. For each dataset it lists the data type, a short description, the staining type, the data source, whether the dataset is publicly available, and whether a large model assisted in generating it.
📸 Image-Text Pair Datasets
1. QUILT
   o Data Type: Slice-Description Pair
   o Description: 437,878 slices, 802,404 descriptions, from 4,475 videos
   o Staining: H&E (H), IHC (I), Others (O)
   o Source: YouTube
   o Public Availability: ✅
   o Large Model Assistance: ❌
2. PathCap
   o Data Type: Slice-Description Pair
   o Description: 208k pathology slice-description pairs
   o Staining: H, I, O
   o Source: PubMed
   o Public Availability: ✅
   o Large Model Assistance: ❌
3. OpenPath
   o Data Type: Slice-Description Pair
   o Description: 208,014 slice-description pairs
   o Staining: I, O
   o Source: WSI-Twitter, Open Source Libraries, Internet
   o Public Availability: ✅
   o Large Model Assistance: ❌
4. CONCH
   o Data Type: Slice-Description Pair
   o Description: 1,170,674 slice-description pairs
   o Staining: H, I
   o Source: PMC-OA
   o Public Availability: ✅
   o Large Model Assistance: ❌
5. HistGen
   o Data Type: Whole Slide Image (WSI)-Report Pair
   o Description: 75,723 pairs
   o Staining: H
   o Source: PMC-OA
   o Public Availability: ✅
   o Large Model Assistance: ❌
6. Mass-3QK
   o Data Type: WSI
   o Description: 335,665 WSIs covering 20 organs
   o Staining: H, M, I
   o Source: GTEx
   o Public Availability: ❌
   o Large Model Assistance: ❌
7. CAPTION-PATCH CAPTION
   o Data Type: Slice-Description Pair
   o Description: 10.5 million pairs
   o Staining: H, I, O
   o Source: TCGA
   o Public Availability: ✅
   o Large Model Assistance: ❌
8. MUNICH
   o Data Type: WSI-Report Pair
   o Description: 15,129 pairs from 6,705 patients
   o Staining: I
   o Source: TCGA
   o Public Availability: ✅
   o Large Model Assistance: ❌
9. PCAPTION-C
   o Data Type: Slice-Description Pair
   o Description: 1,409,058 pairs after cleaning (non-human pathology data and short texts removed)
   o Staining: H, I, O
   o Source: PMC-OA, QUILT-1M
   o Public Availability: ✅
   o Large Model Assistance: ✅
10. ARCHI
   o Data Type: Package-Description Pair
   o Description: 21,186 packages containing 33,480 slice-description pairs
   o Staining: H, I, O
   o Source: PubMed
   o Public Availability: ✅
   o Large Model Assistance: ❌
11. MI-ZERO
   o Data Type: Slice-Description Pair
   o Description: Slice-description pairs from educational resources
   o Staining: H, I, O
   o Source: ARCHI
   o Public Availability: ✅
   o Large Model Assistance: ❌
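Because every entry shares the same fields, the list can be held as structured records and filtered, for example to find publicly available datasets that include H&E staining. The following is a minimal sketch rather than an official tool: the `DatasetEntry` class, its field names, and the `select` helper are hypothetical names introduced for illustration, and only five of the entries above are encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One catalog entry; field names are illustrative, not an official schema."""
    name: str
    data_type: str
    staining: frozenset  # single-letter stain codes exactly as listed in the article
    source: str
    public: bool
    llm_assisted: bool


# A handful of entries copied from the list above (not the full catalog).
CATALOG = [
    DatasetEntry("QUILT", "Slice-Description Pair", frozenset("HIO"),
                 "YouTube", True, False),
    DatasetEntry("OpenPath", "Slice-Description Pair", frozenset("IO"),
                 "WSI-Twitter, Open Source Libraries, Internet", True, False),
    DatasetEntry("HistGen", "WSI-Report Pair", frozenset("H"),
                 "PMC-OA", True, False),
    DatasetEntry("Mass-3QK", "WSI", frozenset("HMI"),
                 "GTEx", False, False),
    DatasetEntry("PCAPTION-C", "Slice-Description Pair", frozenset("HIO"),
                 "PMC-OA, QUILT-1M", True, True),
]


def select(catalog, *, stain=None, public_only=False):
    """Return the names of entries matching the optional stain and availability filters."""
    return [
        entry.name
        for entry in catalog
        if (stain is None or stain in entry.staining)
        and (not public_only or entry.public)
    ]


public_he = select(CATALOG, stain="H", public_only=True)
print(public_he)
```

Storing the stain codes as a set keeps the filter a simple membership test, and keyword-only arguments make each call read like the question being asked.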
✍️ Multi-Modal Instruction Datasets
1. PathInstrucT
   o Data Type: Slice-Level Instructions
   o Description: 180k multi-modal instruction samples
   o Staining: H, I, O
   o Source: YouTube
   o Public Availability: ✅
   o Large Model Assistance: ❌
2. CAPTION-PATCH Instruction
   o Data Type: Slice-Level Instructions
   o Description: 351,871 samples covering description generation, visual question answering (VQA), and classification tasks
   o Staining: H
   o Source: CAPTION-VQA, PathGen, CAPTION-PATCH
   o Public Availability: ✅
   o Large Model Assistance: ✅
3. CAPTI-WSI Instruction
   o Data Type: WSI-Level Instructions
   o Description: 7,312 WSI-level samples
   o Staining: H
   o Source: HistGen
   o Public Availability: ✅
   o Large Model Assistance: ❌
4. QUILT-Instruct
   o Data Type: Question-Answer Pairs (VQA)
   o Description: 107,131 question-answer pairs
   o Staining: H
   o Source: YouTube
   o Public Availability: ✅
   o Large Model Assistance: ❌
5. PathCapQ&A Bench
   o Data Type: Slice-Level Instructions
   o Description: 456,916 instructions, 999,022 question-answer pairs
   o Staining: H
   o Source: PMC-OA, TCGA
   o Public Availability: ✅
   o Large Model Assistance: ✅
6. CLOVER
   o Data Type: Instructions
   o Description: 45,000 question-answer instructions
   o Staining: I
   o Source: PathVQA
   o Public Availability: ✅
   o Large Model Assistance: ❌
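The descriptions give sizes in mixed notations ("180k", "351,871", and "10.5 million" in the image-text list), so comparing datasets by scale first requires normalizing those strings. Here is one possible sketch: the `first_count` helper and its regular expression are assumptions made for illustration, and it deliberately reads only the first number in each description.

```python
import re

# Suffixes used in the descriptions above, mapped to multipliers.
_MULTIPLIERS = {"k": 1_000, "million": 1_000_000}

# A number with optional thousands separators or decimals, then an optional suffix.
_COUNT = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(k\b|million\b)?", re.IGNORECASE)


def first_count(description):
    """Return the first count in a description as an int, or None if there is none."""
    match = _COUNT.search(description)
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    unit = (match.group(2) or "").lower()
    return int(round(value * _MULTIPLIERS.get(unit, 1)))


# Description strings copied from the instruction datasets listed above.
INSTRUCTION_SETS = {
    "PathInstrucT": "180k multi-modal instruction samples",
    "CAPTION-PATCH Instruction": "351,871 samples covering description generation, "
                                 "visual question answering (VQA), and classification tasks",
    "CAPTI-WSI Instruction": "7,312 WSI-level samples",
    "QUILT-Instruct": "107,131 question-answer pairs",
    "PathCapQ&A Bench": "456,916 instructions, 999,022 question-answer pairs",
    "CLOVER": "45,000 question-answer instructions",
}

sizes = {name: first_count(text) for name, text in INSTRUCTION_SETS.items()}

# Largest first, ranked by the first count found in each description.
ranked = sorted(sizes, key=sizes.get, reverse=True)
for name in ranked:
    print(f"{name}: {sizes[name]:,}")
```

Note that the first number is not always the same unit across datasets (instructions, samples, or question-answer pairs), so the ranking is only a rough guide to relative scale.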