What Is the CrawlFeeds Medical Q&A & Articles Dataset?
The CrawlFeeds Medical Q&A & Articles Dataset is a multi-source collection of doctor-answered patient questions and medically reviewed health articles โ sourced from three of the most trusted health platforms on the internet:
All three sources share a unified schema โ no separate parsers needed. Load once, use across every pipeline.
81,475 Verified Medical Records
32 structured fields ยท 80+ medical specialities ยท Q&A pairs + long-form articles in a single dataset
Dataset Coverage & Fields
The dataset uses a consistent schema across all three sources. Key fields include:
| Field | Coverage | Description |
|---|---|---|
uniq_id | 100% | Unique record identifier |
content_type | 100% | "qa" or "article" |
source | 100% | Platform name (iCliniq, HealthTap, WebMD) |
title | 100% | Question title or article headline |
body_text | 100% | Patient question or article body |
answer_text | 100% | Doctor's answer (Q&A records) |
category | 99.7% | Medical category |
author_speciality | 99.7% | Doctor's medical speciality |
abstract | 93.9% | Short content summary |
author_name | 100% | Author or answering doctor |
page_url | 100% | Direct content URL |
language | 100% | Language code |
Additional fields in the full commercial dataset: condition tags, body part classification, medical review status, helpful counts, view counts, and publication dates.
Use Cases
Medical Chatbot Training
Train conversational AI on real doctor-patient dialogue. The natural question format from HealthTap combined with verified iCliniq answers provides ideal training pairs for symptom checker and triage assistant development.
LLM Fine-Tuning
Fine-tune GPT, LLaMA, Mistral, or any open-source LLM on high-quality medical text. The body_text and answer_text fields provide dense, domain-specific language that generic pretraining data lacks.
RAG Pipelines
Build medical knowledge bases for Retrieval-Augmented Generation using WebMD's authoritative articles. Combine with Q&A pairs to answer patient questions with verified source material.
Clinical NLP Research
Use for named entity recognition, relation extraction, medical text classification, and information retrieval benchmarks. The category and author_speciality fields provide ready-made labels.
Symptom Checker Development
Train symptom-to-condition mapping models using HealthTap and iCliniq's symptom-focused questions. Ideal for differential diagnosis systems and patient triage tools.
Medical Education AI
Build study tools, exam preparation assistants, and clinical reasoning trainers using Q&A content spanning 80+ specialities โ from cardiology to psychiatry to paediatrics.
Who Is This Dataset For?
- AI/ML engineers building medical chatbots, symptom checkers, or clinical assistants
- NLP researchers working on healthcare text classification, NER, or question answering
- LLM teams fine-tuning models for medical domain adaptation
- Health tech startups building patient-facing AI products
- Academic researchers studying medical dialogue, health information seeking, or clinical NLP
- Pharma & insurance companies building internal AI tools for claims, triage, or research
How Medical Q&A Data Improves Clinical Decision Support Systems
Clinical decision support systems (CDSS) trained on generic datasets lack the real-world patient complexity that determines accuracy in practice. Medical Q&A datasets solve this gap in eight concrete ways:
Read the full deep dive: 8 Ways Medical Q&A Datasets Improve Clinical Decision Support Systems โ
Get the Dataset
Medical Q&A & Articles Dataset
iCliniq ยท HealthTap ยท WebMD ยท 81,475 records ยท 32 fields ยท CSV & JSON
Blog Resources
Dive deeper into how this dataset is being used in practice:
8 Ways Medical Q&A Datasets Improve Clinical Decision Support Systems
How doctor-verified Q&A data closes the context gap in CDSS โ covering diagnostic accuracy, rare case detection, and continuous learning.
Read Article โMedical Q&A Dataset: Doctor-Answered Health Data from iCliniq, HealthTap & WebMD
Full breakdown of what's in the dataset โ schema, field coverage, sample data, Python loading code, and use case walkthroughs.
Read Article โFrequently Asked Questions
body_text and answer_text fields provide dense, domain-specific medical language suitable for fine-tuning GPT, LLaMA, Mistral, and other open-source LLMs. A Python snippet for loading and filtering the data is available in our dataset guide.
author_speciality field โ from cardiology and dermatology to psychiatry and paediatrics. Custom subsets by speciality are available on request.
Ready to Build Healthcare AI?
Access 81,475 doctor-verified medical records today. Free 1K sample available before you commit.