What sources are included in CrawlFeeds' medical dataset?

The CrawlFeeds Medical Q&A dataset combines data from three sources: iCliniq (50,000+ doctor-answered Q&A pairs), HealthTap (conversational patient questions), and WebMD (medically reviewed health articles). Together they cover 80+ medical specialities.

How many records are in the medical Q&A dataset?

The full dataset contains 81,475 records with 32 data fields, including question text, doctor answers, medical categories, author speciality, source domain, and more.

Is a free sample of the medical dataset available?

Yes. A free 1,000 record sample is available on Hugging Face at huggingface.co/datasets/crawlfeeds/Medical-Health-QA-Articles-Dataset for evaluation before purchase.

Can I get data for a specific medical speciality only?

Yes. Custom subsets filtered by speciality (e.g. cardiology, dermatology, psychiatry), source platform, content type, or medical category are available on request through CrawlFeeds.

Healthcare Datasets: Medical Q&A & Clinical Data

Q: Can I use this dataset to fine-tune an LLM for healthcare?

Yes. The body_text and answer_text fields provide dense, domain-specific medical language suitable for fine-tuning GPT, LLaMA, Mistral, and other open-source LLMs on healthcare domain text.

Q: What format is the medical dataset available in?

The dataset is available in CSV and JSON formats, ready for immediate download. Custom subsets by speciality, source, or content type are available on request.

Q: How do medical Q&A datasets improve clinical decision support systems?

Medical Q&A datasets expose CDSS to real patient cases covering symptom patterns, rare conditions, and treatment reasoning — enabling contextual, accurate clinical recommendations beyond static rule-based logic.

What Is the CrawlFeeds Medical Q&A & Articles Dataset?

The CrawlFeeds Medical Q&A & Articles Dataset is a multi-source collection of doctor-answered patient questions and medically reviewed health articles — sourced from three of the most trusted health platforms on the internet:

iCliniq

50,000+ Q&A pairs answered by licensed doctors across 80+ specialities

HealthTap

Conversational patient questions covering symptoms, diagnoses, and treatments

WebMD

Medically reviewed long-form articles on conditions, drugs, and treatments

All three sources share a unified schema — no separate parsers needed. Load once, use across every pipeline.

81,475 Verified Medical Records

32 structured fields · 80+ medical specialities · Q&A pairs + long-form articles in a single dataset

Dataset Coverage & Fields

The dataset uses a consistent schema across all three sources. Key fields include:

Field	Coverage	Description
`uniq_id`	100%	Unique record identifier
`content_type`	100%	"qa" or "article"
`source`	100%	Platform name (iCliniq, HealthTap, WebMD)
`title`	100%	Question title or article headline
`body_text`	100%	Patient question or article body
`answer_text`	100%	Doctor's answer (Q&A records)
`category`	99.7%	Medical category
`author_speciality`	99.7%	Doctor's medical speciality
`abstract`	93.9%	Short content summary
`author_name`	100%	Author or answering doctor
`page_url`	100%	Direct content URL
`language`	100%	Language code

Additional fields in the full commercial dataset: condition tags, body part classification, medical review status, helpful counts, view counts, and publication dates.

Use Cases

Medical Chatbot Training

Train conversational AI on real doctor-patient dialogue. The natural question format from HealthTap combined with verified iCliniq answers provides ideal training pairs for symptom checker and triage assistant development.

LLM Fine-Tuning

Fine-tune GPT, LLaMA, Mistral, or any open-source LLM on high-quality medical text. The body_text and answer_text fields provide dense, domain-specific language that generic pretraining data lacks.

RAG Pipelines

Build medical knowledge bases for Retrieval-Augmented Generation using WebMD's authoritative articles. Combine with Q&A pairs to answer patient questions with verified source material.

Clinical NLP Research

Use for named entity recognition, relation extraction, medical text classification, and information retrieval benchmarks. The category and author_speciality fields provide ready-made labels.

Symptom Checker Development

Train symptom-to-condition mapping models using HealthTap and iCliniq's symptom-focused questions. Ideal for differential diagnosis systems and patient triage tools.

Medical Education AI

Build study tools, exam preparation assistants, and clinical reasoning trainers using Q&A content spanning 80+ specialities — from cardiology to psychiatry to paediatrics.

Who Is This Dataset For?

AI/ML engineers building medical chatbots, symptom checkers, or clinical assistants
NLP researchers working on healthcare text classification, NER, or question answering
LLM teams fine-tuning models for medical domain adaptation
Health tech startups building patient-facing AI products
Academic researchers studying medical dialogue, health information seeking, or clinical NLP
Pharma & insurance companies building internal AI tools for claims, triage, or research

How Medical Q&A Data Improves Clinical Decision Support Systems

Clinical decision support systems (CDSS) trained on generic datasets lack the real-world patient complexity that determines accuracy in practice. Medical Q&A datasets solve this gap in eight concrete ways:

Improve Diagnostic Accuracy — expose CDSS to thousands of real symptom patterns across demographics and rare conditions.

Provide Contextual Clinical Insights — include patient history, lifestyle factors, and symptom progression missing from structured records.

Enhance Treatment Recommendations — real doctor treatment plans covering medications, alternatives, and risk considerations.

Support Rare & Edge Case Detection — uncommon symptom combinations and specialist insights underrepresented in traditional datasets.

Strengthen Patient Query Understanding — map natural patient language to clinical terms for better digital health platform UX.

Accelerate Clinical Decision-Making — retrieve relevant case patterns quickly, reducing analysis time in time-critical settings.

Enable Continuous Learning — updated discussions on emerging treatments and real-time patient concerns keep CDSS current.

Improve Model Training Quality — verified answers enable supervised learning with reduced bias and improved clinical safety.

Read the full deep dive: 8 Ways Medical Q&A Datasets Improve Clinical Decision Support Systems →

Get the Dataset

Medical Q&A & Articles Dataset

iCliniq · HealthTap · WebMD · 81,475 records · 32 fields · CSV & JSON

Healthcare NLP LLM Fine-Tuning Clinical AI RAG

Access Dataset

Free 1,000 record sample on Hugging Face →

Blog Resources

Dive deeper into how this dataset is being used in practice:

Healthcare AI

8 Ways Medical Q&A Datasets Improve Clinical Decision Support Systems

How doctor-verified Q&A data closes the context gap in CDSS — covering diagnostic accuracy, rare case detection, and continuous learning.

Read Article →

Dataset Guide

Medical Q&A Dataset: Doctor-Answered Health Data from iCliniq, HealthTap & WebMD

Full breakdown of what's in the dataset — schema, field coverage, sample data, Python loading code, and use case walkthroughs.

Read Article →

Frequently Asked Questions

A medical Q&A dataset is a structured collection of patient questions paired with verified doctor answers, sourced from platforms like iCliniq, HealthTap, and WebMD. It is used to train clinical AI models, symptom checkers, medical chatbots, and LLMs for healthcare domain adaptation.

The full dataset contains 81,475 records with 32 data fields, including question text, verified doctor answers, medical categories, author speciality, source domain, content type, and more. The iCliniq subset alone contains over 50,000 doctor-answered Q&A pairs.

Yes. The body_text and answer_text fields provide dense, domain-specific medical language suitable for fine-tuning GPT, LLaMA, Mistral, and other open-source LLMs. A Python snippet for loading and filtering the data is available in our dataset guide.

Yes. A free 1,000 record sample is available on Hugging Face for evaluation before purchase. The sample includes records from all three sources (iCliniq, HealthTap, WebMD) across multiple content types.

The dataset is available in CSV and JSON formats, ready for immediate download. Custom subsets by speciality, source platform, content type, or medical category are available on request. Contact contact@crawlfeeds.com for enterprise or custom requirements.

Yes. The dataset covers 80+ medical specialities via the author_speciality field — from cardiology and dermatology to psychiatry and paediatrics. Custom subsets by speciality are available on request.

This dataset provides high-quality training signal from verified medical professionals. For production clinical applications, we recommend combining it with additional validated clinical datasets and human expert review of model outputs before deployment.

Unlike exam-style benchmarks (MedQA, MedMCQA), this dataset contains real doctor-patient conversations — the kind of natural, symptom-driven dialogue your model will encounter in production. It combines Q&A pairs and long-form medical articles in a single unified schema, with source attribution and specialty filtering not available in most public alternatives.

Ready to Build Healthcare AI?

Access 81,475 doctor-verified medical records today. Free 1K sample available before you commit.

Browse Datasets Request Custom Data

Healthcare & Medical Datasets

Healthcare Datasets