What We Offer
CrawlFeeds provides large-scale text datasets for AI, NLP, analytics, and search applications. This includes cleaned article corpora, blog posts, comments, discussion threads, and other web text extracted from public websites and delivered in analysis-ready formats.
One of our key text assets is a 10M+ Medium articles dataset, cleaned and available in CSV format. We also support custom article and text extraction from other websites, publishers, blogs, forums, and niche content platforms.
Large-scale text, already structured
Use article body, title, metadata, tags, author info, publication data, and comments without building your own crawling and cleaning pipeline.
Text Dataset Types We Provide
Article Datasets
News articles, blog posts, editorial content, tutorials, documentation pages, and long-form content with metadata and cleaned text body.
Comment Datasets
Article comments, blog comments, user replies, nested discussion threads, and engagement data for conversation analysis and moderation models.
Forum & Community Data
Public discussion boards, question-answer communities, niche forums, and threaded conversations for topic modelling and retrieval systems.
Publisher & Blog Archives
Historical publisher content, blog archives, category-specific article collections, and domain-specific text corpora across any public website.
AI & NLP Use Cases
LLM Fine-Tuning
Fine-tune large language models on domain-specific article corpora, writing styles, editorial formats, and user conversation data.
RAG & Semantic Search
Use cleaned article documents and metadata to build retrieval-augmented generation pipelines, enterprise search, and vector search systems.
Text Classification
Train classifiers for topic detection, content categorisation, spam detection, editorial tagging, or comment moderation.
Trend & Content Analysis
Track topic emergence, author behaviour, editorial patterns, content performance, and engagement signals across millions of articles.
Summarisation & QA
Train models for abstractive summarisation, extractive QA, answer generation, and document understanding using long-form article corpora.
Conversation Mining
Analyse comments and reply chains for stance detection, toxicity analysis, sentiment, community dynamics, and feedback mining.
10M+ Cleaned Medium Articles
CrawlFeeds provides a ready-made Medium article dataset with 10M+ cleaned records in CSV format. It suits teams that need a large-scale text corpus for training and analysis without building a custom crawl pipeline.
Typical fields available
- Article title and subtitle
- Cleaned article body text
- Author and publication name
- Tags / topics
- Claps, responses, reading time
- Publish date and canonical URL
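As a sketch, records like these can be streamed straight from the delivered CSV with the standard library; the column names and tag separator below are illustrative, not the exact headers in the dataset.

```python
import csv
import io

# Hypothetical headers -- the actual delivered CSV may name columns differently.
SAMPLE = (
    "title,author,tags,claps,reading_time,url\n"
    '"Intro to RAG","Jane Doe","nlp;search",120,7,https://medium.com/example\n'
)

def load_articles(fp):
    """Yield one dict per article row, with typed claps and split tags."""
    for row in csv.DictReader(fp):
        row["tags"] = row["tags"].split(";")
        row["claps"] = int(row["claps"])
        yield row

articles = list(load_articles(io.StringIO(SAMPLE)))
```

Streaming row by row this way keeps memory flat even on a 10M-record file, since nothing forces the whole corpus into RAM at once.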
Common use cases
- LLM pretraining and fine-tuning
- Content recommendation research
- Topic modelling and clustering
- Editorial analytics and trend tracking
- Long-form summarisation benchmarks
- Embedding and semantic search evaluation
Custom Website Text Extraction
Need article or comment data from another website? CrawlFeeds can extract text data from any public website, including blogs, news portals, publisher archives, documentation sites, and forum-style communities.
We can also clean boilerplate, remove duplicate entries, preserve source URLs, and structure the content for downstream NLP pipelines.
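A minimal sketch of the deduplication step: hash each record's normalised body text and keep only the first occurrence, so the surviving record retains its original source URL. The field names are assumptions for illustration, not the exact schema of a delivery.

```python
import hashlib

def dedupe(records):
    """Drop records whose cleaned body text already appeared,
    keeping the first occurrence and its source URL."""
    seen, unique = set(), []
    for rec in records:
        # Normalise whitespace and case before hashing so trivial
        # variants of the same article collapse to one key.
        key = hashlib.sha256(rec["text"].strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

records = [
    {"url": "https://a.example/post", "text": "Same body."},
    {"url": "https://b.example/copy", "text": "same body. "},  # duplicate after normalisation
    {"url": "https://c.example/new", "text": "Different body."},
]
deduped = dedupe(records)
```

Hashing the text rather than comparing URLs catches the common case of the same article syndicated under several addresses.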
Delivery & Processing Formats
CSV / Excel
Best for analytics teams and quick dataset inspection.
JSON / JSONL
Best for programmatic ingestion and model training pipelines.
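JSONL deliveries hold one JSON object per line, which makes them easy to write and to stream back record by record. A small round-trip sketch, with illustrative field names:

```python
import io
import json

def write_jsonl(records, fp):
    """Write one JSON object per line."""
    for rec in records:
        fp.write(json.dumps(rec, ensure_ascii=False) + "\n")

def read_jsonl(fp):
    """Stream records back, skipping any blank lines."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)

buf = io.StringIO()
write_jsonl([{"title": "Intro to RAG", "tags": ["nlp"]}], buf)
buf.seek(0)
rows = list(read_jsonl(buf))
```

Because each line is independent, a training pipeline can shard, shuffle, or resume a JSONL file without parsing the whole thing first.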
Chunked Documents
Prepared for embeddings, vector databases, and RAG systems.
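One common way to prepare such chunks, sketched here under assumed parameters: split each article body into overlapping word windows and carry the source URL along so retrieved chunks stay attributable. Window and overlap sizes are illustrative defaults, not the values used in actual deliveries.

```python
def chunk_article(text, url, size=200, overlap=40):
    """Split an article body into overlapping word windows,
    attaching the source URL to every chunk for attribution."""
    words = text.split()
    chunks = []
    step = size - overlap  # advance by less than the window to overlap chunks
    for start in range(0, max(len(words) - overlap, 1), step):
        piece = " ".join(words[start:start + size])
        chunks.append({"text": piece, "source": url})
    return chunks

# A 500-word article with the defaults yields three overlapping chunks.
text = " ".join(str(i) for i in range(500))
chunks = chunk_article(text, "https://example.com/a")
```

The overlap keeps sentences that straddle a window boundary retrievable from at least one chunk, which matters for RAG answer quality.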
Need Large-Scale Text Data?
Use our 10M+ Medium articles dataset or request custom extraction for articles, comments, blogs, and other text-heavy websites.