What We Offer

CrawlFeeds provides large-scale text datasets for AI, NLP, analytics, and search applications. This includes cleaned article corpora, blog posts, comments, discussion threads, and other web text extracted from public websites and delivered in analysis-ready formats.

One of our key text assets is a 10M+ Medium articles dataset, cleaned and available in CSV format. We also support custom article and text extraction from other websites, publishers, blogs, forums, and niche content platforms.

Large-scale text, already structured

Use article body, title, metadata, tags, author info, publication data, and comments without building your own crawling and cleaning pipeline.


Text Dataset Types We Provide

Article Datasets

News articles, blog posts, editorial content, tutorials, documentation pages, and long-form content with metadata and cleaned text body.

Comment Datasets

Article comments, blog comments, user replies, nested discussion threads, and engagement data for conversation analysis and moderation models.

Forum & Community Data

Public discussion boards, question-answer communities, niche forums, and threaded conversations for topic modelling and retrieval systems.

Publisher & Blog Archives

Historical publisher content, blog archives, category-specific article collections, and domain-specific text corpora across any public website.


AI & NLP Use Cases

LLM Fine-Tuning

Fine-tune large language models on domain-specific article corpora, writing styles, editorial formats, and user conversation data.

RAG & Semantic Search

Use cleaned article documents and metadata to build retrieval-augmented generation pipelines, enterprise search, and vector search systems.

Text Classification

Train classifiers for topic detection, content categorisation, spam detection, editorial tagging, or comment moderation.

Trend & Content Analysis

Track topic emergence, author behaviour, editorial patterns, content performance, and engagement signals across millions of articles.

Summarisation & QA

Train models for abstractive summarisation, extractive QA, answer generation, and document understanding using long-form article corpora.

Conversation Mining

Analyse comments and reply chains for stance detection, toxicity analysis, sentiment, community dynamics, and feedback mining.


10M+ Cleaned Medium Articles

CrawlFeeds already provides a large Medium article dataset with 10M+ cleaned records in CSV format. It suits teams that need a ready-made, large-scale text corpus for training and analysis without building a custom crawl pipeline.

Typical fields available
  • Article title and subtitle
  • Cleaned article body text
  • Author and publication name
  • Tags / topics
  • Claps, responses, reading time
  • Publish date and canonical URL

Common use cases
  • LLM pretraining and fine-tuning
  • Content recommendation research
  • Topic modelling and clustering
  • Editorial analytics and trend tracking
  • Long-form summarisation benchmarks
  • Embedding and semantic search evaluation
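As an illustration, a CSV corpus like this can be loaded and filtered in a few lines with pandas. The column names below ("title", "text", "reading_time") are hypothetical; actual headers depend on the delivery.

```python
import pandas as pd

def load_articles(path, min_reading_time=0):
    """Load a CSV article export, dropping rows with an empty body.

    Column names ("text", "reading_time") are assumed for this sketch;
    real exports may use different headers.
    """
    df = pd.read_csv(path)
    # Rows with an empty body parse as NaN and are dropped.
    df = df.dropna(subset=["text"])
    # Optionally keep only longer reads.
    df = df[df["reading_time"] >= min_reading_time]
    return df.reset_index(drop=True)
```

Typical usage would be something like `articles = load_articles("medium_articles.csv", min_reading_time=3)` before tokenisation or embedding.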

Custom Website Text Extraction

Need article or comment data from another website? CrawlFeeds can extract text data from any public website, including blogs, news portals, publisher archives, documentation sites, and forum-style communities.

Website: Any public source
Date Range: Historical or fresh crawl
Custom Fields: Title, body, tags, comments
Export: CSV, JSON, JSONL, cloud

We can also strip boilerplate, remove duplicate entries, preserve source URLs, and structure the content for downstream NLP pipelines.


Delivery & Processing Formats

CSV / Excel

Best for analytics teams and quick dataset inspection.

JSON / JSONL

Best for programmatic ingestion and model training pipelines.
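For JSONL deliveries, ingestion is usually a line-by-line stream rather than a single load; a minimal sketch (record fields are whatever your export contains):

```python
import json

def iter_jsonl(path):
    """Yield one record per non-empty line of a JSONL export."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)
```

Streaming like this keeps memory flat even for multi-million-record files, which is why JSONL is the common choice for training pipelines.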

Chunked Documents

Prepared for embeddings, vector databases, and RAG systems.
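If you prefer to chunk raw text yourself before embedding, a simple word-window splitter with overlap is a common starting point. The sizes below are illustrative only, assuming overlap is smaller than the chunk size.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into overlapping word-window chunks for embedding.

    Assumes overlap < chunk_size; sizes are illustrative defaults.
    """
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the tail
    return chunks
```

Overlapping windows reduce the chance that a fact straddling a chunk boundary is lost at retrieval time.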


Frequently Asked Questions

What types of text data does CrawlFeeds provide?

CrawlFeeds provides article datasets, blog archives, comment datasets, forum data, and other web text corpora. This includes 10M+ cleaned Medium articles in CSV format plus custom extraction from other public websites.

Is the Medium dataset cleaned and ready for analysis?

Yes. The Medium dataset is cleaned and structured for analysis, with text body and available metadata such as title, author, tags, publication, engagement signals, and publish date.

Can you provide comment and discussion data?

Yes. CrawlFeeds can extract article comments, reply chains, blog comments, and discussion thread data depending on the source site.

Can you extract text from a specific website?

Yes. We can extract text data from any public website, blog, publisher archive, documentation portal, or forum, with custom crawl rules and field selection.

Are these datasets suitable for AI and NLP work?

Yes. Text datasets from CrawlFeeds are suitable for LLM fine-tuning, retrieval-augmented generation, semantic search, summarisation, classification, and content analytics pipelines.

What delivery formats do you support?

We deliver CSV, JSON, JSONL, Excel, and cloud exports. We can also prepare chunked text for embedding pipelines and vector database ingestion.

Do you support multilingual sources?

Yes. If the source site is multilingual, CrawlFeeds can preserve text content, language context, and metadata so the final dataset can be filtered by language.

Can I request a custom dataset?

Yes. Share the website, topic, date range, and required fields, and CrawlFeeds can prepare a custom article or comment dataset for your research, analytics, or AI workflow.

Need Large-Scale Text Data?

Use our 10M+ Medium articles dataset or request custom extraction for articles, comments, blogs, and other text-heavy websites.