What We Offer
CrawlFeeds provides large-scale text datasets for AI, NLP, analytics, and search applications. This includes cleaned article corpora, blog posts, comments, discussion threads, and other web text extracted from public websites and delivered in analysis-ready formats.
One of our key text assets is a 10M+ Medium articles dataset, cleaned and available in CSV format. We also support custom article and text extraction from other websites, publishers, blogs, forums, and niche content platforms.
Large-scale text, already structured
Use article body, title, metadata, tags, author info, publication data, and comments without building your own crawling and cleaning pipeline.
Text Dataset Types We Provide
Article Datasets
News articles, blog posts, editorial content, tutorials, documentation pages, and long-form content with metadata and cleaned text body.
Comment Datasets
Article comments, blog comments, user replies, nested discussion threads, and engagement data for conversation analysis and moderation models.
Forum & Community Data
Public discussion boards, question-answer communities, niche forums, and threaded conversations for topic modelling and retrieval systems.
Publisher & Blog Archives
Historical publisher content, blog archives, category-specific article collections, and domain-specific text corpora across any public website.
AI & NLP Use Cases
LLM Fine-Tuning
Fine-tune large language models on domain-specific article corpora, writing styles, editorial formats, and user conversation data.
RAG & Semantic Search
Use cleaned article documents and metadata to build retrieval-augmented generation pipelines, enterprise search, and vector search systems.
Text Classification
Train classifiers for topic detection, content categorisation, spam detection, editorial tagging, or comment moderation.
Trend & Content Analysis
Track topic emergence, author behaviour, editorial patterns, content performance, and engagement signals across millions of articles.
Summarisation & QA
Train models for abstractive summarisation, extractive QA, answer generation, and document understanding using long-form article corpora.
Conversation Mining
Analyse comments and reply chains for stance detection, toxicity analysis, sentiment, community dynamics, and feedback mining.
10M+ Cleaned Medium Articles
CrawlFeeds provides a ready-made Medium article dataset with 10M+ cleaned records in CSV format. It suits teams that need a large-scale text corpus for training and analysis without building a custom crawl pipeline.
Typical fields available
- Article title and subtitle
- Cleaned article body text
- Author and publication name
- Tags / topics
- Claps, responses, reading time
- Publish date and canonical URL
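As a sketch, records like these can be streamed straight from the delivered CSV with the standard library; the column names and tag separator below are illustrative, not the exact headers in the dataset.

```python
import csv
import io

# Hypothetical headers -- the actual delivered CSV may name columns differently.
SAMPLE = (
    "title,author,tags,claps,reading_time,url\n"
    '"Intro to RAG","Jane Doe","nlp;search",120,7,https://medium.com/example\n'
)

def load_articles(fp):
    """Yield one dict per article row, with typed claps and split tags."""
    for row in csv.DictReader(fp):
        row["tags"] = row["tags"].split(";")
        row["claps"] = int(row["claps"])
        yield row

articles = list(load_articles(io.StringIO(SAMPLE)))
```

Streaming row by row this way keeps memory flat even on a 10M-record file, since nothing forces the whole corpus into RAM at once.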
Common use cases
- LLM pretraining and fine-tuning
- Content recommendation research
- Topic modelling and clustering
- Editorial analytics and trend tracking
- Long-form summarisation benchmarks
- Embedding and semantic search evaluation
Custom Website Text Extraction
Need article or comment data from another website? CrawlFeeds can extract text data from any public website, including blogs, news portals, publisher archives, documentation sites, and forum-style communities.
We can also clean boilerplate, remove duplicate entries, preserve source URLs, and structure the content for downstream NLP pipelines.
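A minimal sketch of the deduplication step: hash each record's normalised body text and keep only the first occurrence, so the surviving record retains its original source URL. The field names are assumptions for illustration, not the exact schema of a delivery.

```python
import hashlib

def dedupe(records):
    """Drop records whose cleaned body text already appeared,
    keeping the first occurrence and its source URL."""
    seen, unique = set(), []
    for rec in records:
        # Normalise whitespace and case before hashing so trivial
        # variants of the same article collapse to one key.
        key = hashlib.sha256(rec["text"].strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

records = [
    {"url": "https://a.example/post", "text": "Same body."},
    {"url": "https://b.example/copy", "text": "same body. "},  # duplicate after normalisation
    {"url": "https://c.example/new", "text": "Different body."},
]
deduped = dedupe(records)
```

Hashing the text rather than comparing URLs catches the common case of the same article syndicated under several addresses.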
Delivery & Processing Formats
CSV / Excel
Best for analytics teams and quick dataset inspection.
JSON / JSONL
Best for programmatic ingestion and model training pipelines.
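JSONL deliveries hold one JSON object per line, which makes them easy to write and to stream back record by record. A small round-trip sketch, with illustrative field names:

```python
import io
import json

def write_jsonl(records, fp):
    """Write one JSON object per line."""
    for rec in records:
        fp.write(json.dumps(rec, ensure_ascii=False) + "\n")

def read_jsonl(fp):
    """Stream records back, skipping any blank lines."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)

buf = io.StringIO()
write_jsonl([{"title": "Intro to RAG", "tags": ["nlp"]}], buf)
buf.seek(0)
rows = list(read_jsonl(buf))
```

Because each line is independent, a training pipeline can shard, shuffle, or resume a JSONL file without parsing the whole thing first.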
Chunked Documents
Prepared for embeddings, vector databases, and RAG systems.
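One common way to prepare such chunks, sketched here under assumed parameters: split each article body into overlapping word windows and carry the source URL along so retrieved chunks stay attributable. Window and overlap sizes are illustrative defaults, not the values used in actual deliveries.

```python
def chunk_article(text, url, size=200, overlap=40):
    """Split an article body into overlapping word windows,
    attaching the source URL to every chunk for attribution."""
    words = text.split()
    chunks = []
    step = size - overlap  # advance by less than the window to overlap chunks
    for start in range(0, max(len(words) - overlap, 1), step):
        piece = " ".join(words[start:start + size])
        chunks.append({"text": piece, "source": url})
    return chunks

# A 500-word article with the defaults yields three overlapping chunks.
text = " ".join(str(i) for i in range(500))
chunks = chunk_article(text, "https://example.com/a")
```

The overlap keeps sentences that straddle a window boundary retrievable from at least one chunk, which matters for RAG answer quality.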
Need Large-Scale Text Data?
Use our 10M+ Medium articles dataset or request custom extraction for articles, comments, blogs, and other text-heavy websites.