Source: bbc.co.uk · Format: JSON
Description
This dataset contains more than 1 million BBC news articles, along with every data point extracted from each article page. The articles were first collected in 2021 and cover all categories present on the BBC site.
This news dataset is well suited to text classification, finding popular categories, NLP, and other research purposes.
The dataset is available in JSON format.
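Because the listing says only that the data is JSON, the exact file layout is not known here. The sketch below is one way to read records under either of two assumed layouts: a single JSON array, or JSON Lines with one object per line. The function name `iter_articles` and the `title` field are placeholders for illustration, not part of the dataset's documented schema.

```python
import io
import json
from typing import Any, Iterator, TextIO


def iter_articles(stream: TextIO) -> Iterator[dict[str, Any]]:
    """Yield article records from a JSON array or a JSON Lines stream."""
    # Peek at the first non-whitespace character to guess the layout.
    first = stream.read(1)
    while first and first.isspace():
        first = stream.read(1)
    stream.seek(0)
    if first == "[":
        # Assumed layout 1: one JSON array. This loads the whole file
        # into memory, which may be heavy for a million articles.
        yield from json.load(stream)
    else:
        # Assumed layout 2: JSON Lines, one object per non-empty line.
        for line in stream:
            line = line.strip()
            if line:
                yield json.loads(line)


# Tiny in-memory samples; "title" is a placeholder field name.
as_array = io.StringIO('[{"title": "A"}, {"title": "B"}]')
as_lines = io.StringIO('{"title": "A"}\n\n{"title": "B"}\n')

array_titles = [a["title"] for a in iter_articles(as_array)]
line_titles = [a["title"] for a in iter_articles(as_lines)]
```

With a downloaded file, the same function could be called on a handle opened with `open(path, encoding="utf-8")`. Checking the free sample first would confirm which layout, if either, applies.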
Data fields
Use cases
Analyze how news topics evolve across categories and regions over time.
Build personalized news recommendation engines using article metadata and categories.
Discover emerging themes and recurring discussions across millions of news articles.
Extract people, organizations, locations, and events from large-scale news content.
Train AI models to generate concise summaries from long-form news articles.
Compare reporting patterns across categories and analyze narrative trends.
Build semantic search engines using article content, tags, and metadata.
Study major events and reporting trends throughout 2021.
Use high-quality editorial content for NLP, LLM fine-tuning, and text understanding tasks.
Create structured relationships between entities, events, categories, and locations.
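As a rough illustration of the first use case, the sketch below counts articles per month and category. It assumes each record carries a category and an ISO 8601 publication date; the field names `category` and `published_date` are hypothetical, since the actual field list is not given here, and the sample records are made up purely to exercise the function.

```python
from collections import Counter
from datetime import datetime
from typing import Any, Iterable


def monthly_category_counts(articles: Iterable[dict[str, Any]]) -> Counter:
    """Count articles per (YYYY-MM, category) pair."""
    counts: Counter = Counter()
    for article in articles:
        raw_date = article.get("published_date")  # hypothetical field name
        category = article.get("category")        # hypothetical field name
        if not raw_date or not category:
            continue  # skip records missing either value
        try:
            published = datetime.fromisoformat(raw_date)
        except (ValueError, TypeError):
            continue  # date is not in the assumed ISO 8601 form
        counts[(published.strftime("%Y-%m"), category)] += 1
    return counts


# Made-up records for illustration only; not taken from the dataset.
sample = [
    {"published_date": "2021-01-05T09:30:00", "category": "Business"},
    {"published_date": "2021-01-20T14:00:00", "category": "Business"},
    {"published_date": "2021-02-11T08:15:00", "category": "Technology"},
    {"published_date": "not a date", "category": "Technology"},
]
counts = monthly_category_counts(sample)
```

Records with a missing or unparseable date are skipped rather than guessed at, so the totals may undercount if the real date format differs from the assumed one.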
Frequently asked questions
The dataset covers all major news categories available on the BBC website during the 2021 collection period.
Yes. Both cleaned content and the original raw extracted content are included, allowing different processing approaches.
Yes. Publication dates enable chronological analysis of events, topics, and reporting patterns during 2021.
The dataset provides both the original extracted article text and a cleaned version, allowing users to choose the format that best suits their analysis.
Yes. Researchers can examine reporting patterns, category distribution, regional coverage, and editorial trends across more than one million articles.
Dataset highlights