Datasets
Asosiy manbalar
Kaggle
- kaggle.com/datasets — minglab dataset
- kaggle.com/competitions — competitions (real problems)
- Klassik ML uchun #1 manba
- Notebook'lar bilan birga
Hugging Face Datasets
- huggingface.co/datasets — NLP/CV/Audio
- 100,000+ dataset
- Python API bilan oson yuklash
UCI ML Repository
- archive.ics.uci.edu/ml — klassik datasets
- Akademik standart
Google Dataset Search
- datasetsearch.research.google.com
- Universal search engine
Papers With Code
- paperswithcode.com/datasets — paper'larda ishlatilgan
data.gov / data.gov.uz
- data.gov.uz — O'zbekiston open data
- Lokal kontekst uchun
Klassik ML / Tabular
Boshlovchilar uchun
- Iris — 3 class classification (150 sample)
- Titanic — binary classification
- California Housing — regression
- MNIST — image classification (handwritten digits)
- Wine Quality — regression/classification
- Adult Income — classification
Real-world tabular
- Telco Customer Churn(Kaggle) — churn prediction
- House Prices(Kaggle Ames Housing) — regression
- Credit Card Fraud Detection — imbalanced classification
- NYC Taxi Trips — time series + geo
- Olist E-commerce(Kaggle) — multi-table
- LendingClub Loans — credit risk
- Movie Lens — recommendations
Computer Vision
Image classification
- CIFAR-10, CIFAR-100 — 32x32 color images
- ImageNet — 1000 classes (akademik)
- Fashion-MNIST — kiyim turlari
- Tiny ImageNet — kichikroq versiya
- Caltech 101/256 — har xil obyektlar
- Stanford Cars — avtomobillar
- Oxford Flowers — 102 gul turi
- Food-101 — ovqat rasmlari
Object detection
- COCO — eng katta detection dataset
- Pascal VOC — klassik
- Open Images(Google) — 9M images
- KITTI — autonomous driving
- WIDER FACE — face detection
- LVIS — long-tail detection
Segmentation
- Cityscapes — urban scene
- ADE20K — scene parsing
- PASCAL VOC Segmentation
- Mapillary Vistas — street view
Medical imaging
- MURA — bone X-rays
- CheXpert — chest X-rays
- ISIC — skin lesions
- Kaggle Medical Image Datasets
Specialized
- PlantVillage — plant diseases
- DeepFashion — fashion images
- CelebA — face attributes
- LFW — face recognition
NLP
Text classification
- IMDB Reviews — sentiment
- Yelp Reviews — sentiment
- AG News — topic classification
- SST-2 — sentiment
- 20 Newsgroups — topic
NER
- CoNLL-2003 — English NER
- OntoNotes 5.0 — multi-genre
Question Answering
- SQuAD 2.0 — extractive QA
- Natural Questions — Google
- TriviaQA
Translation
- WMT — annual translation
- OPUS — parallel corpora
Summarization
- CNN/Daily Mail — news summarization
- XSum — extreme summarization
- Reddit TIFU — informal
Multi-task / Modern
- GLUE / SuperGLUE — NLU benchmark
- MMLU — knowledge benchmark
- HellaSwag — common sense
Multilingual
- OSCAR — multilingual web
- mC4 — Common Crawl
- CC-100 — 100+ tillar
- FLORES — translation benchmark
O'zbek tilidagi datasetlar
Resmiy
- data.gov.uz — open data
- stat.uz — statistika
- lex.uz — qonun hujjatlari
Web scraping mumkin
- uz.wikipedia.org — Wikipedia dump
- daryo.uz, kun.uz, gazeta.uz — yangiliklar (legal/personal use)
- Telegram channellar — public channels (with respect)
HuggingFace
- HuggingFace'da
language:uzqidiring - OSCAR-uz — uzbek web corpus
- mC4-uz — Common Crawl
Audio (speech)
- Common Voice — Uzbek — Mozilla project
- Voxlingua107 — language identification
Audio / Speech
Speech recognition
- LibriSpeech — English audiobooks
- Common Voice(Mozilla) — multilingual
- VoxPopuli — European Parliament
- TED-LIUM — TED talks
Music
- GTZAN — genre classification
- FMA (Free Music Archive)
- MagnaTagATune — auto-tagging
Environmental
- UrbanSound8K — city sounds
- ESC-50 — environmental sounds
Video
- Kinetics-400/700 — action recognition
- UCF101 — action recognition
- YouTube-8M — large scale
- Something-Something — temporal reasoning
Time Series
Finance
- Yahoo Finance(yfinance library) — stocks
- Quandl — financial data
- Kaggle Stock Market
Healthcare
- MIT-BIH — ECG signals
- MIMIC — clinical (access kerak)
IoT / Sensors
- UCI HAR — human activity
- WESAD — stress detection
Weather / Environment
- NOAA — climate
- NASA Earth Data
Multimodal
- MS COCO Captions — image + text
- Flickr30k — image + text
- Visual Question Answering (VQA)
- AudioSet — video + audio
- HowTo100M — instruction videos
LLM / RAG
Documentation
- Wikipedia dump — keng knowledge base
- arxiv — research papers
- GitHub repos — code docs
- StackExchange dumps — Q&A
Conversation
- Anthropic HH-RLHF — preferences
- ShareGPT — real ChatGPT logs
- OpenAssistant — public conversations
Instruction
- Alpaca — Stanford (52K)
- Dolly — Databricks (15K)
- Tulu — AllenAI
🎯 Qaysi dataset qachon?
Yangi mavzuni o'rganishda
- **Boshlovchi:**Iris, Titanic, MNIST
- **Klassik ML:**Telco Churn, House Prices
- **DL boshlash:**CIFAR-10, IMDB
- **CV:**Pretrained datasets + custom
Portfolio loyiha uchun
- Original — o'zingiz to'plang (telefon, web scraping)
- Real-world — Kaggle competitions
- Lokal — O'zbekiston open data
Production simulation
- Streaming — Kafka simulated data
- Live — public APIs (Twitter, Reddit)
- Synthetic —
make_classification, faker library
Tools
Dataset library'lar
# scikit-learn datasets
from sklearn.datasets import load_iris, fetch_california_housing
# HuggingFace datasets
from datasets import load_dataset
ds = load_dataset("squad")
# torchvision
from torchvision import datasets
mnist = datasets.MNIST(root="./", train=True, download=True)
# Kaggle API
!pip install kaggle
!kaggle competitions download -c titanic
Annotation tools
- Label Studio — open source
- CVAT — CV annotation
- Roboflow — CV + datasets management
- Prodigy — NLP annotation
- Doccano — text annotation (open source)
Legal va ethik
Tekshirib qo'ying
- License — MIT, Apache, CC-BY, CC-BY-SA, va h.k.
- Commercial use — bepulmi yoki yo'qmi
- Attribution — manbani ko'rsatish kerakmi
- PII — shaxsiy ma'lumotlar bormi
Best practices
- Bias check — dataset balanced/representative emi?
- Privacy — anonimization
- Documentation — datasheet, model card
- Consent — yig'ilgan ma'lumotlar uchun
Cheatsheets ga o'tish.