Datasets

Asosiy manbalar

Kaggle

  • kaggle.com/datasets — minglab dataset
  • kaggle.com/competitions — competitions (real problems)
  • Klassik ML uchun #1 manba
  • Notebook'lar bilan birga

Hugging Face Datasets

  • huggingface.co/datasets — NLP/CV/Audio
  • 100,000+ dataset
  • Python API bilan oson yuklash

UCI ML Repository

  • archive.ics.uci.edu/ml — klassik datasets
  • Akademik standart
  • datasetsearch.research.google.com
  • Universal search engine

Papers With Code

  • paperswithcode.com/datasets — paper'larda ishlatilgan

data.gov / data.gov.uz

  • data.gov.uz — O'zbekiston open data
  • Lokal kontekst uchun

Klassik ML / Tabular

Boshlovchilar uchun

  • Iris — 3 class classification (150 sample)
  • Titanic — binary classification
  • California Housing — regression
  • MNIST — image classification (handwritten digits)
  • Wine Quality — regression/classification
  • Adult Income — classification

Real-world tabular

  • Telco Customer Churn(Kaggle) — churn prediction
  • House Prices(Kaggle Ames Housing) — regression
  • Credit Card Fraud Detection — imbalanced classification
  • NYC Taxi Trips — time series + geo
  • Olist E-commerce(Kaggle) — multi-table
  • LendingClub Loans — credit risk
  • Movie Lens — recommendations

Computer Vision

Image classification

  • CIFAR-10, CIFAR-100 — 32x32 color images
  • ImageNet — 1000 classes (akademik)
  • Fashion-MNIST — kiyim turlari
  • Tiny ImageNet — kichikroq versiya
  • Caltech 101/256 — har xil obyektlar
  • Stanford Cars — avtomobillar
  • Oxford Flowers — 102 gul turi
  • Food-101 — ovqat rasmlari

Object detection

  • COCO — eng katta detection dataset
  • Pascal VOC — klassik
  • Open Images(Google) — 9M images
  • KITTI — autonomous driving
  • WIDER FACE — face detection
  • LVIS — long-tail detection

Segmentation

  • Cityscapes — urban scene
  • ADE20K — scene parsing
  • PASCAL VOC Segmentation
  • Mapillary Vistas — street view

Medical imaging

  • MURA — bone X-rays
  • CheXpert — chest X-rays
  • ISIC — skin lesions
  • Kaggle Medical Image Datasets

Specialized

  • PlantVillage — plant diseases
  • DeepFashion — fashion images
  • CelebA — face attributes
  • LFW — face recognition

NLP

Text classification

  • IMDB Reviews — sentiment
  • Yelp Reviews — sentiment
  • AG News — topic classification
  • SST-2 — sentiment
  • 20 Newsgroups — topic

NER

  • CoNLL-2003 — English NER
  • OntoNotes 5.0 — multi-genre

Question Answering

  • SQuAD 2.0 — extractive QA
  • Natural Questions — Google
  • TriviaQA

Translation

  • WMT — annual translation
  • OPUS — parallel corpora

Summarization

  • CNN/Daily Mail — news summarization
  • XSum — extreme summarization
  • Reddit TIFU — informal

Multi-task / Modern

  • GLUE / SuperGLUE — NLU benchmark
  • MMLU — knowledge benchmark
  • HellaSwag — common sense

Multilingual

  • OSCAR — multilingual web
  • mC4 — Common Crawl
  • CC-100 — 100+ tillar
  • FLORES — translation benchmark

O'zbek tilidagi datasetlar

Resmiy

  • data.gov.uz — open data
  • stat.uz — statistika
  • lex.uz — qonun hujjatlari

Web scraping mumkin

  • uz.wikipedia.org — Wikipedia dump
  • daryo.uz, kun.uz, gazeta.uz — yangiliklar (legal/personal use)
  • Telegram channellar — public channels (with respect)

HuggingFace

  • HuggingFace'da language:uz qidiring
  • OSCAR-uz — uzbek web corpus
  • mC4-uz — Common Crawl

Audio (speech)

  • Common Voice — Uzbek — Mozilla project
  • Voxlingua107 — language identification

Audio / Speech

Speech recognition

  • LibriSpeech — English audiobooks
  • Common Voice(Mozilla) — multilingual
  • VoxPopuli — European Parliament
  • TED-LIUM — TED talks

Music

  • GTZAN — genre classification
  • FMA (Free Music Archive)
  • MagnaTagATune — auto-tagging

Environmental

  • UrbanSound8K — city sounds
  • ESC-50 — environmental sounds

Video

  • Kinetics-400/700 — action recognition
  • UCF101 — action recognition
  • YouTube-8M — large scale
  • Something-Something — temporal reasoning

Time Series

Finance

  • Yahoo Finance(yfinance library) — stocks
  • Quandl — financial data
  • Kaggle Stock Market

Healthcare

  • MIT-BIH — ECG signals
  • MIMIC — clinical (access kerak)

IoT / Sensors

  • UCI HAR — human activity
  • WESAD — stress detection

Weather / Environment

  • NOAA — climate
  • NASA Earth Data

Multimodal

  • MS COCO Captions — image + text
  • Flickr30k — image + text
  • Visual Question Answering (VQA)
  • AudioSet — video + audio
  • HowTo100M — instruction videos

LLM / RAG

Documentation

  • Wikipedia dump — keng knowledge base
  • arxiv — research papers
  • GitHub repos — code docs
  • StackExchange dumps — Q&A

Conversation

  • Anthropic HH-RLHF — preferences
  • ShareGPT — real ChatGPT logs
  • OpenAssistant — public conversations

Instruction

  • Alpaca — Stanford (52K)
  • Dolly — Databricks (15K)
  • Tulu — AllenAI

🎯 Qaysi dataset qachon?

Yangi mavzuni o'rganishda

  • **Boshlovchi:**Iris, Titanic, MNIST
  • **Klassik ML:**Telco Churn, House Prices
  • **DL boshlash:**CIFAR-10, IMDB
  • **CV:**Pretrained datasets + custom

Portfolio loyiha uchun

  • Original — o'zingiz to'plang (telefon, web scraping)
  • Real-world — Kaggle competitions
  • Lokal — O'zbekiston open data

Production simulation

  • Streaming — Kafka simulated data
  • Live — public APIs (Twitter, Reddit)
  • Syntheticmake_classification, faker library

Tools

Dataset library'lar

# scikit-learn datasets
from sklearn.datasets import load_iris, fetch_california_housing

# HuggingFace datasets
from datasets import load_dataset
ds = load_dataset("squad")

# torchvision
from torchvision import datasets
mnist = datasets.MNIST(root="./", train=True, download=True)

# Kaggle API
!pip install kaggle
!kaggle competitions download -c titanic

Annotation tools

  • Label Studio — open source
  • CVAT — CV annotation
  • Roboflow — CV + datasets management
  • Prodigy — NLP annotation
  • Doccano — text annotation (open source)

Tekshirib qo'ying

  • License — MIT, Apache, CC-BY, CC-BY-SA, va h.k.
  • Commercial use — bepulmi yoki yo'qmi
  • Attribution — manbani ko'rsatish kerakmi
  • PII — shaxsiy ma'lumotlar bormi

Best practices

  • Bias check — dataset balanced/representative emi?
  • Privacy — anonimization
  • Documentation — datasheet, model card
  • Consent — yig'ilgan ma'lumotlar uchun

Cheatsheets ga o'tish.