NLP Basics

🎯 Goal

After reading this chapter, you will:

Know the main task types in NLP (Natural Language Processing)
Be able to build a classic NLP pipeline with spaCy and NLTK
Know the TF-IDF, Word2Vec, and GloVe vector representations
Be ready for the HuggingFace ecosystem (covered in more depth in Month 5)

What to learn

NLP task types: classification, NER, POS, parsing, generation, translation
Tokenization: word, subword (BPE, WordPiece, SentencePiece), char-level
Stemming and lemmatization
Stop words
**Bag of Words (BoW)** and TF-IDF
n-grams
Word embeddings: Word2Vec, GloVe, FastText
POS tagging, dependency parsing
Named Entity Recognition (NER)
Language detection
NLP for the Uzbek language

Libraries

pip install nltk spacy textblob
python -m spacy download en_core_web_sm    # English
python -m spacy download ru_core_news_sm   # Russian (not an Uzbek model; helps with Russian text mixed into Uzbek data)
python -m spacy download xx_ent_wiki_sm    # Multilingual

pip install gensim                          # Word2Vec, topic modeling
pip install langdetect polyglot            # Language detection

pip install scikit-learn                   # TF-IDF

NLP task types

Task	Example	Approach
Text Classification	Sentiment, spam, news category	TF-IDF + LR, BERT
Named Entity Recognition (NER)	“Toshkent” → LOC	spaCy, BERT-NER
Part-of-Speech (POS) Tagging	“yugurish” → VERB	spaCy
Dependency Parsing	Subject-verb-object	spaCy
Text Generation	Auto-complete	GPT, T5
Translation	EN → UZ	MarianMT, GPT-4
Summarization	Long → short text	BART, T5, GPT
Question Answering	Q + Context → Answer	BERT, RoBERTa
Topic Modeling	Articles → topics	LDA, BERTopic
Speech to Text	Audio → text	Whisper
Text Similarity	Sentence pairs	Sentence-BERT

Code examples

NLTK: classic NLP

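import nltk
# Newer NLTK releases (3.9+) look for 'punkt_tab' and 'averaged_perceptron_tagger_eng'
# instead; add them to the list if a LookupError names them.
nltk.download(['punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger'])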

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "Natural Language Processing is amazing! It allows computers to understand human language."

# Sentence tokenization
sents = sent_tokenize(text)
# ['Natural Language Processing is amazing!', 'It allows computers to understand human language.']

# Word tokenization
words = word_tokenize(text)
# ['Natural', 'Language', 'Processing', 'is', ...]

# Stop words removal
stop_words = set(stopwords.words('english'))
filtered = [w for w in words if w.lower() not in stop_words and w.isalpha()]

# Stemming
stemmer = PorterStemmer()
stems = [stemmer.stem(w) for w in filtered]
# 'amazing' → 'amaz', 'computers' → 'comput'

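# Lemmatization (returns dictionary forms; usually better than stemming).
# Without a `pos` argument WordNetLemmatizer treats every word as a noun.
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(w.lower()) for w in filtered]
# 'computers' → 'computer'; for verbs pass pos="v": lemmatize("allows", pos="v") → 'allow'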
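The NLTK example splits text into whole words, while the topic list also names subword tokenization (BPE, WordPiece, SentencePiece). The following is a minimal sketch of how BPE learns its merge rules, assuming a toy word list and an end-of-word marker `</w>`; the function names `merge_pair`, `learn_bpe`, and `apply_bpe` are illustrative and do not come from any library. Production tokenizers add byte-level handling, pre-tokenization, and much larger corpora.

```python
from collections import Counter

def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Fuse every adjacent occurrence of `pair` in `symbols` into one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair fused.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def apply_bpe(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=5)
pieces = apply_bpe("lowest", merges)
print(merges)
print(pieces)  # ['low', 'est</w>']
```

The word "lowest" never appears in the toy corpus, yet it is segmented into two known pieces; this is the property that lets subword vocabularies cover unseen words.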

spaCy — modern NLP pipeline

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion in 2024.")

# Per token: text, POS tag, dependency label, lemma
for token in doc:
    print(f"{token.text:15s} {token.pos_:10s} {token.dep_:10s} {token.lemma_}")

# Named Entity Recognition
for ent in doc.ents:
    print(f"{ent.text:20s} {ent.label_}")
# Typical output (may vary with the model version):
# Apple                ORG
# U.K.                 GPE
# $1 billion           MONEY
# 2024                 DATE

# Noun chunks
for chunk in doc.noun_chunks:
    print(chunk.text)

TF-IDF — Bag of Words

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Natural language processing is fun",
    "Machine learning powers natural language processing",
    "Deep learning has revolutionized NLP",
    "Backend development requires understanding APIs",
]

vectorizer = TfidfVectorizer(
    max_features=100,
    ngram_range=(1, 2),       # unigrams + bigrams
    stop_words="english",
    min_df=1,
    max_df=0.95,
)

X = vectorizer.fit_transform(corpus)
print(X.shape)                                # (4, n_features); max_features=100 is only an upper bound
print(vectorizer.get_feature_names_out()[:10])
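To make the numbers behind `TfidfVectorizer` less opaque, here is a hand-rolled sketch in plain Python. It assumes whitespace tokenization and no stop-word removal, so its vocabulary will not match the scikit-learn example exactly; the weighting follows scikit-learn's documented defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document). The helper names `ngrams`, `tfidf`, and `cosine` are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens: list[str], n: int) -> list[str]:
    """Return the contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs: list[str]) -> list[dict[str, float]]:
    """TF-IDF over unigrams + bigrams; one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    terms_per_doc = [toks + ngrams(toks, 2) for toks in tokenized]

    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for terms in terms_per_doc for term in set(terms))
    # Smoothed IDF: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for terms in terms_per_doc:
        tf = Counter(terms)  # raw counts: this alone is the bag-of-words vector
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Sparse dot product; equals cosine similarity because vectors are unit length."""
    return sum(w * b[t] for t, w in a.items() if t in b)

docs = [
    "natural language processing is fun",
    "machine learning powers natural language processing",
    "backend development requires understanding apis",
]
vecs = tfidf(docs)
print(cosine(vecs[0], vecs[1]))  # positive: the documents share terms
print(cosine(vecs[0], vecs[2]))  # zero: no terms in common
```

Within the first document, "fun" (one document) ends up with a larger weight than "natural" (two documents), which is the effect IDF is designed to produce.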

Text classification: TF-IDF + Logistic Regression baseline

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=5000)),
    ("clf", LogisticRegression(max_iter=1000)),   # swap in MultinomialNB() for a Naive Bayes baseline
])

# train_texts / test_texts are lists of strings and train_labels / test_labels
# their class labels; load them from your own dataset
pipeline.fit(train_texts, train_labels)
accuracy = pipeline.score(test_texts, test_labels)

# Predict for new text
prediction = pipeline.predict(["This product is excellent!"])

Word2Vec — embeddings

from gensim.models import Word2Vec

sentences = [
    ["natural", "language", "processing"],
    ["machine", "learning", "models"],
    ["deep", "learning", "neural", "networks"],
    # ...
]

model = Word2Vec(
    sentences,
    vector_size=100,
    window=5,
    min_count=1,
    workers=4,
    epochs=10,
)

# Vector for a single word
vec = model.wv["natural"]                     # shape (100,)

# Most similar words (not meaningful on a toy corpus this small)
similar = model.wv.most_similar("natural", topn=5)

# Cosine similarity between two words
sim = model.wv.similarity("language", "processing")
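The `window=5` argument in the Word2Vec call controls how far from each word the model looks for context. The sketch below shows how (center, context) pairs are formed for the skip-gram variant; note that gensim defaults to CBOW (`sg=0`) and needs `sg=1` for skip-gram. This is a simplified illustration: the function name `skipgram_pairs` is made up here, and real implementations add details such as negative sampling and randomly shrinking the window.

```python
def skipgram_pairs(sentence: list[str], window: int) -> list[tuple[str, str]]:
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs

pairs = skipgram_pairs(["deep", "learning", "neural", "networks"], window=1)
for center, context in pairs:
    print(center, "->", context)
```

With `window=1` each word is paired only with its direct neighbours; a larger window adds more distant words and tends to capture more topical, less syntactic similarity.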

Pretrained embeddings (GloVe)

import gensim.downloader

# Roughly 130 MB download; GloVe trained on Wikipedia 2014 + Gigaword 5 (6B tokens)
model = gensim.downloader.load("glove-wiki-gigaword-100")

print(model["king"].shape)                    # (100,)
print(model.most_similar("king", topn=5))
print(model.most_similar(positive=["king", "woman"], negative=["man"]))
# "queen" is typically the top result
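What `most_similar(positive=..., negative=...)` does can be sketched with vector arithmetic on a tiny hand-made vocabulary. The 3-dimensional vectors below are hypothetical values chosen for illustration (the axes loosely mean royalty, male, female), not real GloVe data, and `analogy` is an illustrative helper. gensim additionally normalizes the input vectors before combining them; this sketch adds and subtracts them directly.

```python
import math

# Hypothetical toy embeddings: (royalty, male, female)
toy = {
    "king":   [0.9, 0.9, 0.1],
    "queen":  [0.9, 0.1, 0.9],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.8, 0.9, 0.2],
    "apple":  [0.1, 0.2, 0.2],
}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    exclude = set(positive) | set(negative)  # query words are never returned
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

result = analogy(toy, positive=["king", "woman"], negative=["man"])
print(result[0][0])  # queen
```

Excluding the query words matters: without it, the nearest neighbour of "king - man + woman" is often "king" itself.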

Language detection

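from langdetect import DetectorFactory, detect, detect_langs

DetectorFactory.seed = 0                      # results are non-deterministic without a fixed seed

# langdetect has no Uzbek profile, so Uzbek text comes back as some other language
print(detect("Salom! Mening ismim Ali."))     # not 'uz'; the exact code varies
print(detect_langs("Hello, how are you?"))    # e.g. [en:0.99...]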

NLP for the Uzbek language

Current state

Few resources: pretrained Uzbek models for NLP are scarce
Upside: multilingual models (mBERT, XLM-R, mT5) partially support Uzbek
Both the Latin and Cyrillic scripts must be handled
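Because Uzbek text arrives in both scripts, a pipeline usually needs to know which one it is looking at before normalizing. A minimal sketch using only the standard library follows; it relies on Unicode character names, and both the function name `detect_script` and the `min_share=0.9` threshold are assumptions made for illustration.

```python
import unicodedata

def detect_script(text: str, min_share: float = 0.9) -> str:
    """Return 'latin', 'cyrillic', 'mixed', or 'unknown' for a piece of text."""
    latin = cyrillic = 0
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            latin += 1
        elif name.startswith("CYRILLIC"):
            cyrillic += 1
    total = latin + cyrillic
    if total == 0:
        return "unknown"
    share = cyrillic / total
    if share >= min_share:
        return "cyrillic"
    if share <= 1 - min_share:
        return "latin"
    return "mixed"

print(detect_script("Toshkent shahri"))   # latin
print(detect_script("Тошкент шаҳри"))     # cyrillic
print(detect_script("Toshkent шаҳри"))    # mixed
```

A "mixed" result is a useful signal in itself: it often marks text with Russian insertions or partially transliterated words that need extra care.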

Useful resources

Uzbek models on HuggingFace (search: uzbek)
OpenAI/Anthropic: GPT-4 and Claude understand Uzbek well (Month 5)
Whisper: can transcribe Uzbek speech
Common Voice: Uzbek dataset (Mozilla)

Working with Uzbek text

import spacy

# Multilingual NER model (not trained on Uzbek specifically; expect partial results)
nlp = spacy.load("xx_ent_wiki_sm")

text = "Toshkent shahri 2024 yilda yangi loyihalar boshladi."
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)

# Better: an XLM-R based model from HuggingFace (Month 5)

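Latin → Cyrillic converter (simple)

import re

LATIN_TO_CYRILLIC = {
    "sh": "ш", "ch": "ч", "yo": "ё", "yu": "ю", "ya": "я", "o'": "ў", "g'": "ғ",
    "a": "а", "b": "б", "d": "д", "e": "е", "f": "ф", "g": "г", "h": "ҳ",
    "i": "и", "j": "ж", "k": "к", "l": "л", "m": "м", "n": "н", "o": "о",
    "p": "п", "q": "қ", "r": "р", "s": "с", "t": "т", "u": "у", "v": "в",
    "x": "х", "y": "й", "z": "з", "'": "ъ",
}

# Longest keys first so digraphs ("sh", "o'") win over single letters
_PATTERN = re.compile("|".join(
    re.escape(key) for key in sorted(LATIN_TO_CYRILLIC, key=len, reverse=True)
))

def latin_to_cyrillic(text: str) -> str:
    # Simplified: lowercases the input and skips rules such as
    # word-initial "e" → "э" and "ts" → "ц"
    text = text.lower()
    for variant in ("ʻ", "’", "‘"):           # normalize apostrophe variants
        text = text.replace(variant, "'")
    return _PATTERN.sub(lambda m: LATIN_TO_CYRILLIC[m.group(0)], text)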
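The heading mentions both directions, while the converter above only goes from Latin to Cyrillic. The reverse direction is simpler because every Cyrillic letter maps to a fixed Latin string, so `str.translate` is enough. The sketch below is self-contained and equally simplified: the table `CYRILLIC_TO_LATIN` and the function `cyrillic_to_latin` are illustrative names, and context rules such as word-initial "е" → "ye" are deliberately left out.

```python
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "j", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "x", "ц": "ts", "ч": "ch", "ш": "sh", "ъ": "'", "ь": "",
    "э": "e", "ю": "yu", "я": "ya", "ў": "o'", "қ": "q", "ғ": "g'", "ҳ": "h",
}

# Build one translation table covering lowercase and uppercase letters.
_TABLE = {}
for cyr, lat in CYRILLIC_TO_LATIN.items():
    _TABLE[ord(cyr)] = lat
    _TABLE[ord(cyr.upper())] = lat.capitalize()  # "Ш" → "Sh", "Ў" → "O'"

def cyrillic_to_latin(text: str) -> str:
    """Transliterate Uzbek Cyrillic to Latin; other characters pass through unchanged."""
    return text.translate(_TABLE)

print(cyrillic_to_latin("Тошкент шаҳри"))  # Toshkent shahri
print(cyrillic_to_latin("Ўзбекистон"))     # O'zbekiston
```

Unlike the Latin-to-Cyrillic sketch, this version preserves capitalization, since each input character is handled on its own.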

Backend integration

Text classification API

from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
pipeline = joblib.load("text_classifier.joblib")  # TfidfVectorizer + Classifier

class TextInput(BaseModel):
    text: str
    language: str = "en"

@app.post("/classify")
def classify_text(data: TextInput):
    prediction = pipeline.predict([data.text])[0]
    proba = pipeline.predict_proba([data.text])[0]

    return {
        "predicted_class": str(prediction),
        "confidence": float(proba.max()),
        # Cast class labels to str so non-string labels (e.g. NumPy ints) serialize as JSON keys
        "all_probabilities": {str(c): float(p) for c, p in zip(pipeline.classes_, proba)},
    }

NER + POS statistics endpoint

import spacy

# Reuses `app` and `TextInput` from the classification API above
nlp_en = spacy.load("en_core_web_sm")

@app.post("/analyze")
def analyze_text(data: TextInput):
    doc = nlp_en(data.text)
    
    entities = [
        {"text": ent.text, "type": ent.label_, "start": ent.start_char, "end": ent.end_char}
        for ent in doc.ents
    ]
    
    pos_counts = {}
    for token in doc:
        pos_counts[token.pos_] = pos_counts.get(token.pos_, 0) + 1
    
    return {
        "entities": entities,
        "pos_distribution": pos_counts,
        "tokens": len(doc),
        "sentences": len(list(doc.sents)),
    }

Text similarity service

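import gensim.downloader as api
import numpy as np
from pydantic import BaseModel

model = api.load("glove-wiki-gigaword-100")

class SimilarityInput(BaseModel):
    text1: str
    text2: str

def text_to_vector(text: str) -> np.ndarray:
    # Average the vectors of the in-vocabulary words
    words = text.lower().split()
    vectors = [model[w] for w in words if w in model]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # A zero vector (no known words) would otherwise produce NaN
    return float(np.dot(a, b) / denom) if denom else 0.0

# Reuses `app` from the classification API above; texts go in the JSON body
@app.post("/similarity")
def similarity(data: SimilarityInput):
    v1 = text_to_vector(data.text1)
    v2 = text_to_vector(data.text2)
    return {"similarity": cosine_similarity(v1, v2)}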

Resources

NLTK Book: nltk.org/book
spaCy docs: spacy.io
“Speech and Language Processing” by Jurafsky & Martin (free PDF; the standard reference)
HuggingFace NLP Course: free, preparation for Month 5
Stanford NLP videos: Chris Manning
gensim docs: Word2Vec, topic modeling

🏋️ Exercises

🟢 Easy

Tokenize a text, remove stop words, and lemmatize it.
Run POS tagging and NER with spaCy.
Compute the similarity between 5 documents using TF-IDF.

🟡 Medium

News classification: 4-5 categories (BBC dataset), TF-IDF + Logistic Regression, 90%+ accuracy.
Spam classifier: SMS Spam dataset, compare Naive Bayes vs LogReg.
NER pipeline: find the named entities in a text and group them by type.

🔴 Hard

Uzbek text classifier: collect your own dataset from Telegram channels (2-3 categories), TF-IDF + LR baseline.
NER service: FastAPI + spaCy + caching (Redis), optimized for high RPS.
Topic modeling: split 1000+ documents into topics with LDA or BERTopic and visualize the result.

Capstone

notebooks/month-04/04_nlp_basics.ipynb:

**Project:** a dataset of Uzbek-language news (Daryo.uz, Kun.uz) or Telegram channel posts
Baseline classifier with TF-IDF + Logistic Regression
NER with the spaCy multilingual model
Train Word2Vec and find similar words
FastAPI service

✅ Checklist

I know the difference between tokenization, stemming, and lemmatization
I know how to use BoW and TF-IDF
I can do NER, POS tagging, and parsing with spaCy
I can use Word2Vec and GloVe embeddings
I can build a text classification baseline (TF-IDF + LR)
I know the limitations of NLP for the Uzbek language
I can create an NLP endpoint in FastAPI

Next up: Text Preprocessing.

Backend to ML: 6-Month Roadmap