[AIR-3][AIS-3][BPC-3][RES-3]
# AI Model Development Guide
This guide outlines the process for developing and training AI models for the Anya Core platform.
## Table of Contents
- Overview
- Development Environment
- Project Structure
- Model Architecture
- Training Pipeline
- Evaluation
- Deployment
- Monitoring
- Best Practices
- Troubleshooting
## Overview
This guide covers the end-to-end process of developing AI models for Anya Core, from setting up the development environment to deploying models in production.
## Development Environment

### Prerequisites
- Python 3.8+
- CUDA 11.7+ (for GPU acceleration)
- Docker (optional)
- Git LFS (for large model files)
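You can verify the toolchain from a shell before continuing, for example:

```bash
python --version   # expect 3.8+
nvcc --version     # CUDA toolkit 11.7+, only needed for GPU training
git lfs version    # confirms Git LFS is installed
```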
### Setup
- Clone the repository:

  ```bash
  git clone https://github.com/anya-org/anya-core.git
  cd anya-core/ai
  ```

- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements-dev.txt
  pip install -e .
  ```

- Install pre-commit hooks:

  ```bash
  pre-commit install
  ```
## Project Structure

```text
ai/
├── configs/        # Model configurations
├── data/           # Dataset storage
│   ├── raw/        # Raw data
│   ├── processed/  # Processed data
│   └── splits/     # Train/val/test splits
├── models/         # Model implementations
├── notebooks/      # Jupyter notebooks
├── scripts/        # Utility scripts
├── tests/          # Unit and integration tests
├── training/       # Training pipelines
└── utils/          # Utility functions
```
## Model Architecture

### Design Principles
- Modularity: Separate model architecture from training logic
- Reproducibility: Ensure experiments are reproducible
- Scalability: Design for distributed training
- Maintainability: Follow clean code principles
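As a minimal sketch of the modularity principle, architecture hyperparameters can live in a config object that the training code never unpacks by hand (the `TransformerConfig` and `build_model` names here are illustrative, not part of the codebase; `TransformerModel` is defined in the next subsection):

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    """Architecture hyperparameters, kept separate from training logic."""
    vocab_size: int = 32000
    d_model: int = 512
    nhead: int = 8
    num_layers: int = 6
    dropout: float = 0.1

def build_model(cfg: TransformerConfig) -> "TransformerModel":
    # The training pipeline depends only on this factory,
    # not on the constructor's argument list.
    return TransformerModel(cfg.vocab_size, cfg.d_model, cfg.nhead,
                            cfg.num_layers, cfg.dropout)
```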
### Example Model
```python
# models/transformer.py
import math

import torch.nn as nn

class TransformerModel(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, num_layers, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        # PositionalEncoding is assumed to be defined elsewhere in models/
        self.pos_encoder = PositionalEncoding(d_model, dropout)
        encoder_layers = nn.TransformerEncoderLayer(d_model, nhead, dropout=dropout)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, num_layers)
        self.decoder = nn.Linear(d_model, vocab_size)

    def forward(self, src, src_mask=None):
        src = self.embedding(src) * math.sqrt(self.d_model)
        src = self.pos_encoder(src)
        output = self.transformer_encoder(src, src_mask)
        output = self.decoder(output)
        return output
```
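A quick smoke test of the forward pass, assuming `PositionalEncoding` is available. Inputs follow the default `nn.TransformerEncoder` convention of `(sequence, batch)` since `batch_first` is not set:

```python
import torch

model = TransformerModel(vocab_size=1000, d_model=128, nhead=4, num_layers=2)
tokens = torch.randint(0, 1000, (32, 8))  # (sequence, batch)
logits = model(tokens)
print(logits.shape)                       # torch.Size([32, 8, 1000])
```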
## Training Pipeline

### Data Preparation

- **Data Loading**:
```python
import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        encoding = self.tokenizer(
            text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }
```
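Wiring the dataset into a `DataLoader` might look like the sketch below; the tokenizer choice and the `train_texts`/`train_labels` lists are illustrative assumptions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example choice
train_dataset = TextDataset(train_texts, train_labels, tokenizer, max_length=128)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4)
```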
- **Training Loop**:
```python
import torch
from tqdm import tqdm

def train_epoch(model, data_loader, optimizer, device, scheduler=None):
    model.train()
    total_loss = 0
    for batch in tqdm(data_loader, desc="Training"):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad()
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        loss = outputs.loss  # models that accept `labels` return the loss directly
        total_loss += loss.item()

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        if scheduler:
            scheduler.step()
    return total_loss / len(data_loader)
```
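The hyperparameter-tuning example below calls an `evaluate` helper. A minimal sketch of it, mirroring `train_epoch`:

```python
def evaluate(model, data_loader, device):
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids=input_ids,
                            attention_mask=attention_mask,
                            labels=labels)
            total_loss += outputs.loss.item()
    return total_loss / len(data_loader)
```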
### Hyperparameter Tuning
Use Optuna for hyperparameter optimization:
```python
import optuna
import torch

def objective(trial):
    # Define hyperparameters to tune
    lr = trial.suggest_float('lr', 1e-5, 1e-3, log=True)
    batch_size = trial.suggest_categorical('batch_size', [16, 32, 64])
    num_epochs = trial.suggest_int('num_epochs', 3, 10)

    # Initialize model and data loaders
    # (initialize_model and get_data_loaders are project-specific helpers)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = initialize_model().to(device)
    train_loader, val_loader = get_data_loaders(batch_size)

    # Training loop
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for epoch in range(num_epochs):
        train_loss = train_epoch(model, train_loader, optimizer, device)
        val_loss = evaluate(model, val_loader, device)

        # Report intermediate results and handle pruning
        trial.report(val_loss, epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()

    return val_loss

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100)
```
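Once the study finishes, the winning configuration is available on the study object:

```python
print(f"Best value: {study.best_value:.4f}")
print(f"Best hyperparameters: {study.best_params}")
```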
## Evaluation

### Metrics
```python
from sklearn.metrics import (
    accuracy_score, precision_score,
    recall_score, f1_score, roc_auc_score
)

def calculate_metrics(preds, probs, labels):
    # Note: with multi_class='ovr', roc_auc_score needs class probabilities
    # (probs), not hard predictions.
    return {
        'accuracy': accuracy_score(labels, preds),
        'precision': precision_score(labels, preds, average='weighted'),
        'recall': recall_score(labels, preds, average='weighted'),
        'f1': f1_score(labels, preds, average='weighted'),
        'roc_auc': roc_auc_score(labels, probs, multi_class='ovr')
    }
```
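A sketch of producing `preds` and `probs` from model outputs, assuming a classification head and batched inputs already on the right device:

```python
import torch

with torch.no_grad():
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
probs = torch.softmax(logits, dim=-1).cpu().numpy()
preds = probs.argmax(axis=-1)
metrics = calculate_metrics(preds, probs, labels.cpu().numpy())
```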
### Cross-Validation
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Define cross-validation strategy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation
# Note: cross_val_score expects a scikit-learn-compatible estimator; a raw
# PyTorch model needs a wrapper (e.g., skorch) before it can be used here.
scores = cross_val_score(
    model, X, y,
    cv=cv,
    scoring='f1_weighted',
    n_jobs=-1
)

print(f"Cross-validation scores: {scores}")
print(f"Mean CV score: {scores.mean():.4f} ± {scores.std():.4f}")
```
## Deployment

### Model Packaging
```python
import torch
from transformers import AutoModelForSequenceClassification

# Load trained model and switch to inference mode before export
model = AutoModelForSequenceClassification.from_pretrained("path/to/model")
model.eval()

# Save model
model.save_pretrained("deploy/model")

# Convert to ONNX
dummy_input = torch.zeros(1, 128, dtype=torch.long)
torch.onnx.export(
    model,
    dummy_input,
    "deploy/model.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"}
    }
)
```
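It is worth sanity-checking the exported graph with ONNX Runtime before shipping it; a minimal sketch, assuming `onnxruntime` is installed:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("deploy/model.onnx")
dummy = np.zeros((1, 128), dtype=np.int64)  # matches the torch.long export
(logits,) = session.run(["logits"], {"input_ids": dummy})
print(logits.shape)  # (1, num_labels)
```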
### API Service
```python
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer

app = FastAPI()
model = None
tokenizer = None

@app.on_event("startup")
def load_model():
    global model, tokenizer
    model = torch.jit.load("deploy/model.pt")
    model.eval()
    # Assumes the tokenizer was saved alongside the model artifacts
    tokenizer = AutoTokenizer.from_pretrained("deploy/model")

class TextRequest(BaseModel):
    text: str

@app.post("/predict")
async def predict(request: TextRequest):
    inputs = tokenizer(
        request.text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=128
    )
    with torch.no_grad():
        # Assumes the scripted model preserves the `.logits` output attribute
        outputs = model(**inputs)
        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
        pred = torch.argmax(probs, dim=-1).item()
    return {"prediction": pred, "confidence": probs[0][pred].item()}
```
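With the service running (for example `uvicorn main:app --port 8080`, where `main` is whatever module holds `app`), a request might look like:

```bash
curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "example input"}'
# => {"prediction": 1, "confidence": 0.97}   (illustrative output)
```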
## Monitoring

### Logging
```python
import json
import logging
from datetime import datetime
from logging.handlers import RotatingFileHandler

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        RotatingFileHandler('logs/model.log', maxBytes=10485760, backupCount=5),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

def log_prediction(input_data, prediction, confidence):
    log_entry = {
        'timestamp': datetime.utcnow().isoformat(),
        'input': input_data,
        'prediction': prediction,
        'confidence': confidence,
        'metadata': {
            'model_version': '1.0.0',
            'environment': 'production'
        }
    }
    logger.info(json.dumps(log_entry))
```
### Performance Monitoring
```python
from prometheus_client import start_http_server, Summary, Counter, Gauge

# Create metrics
REQUEST_LATENCY = Summary('request_latency_seconds', 'Request latency')
PREDICTION_COUNTER = Counter('predictions_total', 'Total predictions', ['label'])
MODEL_CONFIDENCE = Gauge('model_confidence', 'Model confidence', ['label'])

@REQUEST_LATENCY.time()
def predict(input_data):
    # Make prediction (model is assumed to be loaded already)
    prediction, confidence = model.predict(input_data)

    # Update metrics
    PREDICTION_COUNTER.labels(label=prediction).inc()
    MODEL_CONFIDENCE.labels(label=prediction).set(confidence)
    return prediction, confidence

# Start metrics server
start_http_server(8000)
```
## Best Practices

### 1. Code Quality
- Follow PEP 8 style guide
- Use type hints
- Write docstrings and comments
- Maintain high test coverage
### 2. Model Development

- Start with a simple baseline
- Use version control for models
- Document all experiments
- Track hyperparameters and metrics (see the tracking sketch below)
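One common option for experiment tracking is MLflow; a minimal sketch (MLflow is an illustrative choice here, not a mandated tool, and the logged values reuse names from the training pipeline above):

```python
import mlflow

with mlflow.start_run(run_name="baseline-transformer"):
    mlflow.log_params({"lr": 3e-5, "batch_size": 32, "num_epochs": 5})
    for epoch in range(num_epochs):
        train_loss = train_epoch(model, train_loader, optimizer, device)
        mlflow.log_metric("train_loss", train_loss, step=epoch)
```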
### 3. Reproducibility

- Use fixed random seeds (see the sketch below)
- Pin dependency versions
- Save model checkpoints
- Log all hyperparameters
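A minimal seed-fixing helper for PyTorch experiments:

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Fix all relevant RNGs so runs are repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade speed for determinism in cuDNN kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```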
### 4. Performance

- Profile your code
- Use mixed precision training
- Optimize data loading
- Implement gradient accumulation (see the sketch below)
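A sketch combining mixed precision and gradient accumulation in one training step; `accum_steps = 4` is an illustrative choice, and `model`, `optimizer`, `train_loader`, and `device` reuse names from the training pipeline above:

```python
import torch

scaler = torch.cuda.amp.GradScaler()
accum_steps = 4  # effective batch size = batch_size * accum_steps

optimizer.zero_grad()
for step, batch in enumerate(train_loader):
    with torch.cuda.amp.autocast():
        outputs = model(input_ids=batch['input_ids'].to(device),
                        attention_mask=batch['attention_mask'].to(device),
                        labels=batch['labels'].to(device))
        loss = outputs.loss / accum_steps  # scale loss for accumulation
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```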
## Troubleshooting

### Common Issues
- **CUDA Out of Memory**
  - Reduce batch size
  - Use gradient accumulation
  - Enable gradient checkpointing
- **Training is Slow**
  - Enable mixed precision training
  - Use a larger batch size
  - Profile data loading
- **Model Not Converging**
  - Check learning rate
  - Verify data preprocessing
  - Try a different optimizer
### Debugging Tools
```python
import torch

# Check for NaNs: raise an error at the operation that produced
# NaN/Inf during the backward pass
torch.autograd.set_detect_anomaly(True)

# Profile code
with torch.profiler.profile() as prof:
    # Your code here
    pass
print(prof.key_averages().table(sort_by="self_cuda_time_total"))
```
## Conclusion
This guide provides a comprehensive overview of AI model development for the Anya Core platform. For more information, refer to the API documentation and examples.