Databricks ML Associate Exam - Crash Study Material
Section 1: Databricks Machine Learning
- Understand MLOps best practices in Databricks
- Advantages of ML runtimes (e.g., optimized libraries, GPU support)
- Understand AutoML's role in automating data preparation, model selection, and hyperparameter tuning (see the AutoML sketch after this list)
- Advantages of AutoML: saves time, produces auto-generated notebooks that improve transparency and reproducibility
- Difference between account-level (Unity Catalog) and workspace-level (legacy Workspace Feature Store) feature tables
- Steps to create and register a feature store table in Unity Catalog (see the feature engineering sketch after this list)
- Write data to feature store tables using Python APIs
- Train models using features directly from feature store tables
- Use feature tables for scoring and inference
- Differentiate between online (low latency) and offline (batch) feature tables
- Use the MLflow Client API to identify the best run based on logged metrics (see the MLflow sketch after this list)
- Manually log metrics, parameters, models, artifacts using MLflow tracking
- Explore the MLflow UI for experiment insights
- Register models via MLflow Client API into Unity Catalog
- Understand benefits of using Unity Catalog model registry
- Model promotion vs. code promotion (deploy-model vs. deploy-code patterns) in production scenarios
- Set or remove tags for a model for metadata tracking
- Promote challenger model to champion using aliases
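
A minimal AutoML classification sketch; the DataFrame `df`, the `label` column name, the metric choice, and the timeout are illustrative assumptions:

```python
from databricks import automl

# Kick off an AutoML classification run; AutoML handles feature
# preprocessing, model selection, and hyperparameter tuning.
summary = automl.classify(
    dataset=df,              # assumed: a Spark DataFrame already loaded
    target_col="label",      # assumed label column name
    primary_metric="f1",     # metric used to rank trials
    timeout_minutes=30,      # illustrative time budget
)

print(summary.best_trial.model_path)    # MLflow URI of the best model
print(summary.best_trial.notebook_url)  # its auto-generated training notebook
```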
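A sketch of the Unity Catalog feature engineering workflow; the catalog/schema/table names, the key and label columns, and the DataFrames `features_df`, `new_features_df`, and `labels_df` are all assumptions for illustration:

```python
from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup

fe = FeatureEngineeringClient()

# 1. Create and register a feature table in Unity Catalog
fe.create_table(
    name="main.ml.customer_features",
    primary_keys=["customer_id"],
    df=features_df,
    description="Customer-level features",
)

# 2. Write (merge) new feature values via the Python API
fe.write_table(name="main.ml.customer_features", df=new_features_df, mode="merge")

# 3. Train on features looked up from the feature table
training_set = fe.create_training_set(
    df=labels_df,  # rows with customer_id + label
    feature_lookups=[
        FeatureLookup(table_name="main.ml.customer_features", lookup_key="customer_id")
    ],
    label="label",
)
train_df = training_set.load_df()

# 4. Batch scoring: feature lookups are replayed automatically,
#    provided the model was logged with fe.log_model()
predictions = fe.score_batch(
    model_uri="models:/main.ml.churn_model/1",  # illustrative UC model URI
    df=labels_df.select("customer_id"),
)
```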
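A sketch of the MLflow tracking/registry flow: log manually, find the best run, register into Unity Catalog, tag, and promote via an alias. The metric name, the trained `model` object, and the `main.ml.churn_model` name are assumptions:

```python
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_registry_uri("databricks-uc")  # register models into Unity Catalog
client = MlflowClient()

# Manually log params, metrics, and a model
# (model is an assumed, already-trained sklearn estimator)
with mlflow.start_run() as run:
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("f1", 0.87)
    mlflow.sklearn.log_model(model, artifact_path="model")

# Identify the best run in the experiment by a logged metric
best_run = client.search_runs(
    experiment_ids=[run.info.experiment_id],
    order_by=["metrics.f1 DESC"],
    max_results=1,
)[0]

# Register the best run's model into Unity Catalog
mv = mlflow.register_model(f"runs:/{best_run.info.run_id}/model", "main.ml.churn_model")

# Tag the version and promote the challenger to champion via an alias
client.set_model_version_tag("main.ml.churn_model", mv.version, "validated", "true")
client.set_registered_model_alias("main.ml.churn_model", "champion", mv.version)
```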
Section 2: Data Processing
- Use `.summary()`/`.describe()` or `dbutils.data.summarize()` for Spark DataFrame statistics (see the EDA sketch after this list)
- Outlier detection using standard deviation or IQR techniques
- Visualize categorical (bar plots) and continuous (histograms, boxplots) data
- Compare categorical features (chi-square test) and continuous features (correlation, t-tests)
- When and how to use mean/median/mode for missing value imputation
- Apply imputation using pandas, sklearn, or Spark's `Imputer` (see the imputation sketch after this list)
- Perform one-hot encoding using sklearn's `OneHotEncoder` or Spark's `OneHotEncoder` (renamed from `OneHotEncoderEstimator` in Spark 3.0)
- Understand when one-hot encoding is useful or not (e.g., high cardinality)
- When to apply log scaling (e.g., skewed features, exponential distributions); see the log-transform sketch after this list
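
An EDA sketch for the statistics and outlier bullets above; the numeric column name `amount` is an assumption:

```python
# Spark DataFrame statistics
df.summary().show()         # count, mean, stddev, min, quartiles, max
dbutils.data.summarize(df)  # Databricks notebooks only: interactive data profile

# IQR-based outlier bounds for one numeric column (pandas)
pdf = df.select("amount").toPandas()
q1, q3 = pdf["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = pdf[(pdf["amount"] < lower) | (pdf["amount"] > upper)]
```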
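A sketch of median imputation and one-hot encoding in Spark; the column names are assumptions:

```python
from pyspark.ml.feature import Imputer, OneHotEncoder, StringIndexer

# Median imputation for numeric columns
imputer = Imputer(
    inputCols=["age", "income"],
    outputCols=["age_imp", "income_imp"],
    strategy="median",
)
imputed_df = imputer.fit(df).transform(df)

# One-hot encoding: index the string column first, then encode the index
indexer = StringIndexer(inputCol="state", outputCol="state_idx", handleInvalid="keep")
indexed_df = indexer.fit(imputed_df).transform(imputed_df)

encoder = OneHotEncoder(inputCols=["state_idx"], outputCols=["state_ohe"])
encoded_df = encoder.fit(indexed_df).transform(indexed_df)
```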
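A one-liner for the log-scaling bullet, assuming a right-skewed, non-negative column `amount`:

```python
from pyspark.sql import functions as F

# log1p handles zeros; remember to invert with expm1 at prediction time
df = df.withColumn("amount_log", F.log1p("amount"))
```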
Section 3: Model Development
- Select algorithms based on task: classification, regression, clustering
- Handle data imbalance: SMOTE, class weights, resampling, threshold tuning
- Difference between estimators (`.fit()` returns a fitted model/transformer) and transformers (`.transform()` maps one DataFrame to another) in Spark
- Build end-to-end training pipelines using the `Pipeline()` API (see the pipeline sketch after this list)
- Use `fmin` from Hyperopt to tune models (Bayesian-style TPE optimization; see the Hyperopt sketch after this list)
- Implement GridSearch, RandomSearch, and Hyperopt tuning techniques
- Use Spark parallelism to scale hyperparameter tuning jobs
- Cross-validation vs. train-validation split: pros/cons
- Implement `CrossValidator()` in Spark or `cross_val_score()` in sklearn (see the CV sketch after this list)
- Understand the number of models trained in grid search + CV (grid combinations × folds, plus one final refit on the full training set)
- Classification metrics: F1, AUC, Precision, Recall, LogLoss
- Regression metrics: RMSE, MAE, R2
- Choose right metric for task (business goal focused)
- Exponentiate log-transformed predictions before interpretation or evaluation (see the metrics sketch after this list)
- Understand bias-variance tradeoff and its effect on model complexity
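
A pipeline sketch; the column names and the choice of LogisticRegression are assumptions. It also illustrates the estimator/transformer split: `fit()` produces a `PipelineModel`, whose `transform()` scores data:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, VectorAssembler

indexer = StringIndexer(inputCol="state", outputCol="state_idx")
assembler = VectorAssembler(inputCols=["state_idx", "age", "income"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[indexer, assembler, lr])
model = pipeline.fit(train_df)          # estimator: fit() returns a PipelineModel
predictions = model.transform(test_df)  # transformer: transform() adds predictions
```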
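A Hyperopt sketch with `SparkTrials` to distribute trials across the cluster; `X`, `y`, the search space, and the evaluation budget are assumptions:

```python
from hyperopt import SparkTrials, STATUS_OK, fmin, hp, tpe
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(params):
    # Hyperopt minimizes, so return the negative F1 score
    model = RandomForestClassifier(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),
    )
    score = cross_val_score(model, X, y, cv=3, scoring="f1").mean()
    return {"loss": -score, "status": STATUS_OK}

search_space = {
    "n_estimators": hp.quniform("n_estimators", 50, 300, 50),
    "max_depth": hp.quniform("max_depth", 3, 10, 1),
}

best = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,                   # Bayesian-style TPE search
    max_evals=20,
    trials=SparkTrials(parallelism=4),  # run 4 trials at a time on the cluster
)
```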
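A `CrossValidator` sketch reusing `pipeline` and `lr` from the pipeline sketch above; grid values and fold count are assumptions. Note the model count: 2 × 3 = 6 grid combinations × 3 folds = 18 models, plus one final refit on the full training set:

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])             # 2 values
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])  # x 3 values = 6 combos
        .build())

cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(labelCol="label"),
    numFolds=3,     # 6 combos x 3 folds = 18 fits, + 1 final refit
    parallelism=4,  # fit candidate models in parallel
)
cv_model = cv.fit(train_df)
```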
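A metrics sketch; `y_test`, `preds`, `probs`, `log_preds`, and `y_true` are assumed arrays from an earlier train/test split:

```python
import numpy as np
from sklearn.metrics import f1_score, mean_squared_error, roc_auc_score

# Classification metrics (probs = positive-class probabilities)
print("F1 :", f1_score(y_test, preds))
print("AUC:", roc_auc_score(y_test, probs))

# If the target was modeled as log1p(y), invert before evaluating
y_pred = np.expm1(log_preds)  # undo np.log1p
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
```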
Section 4: Model Deployment
- Compare batch vs. real-time vs. streaming inference methods
- Deploy models using Databricks Model Serving endpoints
- Apply batch inference from saved models via pandas (`mlflow.pyfunc`) or a Spark UDF (see the inference sketch after this list)
- Streaming inference using Structured Streaming or Delta Live Tables with model UDFs
- Deploy real-time inference with low-latency Model Serving endpoints (see the endpoint query sketch after this list)
- Use routing/traffic-splitting logic to send traffic to different served models for testing (see the traffic-split sketch after this list)
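
An inference sketch for the batch and streaming bullets; the model name/alias, table names, and feature columns are assumptions:

```python
import mlflow

mlflow.set_registry_uri("databricks-uc")

# Load a registered model as a Spark UDF
predict_udf = mlflow.pyfunc.spark_udf(spark, "models:/main.ml.churn_model@champion")
feature_cols = ["age", "income", "state_ohe"]

# Batch inference over a Delta table
batch_df = spark.table("main.ml.scoring_input")
scored = batch_df.withColumn("prediction", predict_udf(*feature_cols))

# Streaming inference: the same UDF applied to a streaming DataFrame
# (follow with writeStream, or wrap in a DLT table definition)
stream_df = spark.readStream.table("main.ml.scoring_input")
stream_scored = stream_df.withColumn("prediction", predict_udf(*feature_cols))
```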
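A sketch of querying a Model Serving endpoint over REST; the workspace URL, token, endpoint name, and payload fields are placeholders/assumptions:

```python
import requests

host = "https://<workspace-url>"   # placeholder
token = "<personal-access-token>"  # placeholder; prefer a secret scope

resp = requests.post(
    f"{host}/serving-endpoints/churn-endpoint/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json={"dataframe_records": [{"age": 42, "income": 55000.0}]},
)
print(resp.json())
```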
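A traffic-splitting sketch, reusing `requests`, `host`, and `token` from the sketch above: update an endpoint's config to route 90/10 between two served versions. Names, versions, and percentages are assumptions; the payload shape follows the serving-endpoints REST API:

```python
endpoint_config = {
    "served_entities": [
        {"name": "champion", "entity_name": "main.ml.churn_model",
         "entity_version": "1", "workload_size": "Small", "scale_to_zero_enabled": True},
        {"name": "challenger", "entity_name": "main.ml.churn_model",
         "entity_version": "2", "workload_size": "Small", "scale_to_zero_enabled": True},
    ],
    "traffic_config": {"routes": [
        {"served_model_name": "champion", "traffic_percentage": 90},
        {"served_model_name": "challenger", "traffic_percentage": 10},
    ]},
}

# Apply the new config to an existing endpoint
resp = requests.put(
    f"{host}/api/2.0/serving-endpoints/churn-endpoint/config",
    headers={"Authorization": f"Bearer {token}"},
    json=endpoint_config,
)
```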