Databricks ML Associate Exam - Crash Study Material

Section 1: Databricks Machine Learning


- Understand MLOps best practices in Databricks
- Advantages of ML runtimes (e.g., optimized libraries, GPU support)
- Understand AutoML's role in automating feature/model selection
- Advantages of AutoML: saves time, improves reproducibility
- Difference between workspace-level feature store tables and Unity Catalog feature tables (governed at the account/metastore level)
- Steps to create and register a feature store table in Unity Catalog (see the feature store sketch after this list)
- Write data to feature store tables using Python APIs
- Train models using features directly from feature store tables
- Use feature tables for scoring and inference
- Differentiate between online (low latency) and offline (batch) feature tables
- Use the MLflow Client API to identify the best run based on logged metrics (see the MLflow tracking sketch after this list)
- Manually log metrics, parameters, models, artifacts using MLflow tracking
- Explore the MLflow UI for experiment insights
- Register models via MLflow Client API into Unity Catalog
- Understand benefits of using Unity Catalog model registry
- Model vs. code promotion in production scenarios
- Set or remove tags for a model for metadata tracking
- Promote a challenger model to champion using aliases (see the registry sketch after this list)
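
A minimal sketch of the feature table workflow, assuming Unity Catalog feature tables via the `databricks-feature-engineering` package; all table, column, and DataFrame names are hypothetical:

```python
from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup

fe = FeatureEngineeringClient()

# Create and register a feature table in Unity Catalog (hypothetical names)
fe.create_table(
    name="main.ml.customer_features",
    primary_keys=["customer_id"],
    df=features_df,  # Spark DataFrame of computed features
    description="Customer-level aggregate features",
)

# Write (upsert) new feature values via the Python API
fe.write_table(name="main.ml.customer_features", df=new_features_df, mode="merge")

# Train directly from the feature table: features are looked up by key
training_set = fe.create_training_set(
    df=labels_df,  # contains only the lookup key and the label
    feature_lookups=[FeatureLookup(table_name="main.ml.customer_features",
                                   lookup_key="customer_id")],
    label="churned",
)
train_df = training_set.load_df()

# Batch scoring: the feature values are joined in automatically
# predictions = fe.score_batch(model_uri="models:/main.ml.churn_model/1", df=ids_df)
```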
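
A sketch of manual MLflow logging and of using the client API to pick the best run by a logged metric; the experiment path, metric name, and fitted `model` are assumptions:

```python
import mlflow
from mlflow.tracking import MlflowClient

# Manually log a parameter, a metric, a model, and an artifact in one run
with mlflow.start_run():
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("f1", 0.87)
    mlflow.sklearn.log_model(model, "model")      # `model` is a fitted sklearn estimator
    mlflow.log_artifact("confusion_matrix.png")   # any local file

# Identify the best run in an experiment by ordering on a metric
client = MlflowClient()
experiment = client.get_experiment_by_name("/Users/me/churn-experiment")
best_run = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.f1 DESC"],
    max_results=1,
)[0]
print(best_run.info.run_id, best_run.data.metrics["f1"])
```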
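
Continuing from the previous sketch: registering that run's model into the Unity Catalog registry, tagging it, and promoting it via an alias; the three-level model name is hypothetical:

```python
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_registry_uri("databricks-uc")  # target the Unity Catalog model registry

# Register the model logged under the best run
mv = mlflow.register_model(f"runs:/{best_run.info.run_id}/model",
                           "main.ml.churn_model")

client = MlflowClient()

# Set or remove tags for metadata tracking
client.set_registered_model_tag("main.ml.churn_model", "validation_status", "approved")
client.delete_registered_model_tag("main.ml.churn_model", "validation_status")

# Promote the challenger version to champion using an alias
client.set_registered_model_alias("main.ml.churn_model", "champion", mv.version)
```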

Section 2: Data Processing


- Use `.summary()`/`.describe()` or `dbutils.data.summarize()` for Spark DataFrame statistics
- Outlier detection using standard deviation or IQR techniques (see the IQR sketch after this list)
- Visualize categorical (bar plots) and continuous (histograms, boxplots) data
- Compare categorical features (Chi-square), continuous (correlation, t-tests)
- When and how to use mean/median/mode for missing value imputation
- Apply imputation using pandas, sklearn, or Spark functions
- Perform one-hot encoding using sklearn's `OneHotEncoder` or Spark ML's `OneHotEncoder` (the replacement for `OneHotEncoderEstimator` since Spark 3.0; see the encoding sketch after this list)
- Understand when one-hot encoding is useful or not (e.g., high cardinality)
- When to apply log scaling (e.g., skewed features, exponential distributions)
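
A minimal sketch of IQR-based outlier flagging on a pandas column; the column name is an assumption:

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# mask = iqr_outliers(df["amount"]); df[~mask] keeps only the non-outlier rows
```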
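
A sketch combining median imputation and one-hot encoding in a Spark ML pipeline (since Spark 3.0 the encoder class is `OneHotEncoder`); column and DataFrame names are hypothetical:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Imputer, OneHotEncoder, StringIndexer

# Median imputation for numeric columns with missing values
imputer = Imputer(strategy="median",
                  inputCols=["age", "income"],
                  outputCols=["age_imp", "income_imp"])

# Index the string category, then one-hot encode the index
indexer = StringIndexer(inputCol="country", outputCol="country_idx",
                        handleInvalid="keep")
encoder = OneHotEncoder(inputCols=["country_idx"], outputCols=["country_ohe"])

pipeline = Pipeline(stages=[imputer, indexer, encoder])
prepared = pipeline.fit(raw_df).transform(raw_df)  # raw_df is a Spark DataFrame
```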

Section 3: Model Development


- Select algorithms based on task: classification, regression, clustering
- Handle data imbalance: SMOTE, class weights, resampling, threshold tuning
- Difference between estimators (fit) and transformers (transform) in Spark
- Build end-to-end training pipelines using `Pipeline()` API
- Use `fmin` from Hyperopt to tune models (Bayesian optimization via TPE; see the Hyperopt sketch after this list)
- Implement GridSearch, RandomSearch, and Hyperopt tuning techniques
- Use Spark parallelism to scale hyperparameter tuning jobs
- Cross-validation vs. train-validation split: pros/cons
- Implement `CrossValidator()` in Spark or `cross_val_score()` in sklearn (see the cross-validation sketch after this list)
- Understand the number of models trained with grid search + CV (size of the Cartesian parameter grid × number of folds)
- Classification metrics: F1, AUC, Precision, Recall, LogLoss
- Regression metrics: RMSE, MAE, R²
- Choose right metric for task (business goal focused)
- Exponentiate log-transformed predictions before interpretation or evaluation
- Understand bias-variance tradeoff and its effect on model complexity
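
A sketch of Bayesian tuning with Hyperopt's `fmin`, scaled out with `SparkTrials`; the objective, search space, and in-memory `X_train`/`y_train` are illustrative assumptions:

```python
from hyperopt import STATUS_OK, SparkTrials, fmin, hp, tpe
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(params):
    model = RandomForestClassifier(max_depth=int(params["max_depth"]),
                                   n_estimators=int(params["n_estimators"]))
    f1 = cross_val_score(model, X_train, y_train, scoring="f1", cv=3).mean()
    return {"loss": -f1, "status": STATUS_OK}  # fmin minimizes, so negate the score

space = {"max_depth": hp.quniform("max_depth", 3, 12, 1),
         "n_estimators": hp.quniform("n_estimators", 50, 300, 50)}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=32,
            trials=SparkTrials(parallelism=4))  # trials run in parallel on Spark workers
```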
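
A sketch of `CrossValidator()` over a `Pipeline()`: a 2 × 3 parameter grid with 3 folds fits 6 × 3 = 18 models before the final refit on the full training set; column names are hypothetical:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

assembler = VectorAssembler(inputCols=["age_imp", "income_imp"], outputCol="features")
lr = LogisticRegression(labelCol="churned")
pipeline = Pipeline(stages=[assembler, lr])

grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])              # 2 values
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])   # x 3 values = 6 combinations
        .build())

cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="churned"),
                    numFolds=3, parallelism=4)          # 6 combos x 3 folds = 18 fits
cv_model = cv.fit(train_df)
```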

Section 4: Model Deployment


- Compare batch vs. real-time vs. streaming inference methods
- Deploy models using Databricks Model Serving endpoints
- Use pandas or a Spark UDF to apply batch inference from saved models (see the batch-inference sketch after this list)
- Streaming inference using Delta Live Tables + UDFs
- Deploy real-time inference with low-latency endpoints
- Use routing/splitting logic to send traffic to different endpoints for testing
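
A sketch of batch inference that loads a registered model as a Spark UDF; the model URI, table names, and the ambient `spark` session are assumptions about the environment:

```python
import mlflow

model_uri = "models:/main.ml.churn_model@champion"  # resolve through the champion alias
predict = mlflow.pyfunc.spark_udf(spark, model_uri)

scoring_df = spark.read.table("main.ml.scoring_input")
scored = scoring_df.withColumn("prediction", predict(*scoring_df.columns))
scored.write.mode("overwrite").saveAsTable("main.ml.predictions")
```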
