Skip to content

dream-horizon-org/darwin

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

14 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Darwin ML Platform

🌐 Visit Darwin Platform

Darwin is an enterprise-grade, end-to-end machine learning platform designed for production-scale AI/ML workloads. It provides a unified ecosystem for the complete ML lifecycleβ€”from distributed compute and feature engineering to experiment tracking, model deployment, and real-time inference serving.


🎯 Why Darwin?

Darwin solves critical challenges in production ML infrastructure:

  • Unified Platform: Single platform for training, serving, and feature engineeringβ€”no context switching between disparate tools
  • Production-Grade Scalability: Built on Kubernetes and Ray for elastic, distributed compute at scale
  • Cost Optimization: Intelligent auto-scaling, spot instance support, and policy-based auto-termination
  • Developer Velocity: SDK-first design with CLI tools for rapid experimentation and deployment
  • Enterprise Ready: Multi-tenancy, RBAC, audit logging, and metadata lineage out of the box
  • Low-Latency Serving: Sub-10ms feature retrieval and optimized model inference pipelines

πŸ—οΈ Architecture Overview

graph TB
    subgraph "User Interface Layer"
        UI[Workspace UI/Jupyter]
        CLI[Hermes CLI]
        SDK[Python SDKs]
    end

    subgraph "Orchestration Layer"
        Workspace[Workspace Service<br/>Projects & Codespaces]
        MLflow[MLflow<br/>Experiment Tracking]
        Chronos[Chronos<br/>Event & Metadata]
    end

    subgraph "Compute Layer"
        Compute[Darwin Compute<br/>Cluster Management]
        DCM[Darwin Cluster Manager<br/>K8s Orchestration]
        Ray[Ray Clusters<br/>Distributed Execution]
    end

    subgraph "Data Layer"
        FS[Feature Store<br/>Online/Offline Features]
        Catalog[Darwin Catalog<br/>Asset Discovery]
    end

    subgraph "Serving Layer"
        Serve[ML Serve<br/>Model Deployment]
        Builder[Artifact Builder<br/>Image Building]
    end

    subgraph "Infrastructure"
        MySQL[(MySQL<br/>Metadata)]
        Cassandra[(Cassandra<br/>Features)]
        OpenSearch[(OpenSearch<br/>Events)]
        S3[(S3<br/>Artifacts)]
        Kafka[(Kafka<br/>Streaming)]
        K8s[Kubernetes/EKS]
    end

    UI --> Workspace
    CLI --> Serve
    CLI --> Compute
    SDK --> Compute
    SDK --> FS
    SDK --> MLflow

    Workspace --> Compute
    Workspace --> Chronos
    MLflow --> S3
    MLflow --> MySQL

    Compute --> DCM
    DCM --> Ray
    Ray --> K8s

    Serve --> Builder
    Serve --> DCM
    Builder --> K8s

    FS --> Cassandra
    FS --> Kafka
    FS --> MySQL
    Catalog --> MySQL
    Catalog --> OpenSearch
    Chronos --> OpenSearch
    Chronos --> Kafka

    style Darwin fill:#e1f5ff
    style Compute fill:#ffe1e1
    style FS fill:#e1ffe1
    style Serve fill:#fff5e1
Loading

πŸ“¦ Platform Components

1. Darwin Compute

Distributed compute orchestration for ML workloads

  • Ray Cluster Management: Create, scale, and manage Ray 2.37.0 clusters on Kubernetes
  • Multi-Runtime Support: Pre-configured runtimes (Ray + Python 3.10 + Spark 3.5.1)
  • Resource Optimization:
    • Spot/on-demand instance mixing
    • Auto-termination policies (idle detection, CPU thresholds)
    • Cost monitoring with Slack alerts
  • Package Management: Dynamic installation of PyPI, Maven, and workspace packages
  • Jupyter Integration: Managed Jupyter notebooks with direct cluster access
  • Job Scheduling: Ray job submission and monitoring

SDK: darwin-compute

from darwin_compute import ComputeCluster

cluster = ComputeCluster(env="prod")
result = cluster.create_with_yaml("cluster-config.yaml")
cluster.start(cluster_id=result['cluster_id'])

2. Darwin Cluster Manager (DCM)

Low-level Kubernetes orchestration service (Go)

  • Helm-based Ray cluster deployment via KubeRay operator
  • Dynamic values.yaml generation for cluster configurations
  • Remote command execution on cluster pods
  • Jupyter pod lifecycle management
  • FastAPI serve deployment orchestration

3. Feature Store

High-performance feature serving and engineering platform

Components:

  • darwin-ofs-v2 (App): Low-latency online feature serving (<10ms)
  • darwin-ofs-v2-admin: Feature group management, schema versioning
  • darwin-ofs-v2-consumer: Kafka-based feature materialization
  • darwin-ofs-v2-populator: Bulk ingestion from Parquet/Delta tables

Capabilities:

  • Real-time feature retrieval with Cassandra backend
  • Point-in-time correctness for training datasets
  • Feature validation and schema evolution
  • Spark integration for batch feature pipelines
  • Multi-tenant feature isolation

Storage Architecture:

  • Cassandra: High-throughput feature values
  • MySQL: Feature metadata and schemas
  • Kafka: Real-time feature streaming

SDK: darwin_fs

from darwin_fs import FeatureStoreClient

fs = FeatureStoreClient()
features = fs.fetch_features(
    feature_group="user_engagement",
    keys=[123, 456]
)

4. ML Serve

Production model deployment and serving platform

  • Serve Lifecycle: Create, configure, deploy, monitor, undeploy
  • Multi-Environment: Dev, staging, UAT, production with environment-specific configs
  • Backend Support:
    • FastAPI serves for REST inference
    • Ray Serve for distributed model serving (experimental)
  • Artifact Management: Git-based Docker image builds
  • Auto-Scaling: HPA-based horizontal pod autoscaling
  • Feature Store Integration: Native integration for online feature retrieval

Deployment Workflow:

# Complete model deployment via Hermes CLI

# 1. Configure authentication
export HERMES_USER_TOKEN=admin-token-default-change-in-production
hermes configure

# 2. Create environment (one-time setup)
hermes create-environment --name local --domain-suffix .local --cluster-name kind

# 3. Create serve definition
hermes create-serve --name my-model --type api --space serve --description "My ML model"

# 4. Deploy model
hermes deploy-model \
  --serve-name my-model \
  --artifact-version v1 \
  --model-uri mlflow-artifacts:/1/abc123/artifacts/model \
  --cores 4 \
  --memory 8 \
  --node-capacity spot \
  --min-replicas 2 \
  --max-replicas 10

πŸ“– For detailed Hermes CLI commands and options, see hermes-cli/CLI.md


5. Artifact Builder

Docker image building service for ML models

  • Build images from GitHub repositories with custom Dockerfiles
  • Queue-based build system with status tracking
  • Container registry integration (ECR, GCR)
  • Integration with ML Serve deployment pipeline

6. Darwin MLflow

Experiment tracking and model registry

  • MLflow 2.12.2 with custom FastAPI authentication layer
  • Experiment and run tracking (parameters, metrics, artifacts)
  • Model registry with versioning
  • User-based experiment permissions
  • S3/LocalStack artifact storage
  • Custom UI with enhanced authorization

SDK: darwin_mlflow (wraps MLflow client)

import darwin_mlflow as mlflow

mlflow.log_params({"lr": 0.001, "epochs": 100})
mlflow.log_metric("accuracy", 0.95)
mlflow.sklearn.log_model(model, "model")

7. Chronos (Event Processing & Metadata)

Event ingestion, transformation, and lineage tracking

  • Event Sources: REST API for raw events from services
  • Transformers: Python/JSONPath-based event processing
  • Entity Extraction: Automatic entity creation (clusters, users, jobs)
  • Relationship Mapping: Build lineage graphs between entities
  • Queue Processing: Async consumption from Kafka/SQS

Use Cases:

  • Cluster lifecycle tracking
  • Workflow execution lineage
  • Audit logs and compliance
  • Metadata dependencies (data β†’ model β†’ deployment)

8. Darwin Workspace

Project and development environment management

  • Project Management: Multi-user project organization
  • Codespace Lifecycle: Create and manage Jupyter/VSCode environments
  • Compute Integration: Attach Ray clusters to development environments
  • Shared Storage: FSx/EFS integration for persistent workspaces
  • Event Publishing: Workspace state changes tracked in Chronos

9. Darwin Catalog

Data asset discovery and governance

  • Asset Management: Register datasets, tables, models
  • Schema Tracking: Schema evolution and versioning
  • Lineage: OpenLineage-based data lineage tracking
  • Search: Full-text search across data assets
  • Metadata: Tags, descriptions, ownership, quality metrics
  • Integration: Spark and Airflow job lineage capture

10. Hermes CLI

Command-line tool for streamlined ML operations

  • Environment configuration and management
  • Model serving project scaffolding (FastAPI templates)
  • One-click model deployment
  • Artifact build and deploy orchestration
  • Configuration management (.hermes folder)

Installation & Setup:

# Included with Darwin Distribution
source hermes-cli/.venv/bin/activate

# Configure authentication
export HERMES_USER_TOKEN=admin-token-default-change-in-production
hermes configure

# Create environment (one-time setup per environment)
hermes create-environment \
  --name local \
  --domain-suffix .local \
  --cluster-name kind \
  --namespace serve

πŸ“– For complete Hermes CLI documentation, see hermes-cli/CLI.md


πŸ‘₯ User Personas

Data Scientists

Use Darwin for: Experimentation, training, model development

  • Launch Ray clusters via SDK for distributed training
  • Track experiments with MLflow
  • Access features from Feature Store
  • Deploy models with one-click Hermes CLI commands

ML Engineers

Use Darwin for: Production model deployment and monitoring

  • Configure multi-environment serves (dev/staging/prod)
  • Build and deploy artifacts from GitHub
  • Manage auto-scaling policies
  • Monitor model performance and resource usage

Data Engineers

Use Darwin for: Feature pipelines and data infrastructure

  • Create and manage feature groups in Feature Store
  • Build Spark-based feature engineering pipelines
  • Track data lineage in Catalog
  • Publish features to Kafka for real-time materialization

Platform Engineers

Use Darwin for: Infrastructure management and operations

  • Deploy and configure Darwin platform via Helm
  • Manage Kubernetes resources and policies
  • Monitor costs and resource utilization
  • Configure multi-tenancy and RBAC

πŸš€ Getting Started

Prerequisites

  • Kubernetes cluster (Kind for local, EKS for production)
  • Helm 3.8+
  • kubectl
  • Docker
  • Python 3.9.7+

Quick Start: Local Deployment

# 1. Initialize configuration (select components to enable)
./init.sh

# 2. Build platform images and setup Kind cluster
./setup.sh

# 3. Deploy Darwin platform to Kubernetes
./start.sh

Choosing Your Components

During init.sh, you'll select which Darwin components to enable. Here's how to decide based on your workflow:

If you want to... Enable
Run distributed data processing jobs or spin up short-lived compute clusters Compute
Work interactively with persistent code and notebooks attached to scalable clusters Workspace (includes Compute)
Store, version, and serve features for ML training and inference Feature Store
Track experiments, log metrics, and manage model versions MLflow
Deploy trained models as real-time inference endpoints Serve (includes Artifact Builder)
Discover and track lineage across datasets, models, and pipelines Catalog
Capture platform events and build metadata graphs Chronos

Tip: Dependencies are resolved automatically. For example, enabling Workspace will also enable Compute, and enabling Serve will include Artifact Builder and MLflow.

Access Services:

  • Compute: http://localhost/compute/*
  • Feature Store: http://localhost/feature-store/*
  • MLflow UI: http://localhost/mlflow/*
  • Chronos API: http://localhost/chronos/*
  • Catalog API: http://localhost/darwin-catalog/*
  • Workspace: http://localhost/workspace/*

Quick Start: Create and Use a Ray Cluster

# Create a cluster via REST API
curl --location 'http://localhost/compute/cluster' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "cluster_name": "my-first-cluster",
    "tags": ["demo"],
    "runtime": "0.0",
    "inactive_time": 30,
    "start_cluster": true,
    "head_node_config": {
        "cores": 4,
        "memory": 8,
        "node_capacity_type": "ondemand"
    },
    "worker_node_configs": [
        {
            "cores_per_pods": 2,
            "memory_per_pods": 4,
            "min_pods": 1,
            "max_pods": 2,
            "disk_setting": null,
            "node_capacity_type": "ondemand"
        }
    ],
    "user": "[email protected]"
}'

# Wait for Cluster to become Active
curl http://localhost/compute/cluster/{cluster_id}/metadata
# Wait until the status shows active.

# Response will include cluster_id
# Get Cluster Dashboards link via below API using cluster_id
curl --location 'http://localhost/compute/cluster/{cluster_id}/dashboards'
# Access Jupyter notebook at the returned jupyter_lab_url
# Monitor Ray cluster at the ray_dashboard_url

# Stop the cluster when done
curl --location --request POST 'http://localhost/compute/cluster/stop-cluster/{cluster_id}' \
  --header 'msd-user: {"email": "[email protected]"}'

Understanding Runtime Parameter:

The runtime field specifies which pre-built Docker image to use for your Ray cluster. Darwin supports multiple runtimes with different Python versions and pre-installed libraries:

  • "0.0": Default runtime with Ray 2.37.0, Python 3.10, Spark 3.5.1, and darwin-sdk
  • Custom runtimes can be registered with specific library combinations

To check available runtimes:

curl http://localhost/compute/get-runtimes | python3 -m json.tool

Or use the Python SDK:

# Install SDK
pip install -e darwin-compute/sdk

# Create a cluster
from darwin_compute import ComputeCluster

cluster = ComputeCluster(env="darwin-local")
response = cluster.create_with_yaml("examples/cluster-config.yaml")
cluster_id = response['cluster_id']

# Check and wait until cluster status becomes active
cluster.get_info(cluster_id)

# Stop when done
cluster.stop(cluster_id)

Quick Start: Deploy a Model

# Activate Hermes CLI
source hermes-cli/.venv/bin/activate

# 1. Configure Hermes CLI with authentication token
export HERMES_USER_TOKEN=admin-token-default-change-in-production
hermes configure

# 2. Create environment
hermes create-environment --name local --domain-suffix .local --cluster-name kind

# 3. Create serve
hermes create-serve \
  --name iris-classifier \
  --type api \
  --space serve \
  --description "Iris classification model"

# 4. Deploy model (one-click)
hermes deploy-model \
  --serve-name iris-classifier \
  --artifact-version v1 \
  --model-uri mlflow-artifacts:/1/2b2b1b5727a14c5ca81b44e899979745/artifacts/model \
  --cores 2 \
  --memory 4 \
  --node-capacity spot \
  --min-replicas 1 \
  --max-replicas 2

# 5. Make predictions
curl -X POST http://localhost/iris-classifier/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [[5.1, 3.5, 1.4, 0.2]]}'

πŸ“– For more deployment options, see hermes-cli/CLI.md

Quick Start: Use Feature Store

# Install SDK
pip install -e feature-store/python/darwin_fs

# Fetch features
from darwin_fs import FeatureStoreClient

fs = FeatureStoreClient(env="local")
features = fs.fetch_features(
    feature_group_name="user_features",
    feature_columns=["age", "tenure", "activity_score"],
    primary_key_names=["user_id"],
    primary_key_values=[[123], [456], [789]]
)

🎯 Quick Start: Submit a Spark Job Using Darwin SDK

Darwin SDK provides seamless integration with Apache Spark on Ray clusters. Here's how to run distributed Spark workloads using Darwin as your Spark session provider:

Step 1: Create a Ray Cluster

curl --location 'http://localhost/compute/cluster' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "cluster_name": "spark-demo-cluster",
    "tags": ["spark", "demo"],
    "runtime": "0.0",
    "inactive_time": 60,
    "start_cluster": true,
    "head_node_config": {
        "cores": 4,
        "memory": 8,
        "node_capacity_type": "ondemand"
    },
    "worker_node_configs": [{
        "cores_per_pods": 2,
        "memory_per_pods": 4,
        "min_pods": 1,
        "max_pods": 2,
        "disk_setting": null,
        "node_capacity_type": "ondemand"
    }],
    "user": "[email protected]"
}'

Save the cluster_id from the response.

Step 2: Wait for Cluster to be Ready

# Check cluster status
curl http://localhost/compute/cluster/{cluster_id}/metadata

# Wait until status shows "active"
# Then verify pods are running
kubectl get pods -n ray -l ray.io/cluster={cluster_id}-kuberay

Step 3: Create Your Spark Job

Create a file my_spark_job.py:

#!/usr/bin/env python3
"""
Darwin SDK Spark Job Example
"""
import os
import ray

# Initialize Ray (connects to running Ray cluster)
ray.init()

# Set environment variables
os.environ["ENV"] = "LOCAL"
os.environ["CLUSTER_ID"] = os.getenv("CLUSTER_ID", "your-cluster-id")
os.environ["DARWIN_COMPUTE_URL"] = "http://darwin-compute.darwin.svc.cluster.local:8000"

print("=" * 60)
print("Darwin SDK Spark Job")
print(f"Cluster ID: {os.environ['CLUSTER_ID']}")
print("=" * 60)

# Initialize Spark using darwin-sdk
from darwin import init_spark_with_configs

spark_configs = {
    "spark.sql.execution.arrow.pyspark.enabled": "true",
    "spark.sql.session.timeZone": "UTC",
    "spark.sql.shuffle.partitions": "10",
    "spark.default.parallelism": "10",
    "spark.driver.memory": "1g",
    "spark.executor.memory": "1g",
}

spark = init_spark_with_configs(spark_configs=spark_configs)
print(f"βœ“ Spark initialized (version: {spark.version})")

# Create and process DataFrame
df = spark.createDataFrame([
    (1, "Alice", 100),
    (2, "Bob", 200),
    (3, "Charlie", 300),
], ["id", "name", "score"])

print("\nDataFrame Contents:")
df.show()

print(f"\nTotal records: {df.count()}")
print(f"Average score: {df.agg({'score': 'avg'}).collect()[0][0]}")

# Stop Spark cleanly
from darwin import stop_spark
stop_spark()

print("\nβœ“ Job completed successfully!")

Step 4: Submit the Job

Option A: Using submit_spark_job.sh Script

cd darwin-sdk/darwin
./submit_spark_job.sh \
  --cluster-name {cluster_id} \
  --namespace ray \
  --job-file /path/to/my_spark_job.py \
  --wait

Option B: Using Ray Jobs API

# Port-forward to Ray dashboard
kubectl port-forward -n ray svc/{cluster_id}-kuberay-head-svc 8265:8265 &

# Submit job
curl -X POST "http://localhost:8265/api/jobs/" \
  -H "Content-Type: application/json" \
  -d '{
    "entrypoint": "python my_spark_job.py",
    "runtime_env": {
      "working_dir": "./",
      "env_vars": {
        "CLUSTER_ID": "'{cluster_id}'",
        "ENV": "LOCAL"
      }
    },
    "metadata": {
      "name": "darwin-spark-demo"
    }
  }'

Option C: Using Ray Python Client

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://localhost:8265")

job_id = client.submit_job(
    entrypoint="python my_spark_job.py",
    runtime_env={
        "working_dir": "./",
        "env_vars": {
            "CLUSTER_ID": "{cluster_id}",
            "ENV": "LOCAL"
        }
    }
)

print(f"Submitted job: {job_id}")

# Wait for completion
client.wait_until_status(job_id, "SUCCEEDED")
print(client.get_job_logs(job_id))

Step 5: Monitor Job Execution

# Check job status
SUBMISSION_ID="raysubmit_xxxxxxxx"
curl "http://localhost:8265/api/jobs/${SUBMISSION_ID}"

# View job logs
curl "http://localhost:8265/api/jobs/${SUBMISSION_ID}/logs"

# Or use Ray Dashboard
open http://localhost:8265

Darwin SDK Spark Functions

Function Description
init_spark_with_configs(spark_configs) Initialize Spark with custom configurations
start_spark(spark_conf) Start Spark with default Glue catalog configs
get_raydp_spark_session() Get existing Spark session
stop_spark() Stop Spark session cleanly

Troubleshooting

Issue: "Runtime given is incorrect"

# Check available runtimes
curl http://localhost/compute/get-runtimes

Issue: Ray job stuck in PENDING

# Check Ray head pod
kubectl describe pod {cluster_id}-kuberay-head-xxx -n ray

Issue: Connection refused when submitting job

# Restart port-forward
pkill -f "port-forward.*8265"
kubectl port-forward -n ray svc/{cluster_id}-kuberay-head-svc 8265:8265 &

Issue: Cluster not starting due to long init script

If your init_script in the cluster configuration is too long, the cluster may fail to start. This happens because init scripts are executed during pod startup and have timeout limitations.

Solutions:

  • Use the Library Installation API to install packages instead of init scripts
  • Create a custom runtime with your dependencies pre-installed
  • Split long scripts into smaller, essential commands

πŸ§ͺ Creating Your First Project

This guide walks you through your first end-to-end experience on Darwin β€” from compute creation to deployment.

πŸ”§ 1) Create Compute

Create a Ray cluster for your ML workload:

curl --location 'http://localhost/compute/cluster' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "cluster_name": "housing-project",
    "tags": ["tutorial", "housing-prices"],
    "runtime": "0.0",
    "inactive_time": 60,
    "head_node_config": {
        "cores": 4,
        "memory": 8
    },
    "worker_node_configs": [
        {
            "cores": 2,
            "memory": 4,
            "min_pods": 1,
            "max_pods": 2
        }
    ],
    "user": "[email protected]"
}'

Save the cluster_id from the response - you'll need it for the next steps.

πŸ“Š 2) Check Status

Check your cluster status:

curl http://localhost/compute/cluster/{cluster_id}/metadata

Wait until the status shows active.

πŸ““ 3) Open Jupyter Notebook

Once the cluster is ready, access the Jupyter notebook at:

http://localhost/kind-0/{cluster_id}-jupyter

Open this URL in your browser to start working in the workspace.

🏑 4) Copy and Run Example: housing-prices

In the Jupyter notebook, copy the example project: /examples/housing-prices/ in Jupyter notebook. The model will be logged automatically to MLflow.

🏷️ 5) Check Your Model in the Registry

Verify your trained model in the Darwin MLflow UI:

http://localhost/mlflow-app/experiments

Navigate to your experiment to see the registered model with metrics and parameters.

πŸš€ 6) Deploy with Hermes CLI

Deploy your trained model (replace <experiment_id> and <run_id> with values from MLflow UI):

πŸ“– Sample training script for house price prediction: examples/house-price-prediction/train_house_pricing_model.ipynb

# Activate Hermes CLI
source hermes-cli/.venv/bin/activate

# 1. Configure Hermes CLI with authentication token (one-time)
export HERMES_USER_TOKEN=admin-token-default-change-in-production
hermes configure

# 2. Create environment
hermes create-environment --name local --domain-suffix .local --cluster-name kind

# 3. Create serve
hermes create-serve \
  --name housing-model \
  --type api \
  --space serve \
  --description "House Price Prediction model"

# 4. Deploy model (one-click)
hermes deploy-model \
  --serve-name housing-model \
  --artifact-version v1 \
  --model-uri mlflow-artifacts:/1/<experiment_id>/<run_id>/artifacts/model \
  --cores 2 \
  --memory 4 \
  --node-capacity spot \
  --min-replicas 1 \
  --max-replicas 2

🌐 7) Test Your Endpoint

Test your deployed model:

curl -X POST http://localhost/housing-model/predict \
  -H "Content-Type: application/json" \
  -d '{
    "features": {
      "MedInc": 3.5214,
      "HouseAge": 15.0,
      "AveRooms": 6.575757575757576,
      "AveBedrms": 1.0196969696969697,
      "Population": 1447.0,
      "AveOccup": 3.0144927536231883,
      "Latitude": 37.63,
      "Longitude": -122.43
    }
  }'

Once deployed, your model is accessible as a real-time inference API.


🌸 Alternative Example: Iris Classification

You can also try the Iris classification model as an alternative example:

πŸ“– Sample training script for iris classification: examples/iris-classification/train_iris_model.ipynb

Deploy the Iris model:

# Activate Hermes CLI
source hermes-cli/.venv/bin/activate

# 1. Configure Hermes CLI with authentication token (one-time)
export HERMES_USER_TOKEN=admin-token-default-change-in-production
hermes configure

# 2. Create environment
hermes create-environment --name local --domain-suffix .local --cluster-name kind

# 3. Create serve
hermes create-serve \
  --name iris-classifier \
  --type api \
  --space serve \
  --description "Iris Species Classification model"

# 4. Deploy model (one-click)
hermes deploy-model \
  --serve-name iris-classifier \
  --model-uri mlflow-artifacts:/<experiment_id>/<run_id>/artifacts/model \
  --cores 2 \
  --memory 4 \
  --node-capacity spot \
  --min-replicas 1 \
  --max-replicas 2

Test the Iris model:

curl -X POST http://localhost/iris-classifier/predict \
  -H "Content-Type: application/json" \
  -d '{
    "features": {
      "sepal_length": 5.1,
      "sepal_width": 3.5,
      "petal_length": 1.4,
      "petal_width": 0.2
    }
  }'

πŸ“– For detailed deployment commands, see hermes-cli/CLI.md


πŸ§ͺ Example Workflows

End-to-End ML Workflow

sequenceDiagram
    participant DS as Data Scientist
    participant Compute as Darwin Compute
    participant MLflow as MLflow
    participant FS as Feature Store
    participant Serve as ML Serve

    DS->>Compute: Create Ray cluster
    Compute-->>DS: Cluster ID + Jupyter link
    DS->>FS: Fetch training features
    FS-->>DS: Feature dataset
    DS->>Compute: Train model (Ray/Spark)
    DS->>MLflow: Log experiment + model
    DS->>Serve: Deploy model (Hermes CLI)
    Serve->>MLflow: Fetch model artifact
    Serve->>FS: Configure feature retrieval
    Serve-->>DS: Inference endpoint
    DS->>Compute: Stop cluster
Loading

Feature Engineering Pipeline

graph LR
    A[Raw Data<br/>S3/Delta Lake] --> B[Spark Pipeline<br/>Feature Transform]
    B --> C[Kafka Topic<br/>feature-updates]
    C --> D[Feature Store<br/>Consumer]
    D --> E[Cassandra<br/>Materialized Features]
    E --> F[Online Serving<br/>< 10ms latency]
    B --> G[Catalog<br/>Lineage Tracking]
Loading

πŸ“Š Technology Stack

Layer Technologies
Languages Python 3.9.7, Java 11, Go 1.18
Compute Ray 2.37.0, Apache Spark 3.5.1
Web Frameworks FastAPI, Spring Boot, Vert.x
Orchestration Kubernetes (EKS/Kind), Helm 3, KubeRay Operator v1.1.0
Databases MySQL 8.0, Cassandra 5.0, OpenSearch 2.11
Streaming Apache Kafka 7.4.0
Storage S3 (AWS/LocalStack), FSx, EFS
Experiment Tracking MLflow 2.12.2
Monitoring Prometheus, Grafana, Ray Dashboard
Container Registry ECR, GCR, Local Docker Registry

πŸ”§ Configuration

Darwin uses a declarative configuration approach:

Service Selection (init.sh)

Interactive wizard to select platform components:

./init.sh
# Prompts for enabling:
# - Applications (Compute, Feature Store, MLflow, etc.)
# - Datastores (MySQL, Cassandra, Kafka, etc.)
# - Ray images and Serve runtimes

Generates .setup/enabled-services.yaml with user selections.

Environment Variables

Key configuration via config.env (auto-generated):

KUBECONFIG=./kind/config/kindkubeconfig.yaml
DOCKER_REGISTRY=127.0.0.1:32768

Helm Values

Customize deployments via helm/darwin/values.yaml:

global:
  namespace: darwin
  
services:
  compute:
    enabled: true
    replicas: 2
  
datastores:
  mysql:
    enabled: true
  cassandra:
    enabled: true

🏒 Deployment Patterns

Local Development (Kind)

  • Single-node Kubernetes cluster
  • Local Docker registry
  • HostPath-based persistent storage
  • Nginx Ingress at localhost/*

Production (EKS)

  • Multi-AZ high availability
  • Mixed spot/on-demand node groups
  • Auto-scaling with Karpenter
  • Network policies and security groups
  • S3-backed artifact storage
  • RDS for MySQL (optional)
  • Multi-tenant namespace isolation

πŸ“ˆ Observability

Metrics

  • Prometheus: Cluster resource utilization, service metrics
  • Grafana: Pre-configured dashboards for compute, serving, features
  • Ray Dashboard: Job execution, task profiling, resource usage

Logging

  • Centralized logging via stdout/stderr
  • Application logs in /app/logs
  • Structured logging with context

Events & Lineage

  • Chronos: Event-driven tracking of all platform operations
  • Catalog: Data lineage via OpenLineage
  • Elasticsearch-based search and analytics

Alerts

  • Slack integration for cost alerts
  • Long-running cluster notifications
  • Failed deployment alerts

πŸ“š SDKs & APIs

Available SDKs

  • darwin-compute: Ray cluster management
  • darwin_fs: Feature Store client
  • darwin_mlflow: MLflow wrapper with auth
  • darwin-workspace (internal): Workspace orchestration

REST APIs

All services expose FastAPI/Spring Boot REST APIs:

  • Feature Store: /feature-store/*, /feature-store-admin/*
  • Darwin Compute: /cluster/*, /jupyter/*
  • ML Serve: /api/v1/serve/*, /api/v1/artifact/*
  • Chronos: /api/v1/event/*, /api/v1/sources/*
  • Catalog: /v1/assets/*, /v1/lineage/*

API documentation available at <service-url>/docs (Swagger UI).


🧩 Extensibility

Available Runtimes

Darwin provides pre-built Ray runtimes for cluster creation. The Runtime Name is what you pass to the API when creating clusters (e.g., "runtime": "0.1").

Runtime Name Image Ray Version Python Class Type
0.0 ray:2.37.0 2.37.0 3.10 CPU Ray Only
0.1 ray:2.53.0 2.53.0 3.10 CPU Ray Only

Custom Runtimes

Add new Ray runtimes by creating Dockerfiles in darwin-compute/runtimes/:

# darwin-compute/runtimes/cpu/Ray2.37_Py3.11_CustomLibs/Dockerfile
FROM rayproject/ray:2.37.0-py311
RUN pip install jupyterlab==4.3.0
RUN pip install custom-library

Register in services.yaml:

ray-images:
  - image-name: ray:2.37.0-py311-custom
    dockerfile-path: darwin-compute/runtimes/cpu/Ray2.37_Py3.11_CustomLibs

Custom Transformers (Chronos)

Create Python transformers for event processing:

# Chronos transformer
def transform(event):
    return {
        "event_type": "cluster_created",
        "entities": [{"type": "cluster", "id": event["cluster_id"]}],
        "relationships": [{"from": user, "to": cluster, "type": "owns"}]
    }

🀝 Contributing

See CONTRIBUTING.md for development setup, coding standards, and contribution guidelines.


πŸ“„ License

[License information to be added]


πŸ“ž Support

For issues, questions, or feature requests, please open an issue in the repository or contact the platform team.


πŸ—ΊοΈ Project Structure

darwin-distro/
β”œβ”€β”€ darwin-compute/          # Ray cluster management service
β”œβ”€β”€ darwin-cluster-manager/  # Kubernetes orchestration (Go)
β”œβ”€β”€ feature-store/           # Feature Store (Java)
β”œβ”€β”€ mlflow/                  # MLflow experiment tracking
β”œβ”€β”€ ml-serve-app/            # Model serving platform
β”œβ”€β”€ artifact-builder/        # Docker image builder
β”œβ”€β”€ chronos/                 # Event processing & metadata
β”œβ”€β”€ workspace/               # Project & codespace management
β”œβ”€β”€ darwin-catalog/          # Data catalog & lineage
β”œβ”€β”€ hermes-cli/              # CLI tool for deployments
β”œβ”€β”€ helm/                    # Helm charts for deployment
β”‚   └── darwin/              # Umbrella chart
β”‚       β”œβ”€β”€ charts/
β”‚       β”‚   β”œβ”€β”€ datastores/  # MySQL, Cassandra, Kafka, etc.
β”‚       β”‚   └── services/    # Application services
β”œβ”€β”€ deployer/                # Build scripts and base images
β”œβ”€β”€ kind/                    # Local Kubernetes setup
β”œβ”€β”€ examples/                # Example notebooks and configs
β”œβ”€β”€ init.sh                  # Interactive configuration wizard
β”œβ”€β”€ setup.sh                 # Build and cluster setup
β”œβ”€β”€ start.sh                 # Deploy platform
└── services.yaml            # Service registry

Darwin ML Platform β€” Unified, scalable, production-ready machine learning infrastructure.

About

ML Platform at Dream11

Topics

Resources

License

Code of conduct

Contributing

Stars

Watchers

Forks

Packages

No packages published

Contributors 13