Darwin is an enterprise-grade, end-to-end machine learning platform designed for production-scale AI/ML workloads. It provides a unified ecosystem for the complete ML lifecycle: from distributed compute and feature engineering to experiment tracking, model deployment, and real-time inference serving.
Darwin solves critical challenges in production ML infrastructure:
- Unified Platform: Single platform for training, serving, and feature engineering, with no context switching between disparate tools
- Production-Grade Scalability: Built on Kubernetes and Ray for elastic, distributed compute at scale
- Cost Optimization: Intelligent auto-scaling, spot instance support, and policy-based auto-termination
- Developer Velocity: SDK-first design with CLI tools for rapid experimentation and deployment
- Enterprise Ready: Multi-tenancy, RBAC, audit logging, and metadata lineage out of the box
- Low-Latency Serving: Sub-10ms feature retrieval and optimized model inference pipelines
```mermaid
graph TB
    subgraph "User Interface Layer"
        UI[Workspace UI/Jupyter]
        CLI[Hermes CLI]
        SDK[Python SDKs]
    end
    subgraph "Orchestration Layer"
        Workspace[Workspace Service<br/>Projects & Codespaces]
        MLflow[MLflow<br/>Experiment Tracking]
        Chronos[Chronos<br/>Event & Metadata]
    end
    subgraph "Compute Layer"
        Compute[Darwin Compute<br/>Cluster Management]
        DCM[Darwin Cluster Manager<br/>K8s Orchestration]
        Ray[Ray Clusters<br/>Distributed Execution]
    end
    subgraph "Data Layer"
        FS[Feature Store<br/>Online/Offline Features]
        Catalog[Darwin Catalog<br/>Asset Discovery]
    end
    subgraph "Serving Layer"
        Serve[ML Serve<br/>Model Deployment]
        Builder[Artifact Builder<br/>Image Building]
    end
    subgraph "Infrastructure"
        MySQL[(MySQL<br/>Metadata)]
        Cassandra[(Cassandra<br/>Features)]
        OpenSearch[(OpenSearch<br/>Events)]
        S3[(S3<br/>Artifacts)]
        Kafka[(Kafka<br/>Streaming)]
        K8s[Kubernetes/EKS]
    end

    UI --> Workspace
    CLI --> Serve
    CLI --> Compute
    SDK --> Compute
    SDK --> FS
    SDK --> MLflow
    Workspace --> Compute
    Workspace --> Chronos
    MLflow --> S3
    MLflow --> MySQL
    Compute --> DCM
    DCM --> Ray
    Ray --> K8s
    Serve --> Builder
    Serve --> DCM
    Builder --> K8s
    FS --> Cassandra
    FS --> Kafka
    FS --> MySQL
    Catalog --> MySQL
    Catalog --> OpenSearch
    Chronos --> OpenSearch
    Chronos --> Kafka

    style Compute fill:#ffe1e1
    style FS fill:#e1ffe1
    style Serve fill:#fff5e1
```
**Darwin Compute**: Distributed compute orchestration for ML workloads
- Ray Cluster Management: Create, scale, and manage Ray 2.37.0 clusters on Kubernetes
- Multi-Runtime Support: Pre-configured runtimes (Ray + Python 3.10 + Spark 3.5.1)
- Resource Optimization (see the sketch after this list):
  - Spot/on-demand instance mixing
  - Auto-termination policies (idle detection, CPU thresholds)
  - Cost monitoring with Slack alerts
- Package Management: Dynamic installation of PyPI, Maven, and workspace packages
- Jupyter Integration: Managed Jupyter notebooks with direct cluster access
- Job Scheduling: Ray job submission and monitoring
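These options map directly onto the create-cluster REST payload shown in the Quick Start below. As a sketch, the same request issued from Python with `requests` (values are illustrative; the placeholder email and the unit of `inactive_time` are assumptions):

```python
import requests

# Sketch of the create-cluster call: spot workers, an on-demand head,
# and an idle auto-termination window. Field names mirror the REST
# payload used in the Quick Start section of this README.
payload = {
    "cluster_name": "demo-cluster",
    "runtime": "0.0",
    "inactive_time": 30,  # idle auto-termination window (minutes assumed)
    "start_cluster": True,
    "head_node_config": {"cores": 4, "memory": 8, "node_capacity_type": "ondemand"},
    "worker_node_configs": [{
        "cores_per_pods": 2,
        "memory_per_pods": 4,
        "min_pods": 1,
        "max_pods": 2,
        "node_capacity_type": "spot",  # spot/on-demand mixing per node group
    }],
    "user": "user@example.com",  # placeholder
}
resp = requests.post("http://localhost/compute/cluster", json=payload)
print(resp.json())  # the response includes cluster_id
```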
SDK: darwin-compute
```python
from darwin_compute import ComputeCluster

cluster = ComputeCluster(env="prod")
result = cluster.create_with_yaml("cluster-config.yaml")
cluster.start(cluster_id=result['cluster_id'])
```

**Darwin Cluster Manager**: Low-level Kubernetes orchestration service (Go)
- Helm-based Ray cluster deployment via KubeRay operator
- Dynamic values.yaml generation for cluster configurations
- Remote command execution on cluster pods
- Jupyter pod lifecycle management
- FastAPI serve deployment orchestration
**Feature Store**: High-performance feature serving and engineering platform
Components:
- darwin-ofs-v2 (App): Low-latency online feature serving (<10ms)
- darwin-ofs-v2-admin: Feature group management, schema versioning
- darwin-ofs-v2-consumer: Kafka-based feature materialization
- darwin-ofs-v2-populator: Bulk ingestion from Parquet/Delta tables
Capabilities:
- Real-time feature retrieval with Cassandra backend
- Point-in-time correctness for training datasets (see the sketch after this list)
- Feature validation and schema evolution
- Spark integration for batch feature pipelines
- Multi-tenant feature isolation
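To make point-in-time correctness concrete, here is a minimal PySpark sketch of the join logic involved; the Feature Store implements the equivalent against its offline store, so treat this as an illustration rather than the platform's actual code:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Label events: (user_id, label_ts, label)
labels = spark.createDataFrame(
    [(123, "2024-01-10", 1), (456, "2024-01-12", 0)],
    ["user_id", "label_ts", "label"],
)

# Feature snapshots observed over time: (user_id, feature_ts, activity_score)
features = spark.createDataFrame(
    [(123, "2024-01-01", 0.2), (123, "2024-01-09", 0.8),
     (123, "2024-01-11", 0.9), (456, "2024-01-05", 0.4)],
    ["user_id", "feature_ts", "activity_score"],
)

# Keep only feature rows observed at or before each label's timestamp,
# then take the most recent qualifying snapshot per label row.
joined = labels.join(features, "user_id").where(F.col("feature_ts") <= F.col("label_ts"))
w = Window.partitionBy("user_id", "label_ts").orderBy(F.col("feature_ts").desc())
training_set = (
    joined.withColumn("rn", F.row_number().over(w))
          .where(F.col("rn") == 1)
          .drop("rn")
)
training_set.show()
```

For user 123 with a label at 2024-01-10, the join picks the 2024-01-09 snapshot and ignores the 2024-01-11 one, which would leak future information into training.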
Storage Architecture:
- Cassandra: High-throughput feature values
- MySQL: Feature metadata and schemas
- Kafka: Real-time feature streaming
SDK: darwin_fs
```python
from darwin_fs import FeatureStoreClient

fs = FeatureStoreClient()
features = fs.fetch_features(
    feature_group="user_engagement",
    keys=[123, 456]
)
```

**ML Serve**: Production model deployment and serving platform
- Serve Lifecycle: Create, configure, deploy, monitor, undeploy
- Multi-Environment: Dev, staging, UAT, production with environment-specific configs
- Backend Support:
  - FastAPI serves for REST inference
  - Ray Serve for distributed model serving (experimental)
- Artifact Management: Git-based Docker image builds
- Auto-Scaling: HPA-based horizontal pod autoscaling
- Feature Store Integration: Native integration for online feature retrieval
Deployment Workflow:
```bash
# Complete model deployment via Hermes CLI
# 1. Configure authentication
export HERMES_USER_TOKEN=admin-token-default-change-in-production
hermes configure
# 2. Create environment (one-time setup)
hermes create-environment --name local --domain-suffix .local --cluster-name kind
# 3. Create serve definition
hermes create-serve --name my-model --type api --space serve --description "My ML model"
# 4. Deploy model
hermes deploy-model \
--serve-name my-model \
--artifact-version v1 \
--model-uri mlflow-artifacts:/1/abc123/artifacts/model \
--cores 4 \
--memory 8 \
--node-capacity spot \
--min-replicas 2 \
  --max-replicas 10
```

For detailed Hermes CLI commands and options, see hermes-cli/CLI.md.
**Artifact Builder**: Docker image building service for ML models
- Build images from GitHub repositories with custom Dockerfiles
- Queue-based build system with status tracking
- Container registry integration (ECR, GCR)
- Integration with ML Serve deployment pipeline
**MLflow**: Experiment tracking and model registry
- MLflow 2.12.2 with custom FastAPI authentication layer
- Experiment and run tracking (parameters, metrics, artifacts)
- Model registry with versioning
- User-based experiment permissions
- S3/LocalStack artifact storage
- Custom UI with enhanced authorization
SDK: darwin_mlflow (wraps MLflow client)
```python
import darwin_mlflow as mlflow

mlflow.log_params({"lr": 0.001, "epochs": 100})
mlflow.log_metric("accuracy", 0.95)
mlflow.sklearn.log_model(model, "model")
```
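Since `darwin_mlflow` wraps the standard MLflow client, loading a logged model back should follow stock MLflow usage. A hedged sketch (fill in `<run_id>` from the tracking UI):

```python
import darwin_mlflow as mlflow

# Load the sklearn model logged above and score it locally.
model = mlflow.sklearn.load_model("runs:/<run_id>/model")
print(model.predict([[5.1, 3.5, 1.4, 0.2]]))
```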
**Chronos**: Event ingestion, transformation, and lineage tracking

- Event Sources: REST API for raw events from services (see the sketch below)
- Transformers: Python/JSONPath-based event processing
- Entity Extraction: Automatic entity creation (clusters, users, jobs)
- Relationship Mapping: Build lineage graphs between entities
- Queue Processing: Async consumption from Kafka/SQS
Use Cases:
- Cluster lifecycle tracking
- Workflow execution lineage
- Audit logs and compliance
- Metadata dependencies (data → model → deployment)
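As an illustration of the event-source API, a service might push a raw event like the sketch below. The `/chronos/api/v1/event` prefix combines the gateway route and API path listed elsewhere in this README; the exact path and payload schema are assumptions:

```python
import requests

# Speculative sketch: emitting a raw platform event to Chronos.
event = {
    "source": "darwin-compute",  # emitting service (assumed field)
    "type": "cluster_created",   # event type (assumed field)
    "payload": {"cluster_id": "abc123", "user": "user@example.com"},
}
resp = requests.post("http://localhost/chronos/api/v1/event", json=event)
print(resp.status_code)
```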
**Workspace**: Project and development environment management
- Project Management: Multi-user project organization
- Codespace Lifecycle: Create and manage Jupyter/VSCode environments
- Compute Integration: Attach Ray clusters to development environments
- Shared Storage: FSx/EFS integration for persistent workspaces
- Event Publishing: Workspace state changes tracked in Chronos
**Darwin Catalog**: Data asset discovery and governance
- Asset Management: Register datasets, tables, models
- Schema Tracking: Schema evolution and versioning
- Lineage: OpenLineage-based data lineage tracking
- Search: Full-text search across data assets (see the sketch below)
- Metadata: Tags, descriptions, ownership, quality metrics
- Integration: Spark and Airflow job lineage capture
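A speculative sketch of asset search over REST; the `/darwin-catalog/v1/assets` prefix comes from the access URLs and Service APIs sections of this README, while the query parameter name is an assumption:

```python
import requests

# Speculative sketch: full-text search across registered assets.
resp = requests.get(
    "http://localhost/darwin-catalog/v1/assets",
    params={"query": "housing"},  # parameter name assumed
)
print(resp.json())
```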
**Hermes CLI**: Command-line tool for streamlined ML operations
- Environment configuration and management
- Model serving project scaffolding (FastAPI templates)
- One-click model deployment
- Artifact build and deploy orchestration
- Configuration management (`.hermes` folder)
Installation & Setup:
```bash
# Included with Darwin Distribution
source hermes-cli/.venv/bin/activate
# Configure authentication
export HERMES_USER_TOKEN=admin-token-default-change-in-production
hermes configure
# Create environment (one-time setup per environment)
hermes create-environment \
--name local \
--domain-suffix .local \
--cluster-name kind \
  --namespace serve
```

For complete Hermes CLI documentation, see hermes-cli/CLI.md.
**Data Scientists** use Darwin for experimentation, training, and model development:
- Launch Ray clusters via SDK for distributed training
- Track experiments with MLflow
- Access features from Feature Store
- Deploy models with one-click Hermes CLI commands
**ML Engineers** use Darwin for production model deployment and monitoring:
- Configure multi-environment serves (dev/staging/prod)
- Build and deploy artifacts from GitHub
- Manage auto-scaling policies
- Monitor model performance and resource usage
**Data Engineers** use Darwin for feature pipelines and data infrastructure:
- Create and manage feature groups in Feature Store
- Build Spark-based feature engineering pipelines
- Track data lineage in Catalog
- Publish features to Kafka for real-time materialization
**Platform Engineers** use Darwin for infrastructure management and operations:
- Deploy and configure Darwin platform via Helm
- Manage Kubernetes resources and policies
- Monitor costs and resource utilization
- Configure multi-tenancy and RBAC
Prerequisites:

- Kubernetes cluster (Kind for local, EKS for production)
- Helm 3.8+
- kubectl
- Docker
- Python 3.9.7+
```bash
# 1. Initialize configuration (select components to enable)
./init.sh
# 2. Build platform images and setup Kind cluster
./setup.sh
# 3. Deploy Darwin platform to Kubernetes
./start.sh
```

During init.sh, you'll select which Darwin components to enable. Here's how to decide based on your workflow:
| If you want to... | Enable |
|---|---|
| Run distributed data processing jobs or spin up short-lived compute clusters | Compute |
| Work interactively with persistent code and notebooks attached to scalable clusters | Workspace (includes Compute) |
| Store, version, and serve features for ML training and inference | Feature Store |
| Track experiments, log metrics, and manage model versions | MLflow |
| Deploy trained models as real-time inference endpoints | Serve (includes Artifact Builder) |
| Discover and track lineage across datasets, models, and pipelines | Catalog |
| Capture platform events and build metadata graphs | Chronos |
Tip: Dependencies are resolved automatically. For example, enabling Workspace will also enable Compute, and enabling Serve will include Artifact Builder and MLflow.
Access Services:

- Compute: `http://localhost/compute/*`
- Feature Store: `http://localhost/feature-store/*`
- MLflow UI: `http://localhost/mlflow/*`
- Chronos API: `http://localhost/chronos/*`
- Catalog API: `http://localhost/darwin-catalog/*`
- Workspace: `http://localhost/workspace/*`
```bash
# Create a cluster via REST API
curl --location 'http://localhost/compute/cluster' \
--header 'Content-Type: application/json' \
--data-raw '{
"cluster_name": "my-first-cluster",
"tags": ["demo"],
"runtime": "0.0",
"inactive_time": 30,
"start_cluster": true,
"head_node_config": {
"cores": 4,
"memory": 8,
"node_capacity_type": "ondemand"
},
"worker_node_configs": [
{
"cores_per_pods": 2,
"memory_per_pods": 4,
"min_pods": 1,
"max_pods": 2,
"disk_setting": null,
"node_capacity_type": "ondemand"
}
],
"user": "[email protected]"
}'
# The create response includes a cluster_id

# Wait for the cluster to become active
curl http://localhost/compute/cluster/{cluster_id}/metadata
# Poll until the status shows "active"
# Get Cluster Dashboards link via below API using cluster_id
curl --location 'http://localhost/compute/cluster/{cluster_id}/dashboards'
# Access Jupyter notebook at the returned jupyter_lab_url
# Monitor Ray cluster at the ray_dashboard_url
# Stop the cluster when done
curl --location --request POST 'http://localhost/compute/cluster/stop-cluster/{cluster_id}' \
  --header 'msd-user: {"email": "user@example.com"}'
```

**Understanding the Runtime Parameter**
The `runtime` field specifies which pre-built Docker image to use for your Ray cluster. Darwin supports multiple runtimes with different Python versions and pre-installed libraries:
"0.0": Default runtime with Ray 2.37.0, Python 3.10, Spark 3.5.1, and darwin-sdk- Custom runtimes can be registered with specific library combinations
To check available runtimes:
```bash
curl http://localhost/compute/get-runtimes | python3 -m json.tool
```

Or use the Python SDK:
```bash
# Install SDK
pip install -e darwin-compute/sdk
```

```python
# Create a cluster
from darwin_compute import ComputeCluster
cluster = ComputeCluster(env="darwin-local")
response = cluster.create_with_yaml("examples/cluster-config.yaml")
cluster_id = response['cluster_id']
# Check and wait until cluster status becomes active
cluster.get_info(cluster_id)
# Stop when done
cluster.stop(cluster_id)
```

```bash
# Activate Hermes CLI
source hermes-cli/.venv/bin/activate
# 1. Configure Hermes CLI with authentication token
export HERMES_USER_TOKEN=admin-token-default-change-in-production
hermes configure
# 2. Create environment
hermes create-environment --name local --domain-suffix .local --cluster-name kind
# 3. Create serve
hermes create-serve \
--name iris-classifier \
--type api \
--space serve \
--description "Iris classification model"
# 4. Deploy model (one-click)
hermes deploy-model \
--serve-name iris-classifier \
--artifact-version v1 \
--model-uri mlflow-artifacts:/1/2b2b1b5727a14c5ca81b44e899979745/artifacts/model \
--cores 2 \
--memory 4 \
--node-capacity spot \
--min-replicas 1 \
--max-replicas 2
# 5. Make predictions
curl -X POST http://localhost/iris-classifier/predict \
-H "Content-Type: application/json" \
-d '{"features": [[5.1, 3.5, 1.4, 0.2]]}'π For more deployment options, see hermes-cli/CLI.md
```bash
# Install SDK
pip install -e feature-store/python/darwin_fs
```

```python
# Fetch features
from darwin_fs import FeatureStoreClient
fs = FeatureStoreClient(env="local")
features = fs.fetch_features(
feature_group_name="user_features",
feature_columns=["age", "tenure", "activity_score"],
primary_key_names=["user_id"],
primary_key_values=[[123], [456], [789]]
)
```

Darwin SDK provides seamless integration with Apache Spark on Ray clusters. Here's how to run distributed Spark workloads using Darwin as your Spark session provider:
```bash
curl --location 'http://localhost/compute/cluster' \
--header 'Content-Type: application/json' \
--data-raw '{
"cluster_name": "spark-demo-cluster",
"tags": ["spark", "demo"],
"runtime": "0.0",
"inactive_time": 60,
"start_cluster": true,
"head_node_config": {
"cores": 4,
"memory": 8,
"node_capacity_type": "ondemand"
},
"worker_node_configs": [{
"cores_per_pods": 2,
"memory_per_pods": 4,
"min_pods": 1,
"max_pods": 2,
"disk_setting": null,
"node_capacity_type": "ondemand"
}],
"user": "[email protected]"
}'
```

Save the cluster_id from the response.
```bash
# Check cluster status
curl http://localhost/compute/cluster/{cluster_id}/metadata
# Wait until status shows "active"
# Then verify pods are running
kubectl get pods -n ray -l ray.io/cluster={cluster_id}-kuberay
```

Create a file my_spark_job.py:
```python
#!/usr/bin/env python3
"""
Darwin SDK Spark Job Example
"""
import os
import ray
# Initialize Ray (connects to running Ray cluster)
ray.init()
# Set environment variables
os.environ["ENV"] = "LOCAL"
os.environ["CLUSTER_ID"] = os.getenv("CLUSTER_ID", "your-cluster-id")
os.environ["DARWIN_COMPUTE_URL"] = "http://darwin-compute.darwin.svc.cluster.local:8000"
print("=" * 60)
print("Darwin SDK Spark Job")
print(f"Cluster ID: {os.environ['CLUSTER_ID']}")
print("=" * 60)
# Initialize Spark using darwin-sdk
from darwin import init_spark_with_configs
spark_configs = {
"spark.sql.execution.arrow.pyspark.enabled": "true",
"spark.sql.session.timeZone": "UTC",
"spark.sql.shuffle.partitions": "10",
"spark.default.parallelism": "10",
"spark.driver.memory": "1g",
"spark.executor.memory": "1g",
}
spark = init_spark_with_configs(spark_configs=spark_configs)
print(f"β Spark initialized (version: {spark.version})")
# Create and process DataFrame
df = spark.createDataFrame([
(1, "Alice", 100),
(2, "Bob", 200),
(3, "Charlie", 300),
], ["id", "name", "score"])
print("\nDataFrame Contents:")
df.show()
print(f"\nTotal records: {df.count()}")
print(f"Average score: {df.agg({'score': 'avg'}).collect()[0][0]}")
# Stop Spark cleanly
from darwin import stop_spark
stop_spark()
print("\nβ Job completed successfully!")Option A: Using submit_spark_job.sh Script
cd darwin-sdk/darwin
./submit_spark_job.sh \
--cluster-name {cluster_id} \
--namespace ray \
--job-file /path/to/my_spark_job.py \
  --wait
```

**Option B: Using Ray Jobs API**

```bash
# Port-forward to Ray dashboard
kubectl port-forward -n ray svc/{cluster_id}-kuberay-head-svc 8265:8265 &
# Submit job
curl -X POST "http://localhost:8265/api/jobs/" \
-H "Content-Type: application/json" \
-d '{
"entrypoint": "python my_spark_job.py",
"runtime_env": {
"working_dir": "./",
"env_vars": {
"CLUSTER_ID": "'{cluster_id}'",
"ENV": "LOCAL"
}
},
"metadata": {
"name": "darwin-spark-demo"
}
  }'
```

**Option C: Using Ray Python Client**

```python
from ray.job_submission import JobSubmissionClient
client = JobSubmissionClient("http://localhost:8265")
job_id = client.submit_job(
entrypoint="python my_spark_job.py",
runtime_env={
"working_dir": "./",
"env_vars": {
"CLUSTER_ID": "{cluster_id}",
"ENV": "LOCAL"
}
}
)
print(f"Submitted job: {job_id}")
# Wait for completion
client.wait_until_status(job_id, "SUCCEEDED")
print(client.get_job_logs(job_id))
```

```bash
# Check job status
SUBMISSION_ID="raysubmit_xxxxxxxx"
curl "http://localhost:8265/api/jobs/${SUBMISSION_ID}"
# View job logs
curl "http://localhost:8265/api/jobs/${SUBMISSION_ID}/logs"
# Or use Ray Dashboard
open http://localhost:8265
```

| Function | Description |
|---|---|
| `init_spark_with_configs(spark_configs)` | Initialize Spark with custom configurations |
| `start_spark(spark_conf)` | Start Spark with default Glue catalog configs |
| `get_raydp_spark_session()` | Get existing Spark session |
| `stop_spark()` | Stop Spark session cleanly |
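A short sketch of how these helpers compose inside a job that has already initialized Spark via `init_spark_with_configs` (illustrative; it must run within the Ray cluster):

```python
from darwin import get_raydp_spark_session, stop_spark

# Reuse the session created earlier in the job, run a trivial query,
# then shut Spark down cleanly.
spark = get_raydp_spark_session()
print(spark.range(10).count())
stop_spark()
```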
Issue: "Runtime given is incorrect"
# Check available runtimes
curl http://localhost/compute/get-runtimes
```

**Issue**: Ray job stuck in PENDING

```bash
# Check Ray head pod
kubectl describe pod {cluster_id}-kuberay-head-xxx -n ray
```

**Issue**: Connection refused when submitting job

```bash
# Restart port-forward
pkill -f "port-forward.*8265"
kubectl port-forward -n ray svc/{cluster_id}-kuberay-head-svc 8265:8265 &
```

**Issue**: Cluster not starting due to long init script
If the `init_script` in your cluster configuration is too long, the cluster may fail to start: init scripts run during pod startup and are subject to timeout limits.
Solutions:
- Use the Library Installation API to install packages instead of init scripts
- Create a custom runtime with your dependencies pre-installed
- Split long scripts into smaller, essential commands
This guide walks you through your first end-to-end experience on Darwin, from compute creation to deployment.
Create a Ray cluster for your ML workload:
```bash
curl --location 'http://localhost/compute/cluster' \
--header 'Content-Type: application/json' \
--data-raw '{
"cluster_name": "housing-project",
"tags": ["tutorial", "housing-prices"],
"runtime": "0.0",
"inactive_time": 60,
"head_node_config": {
"cores": 4,
"memory": 8
},
"worker_node_configs": [
{
"cores": 2,
"memory": 4,
"min_pods": 1,
"max_pods": 2
}
],
"user": "[email protected]"
}'
```

Save the cluster_id from the response; you'll need it for the next steps.
Check your cluster status:
```bash
curl http://localhost/compute/cluster/{cluster_id}/metadata
```

Wait until the status shows active.
Once the cluster is ready, access the Jupyter notebook at:
http://localhost/kind-0/{cluster_id}-jupyter
Open this URL in your browser to start working in the workspace.
In the Jupyter notebook, copy and run the example project from /examples/housing-prices/. The model is logged automatically to MLflow.
Verify your trained model in the Darwin MLflow UI:
http://localhost/mlflow-app/experiments
Navigate to your experiment to see the registered model with metrics and parameters.
Deploy your trained model (replace <experiment_id> and <run_id> with values from MLflow UI):
Sample training script for house price prediction: examples/house-price-prediction/train_house_pricing_model.ipynb
```bash
# Activate Hermes CLI
source hermes-cli/.venv/bin/activate
# 1. Configure Hermes CLI with authentication token (one-time)
export HERMES_USER_TOKEN=admin-token-default-change-in-production
hermes configure
# 2. Create environment
hermes create-environment --name local --domain-suffix .local --cluster-name kind
# 3. Create serve
hermes create-serve \
--name housing-model \
--type api \
--space serve \
--description "House Price Prediction model"
# 4. Deploy model (one-click)
hermes deploy-model \
--serve-name housing-model \
--artifact-version v1 \
  --model-uri mlflow-artifacts:/<experiment_id>/<run_id>/artifacts/model \
--cores 2 \
--memory 4 \
--node-capacity spot \
--min-replicas 1 \
  --max-replicas 2
```

Test your deployed model:

```bash
curl -X POST http://localhost/housing-model/predict \
-H "Content-Type: application/json" \
-d '{
"features": {
"MedInc": 3.5214,
"HouseAge": 15.0,
"AveRooms": 6.575757575757576,
"AveBedrms": 1.0196969696969697,
"Population": 1447.0,
"AveOccup": 3.0144927536231883,
"Latitude": 37.63,
"Longitude": -122.43
}
}'
```

Once deployed, your model is accessible as a real-time inference API.
You can also try the Iris classification model as an alternative example:
Sample training script for iris classification: examples/iris-classification/train_iris_model.ipynb
Deploy the Iris model:
```bash
# Activate Hermes CLI
source hermes-cli/.venv/bin/activate
# 1. Configure Hermes CLI with authentication token (one-time)
export HERMES_USER_TOKEN=admin-token-default-change-in-production
hermes configure
# 2. Create environment
hermes create-environment --name local --domain-suffix .local --cluster-name kind
# 3. Create serve
hermes create-serve \
--name iris-classifier \
--type api \
--space serve \
--description "Iris Species Classification model"
# 4. Deploy model (one-click)
hermes deploy-model \
--serve-name iris-classifier \
--model-uri mlflow-artifacts:/<experiment_id>/<run_id>/artifacts/model \
--cores 2 \
--memory 4 \
--node-capacity spot \
--min-replicas 1 \
  --max-replicas 2
```

Test the Iris model:

```bash
curl -X POST http://localhost/iris-classifier/predict \
-H "Content-Type: application/json" \
-d '{
"features": {
"sepal_length": 5.1,
"sepal_width": 3.5,
"petal_length": 1.4,
"petal_width": 0.2
}
}'
```

For detailed deployment commands, see hermes-cli/CLI.md.
```mermaid
sequenceDiagram
participant DS as Data Scientist
participant Compute as Darwin Compute
participant MLflow as MLflow
participant FS as Feature Store
participant Serve as ML Serve
DS->>Compute: Create Ray cluster
Compute-->>DS: Cluster ID + Jupyter link
DS->>FS: Fetch training features
FS-->>DS: Feature dataset
DS->>Compute: Train model (Ray/Spark)
DS->>MLflow: Log experiment + model
DS->>Serve: Deploy model (Hermes CLI)
Serve->>MLflow: Fetch model artifact
Serve->>FS: Configure feature retrieval
Serve-->>DS: Inference endpoint
    DS->>Compute: Stop cluster
```

```mermaid
graph LR
A[Raw Data<br/>S3/Delta Lake] --> B[Spark Pipeline<br/>Feature Transform]
B --> C[Kafka Topic<br/>feature-updates]
C --> D[Feature Store<br/>Consumer]
D --> E[Cassandra<br/>Materialized Features]
E --> F[Online Serving<br/>< 10ms latency]
    B --> G[Catalog<br/>Lineage Tracking]
```
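To make the streaming leg concrete, here is a hedged sketch of publishing a feature update with confluent-kafka. The feature-updates topic name comes from the diagram above; the broker address and message shape are assumptions, since the real contract is defined by darwin-ofs-v2-consumer:

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # broker address assumed

# Message shape is illustrative; the consumer's expected schema governs.
update = {"feature_group": "user_engagement", "user_id": 123, "activity_score": 0.87}
producer.produce(
    "feature-updates",  # topic name from the diagram above
    key=str(update["user_id"]),
    value=json.dumps(update),
)
producer.flush()
```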
| Layer | Technologies |
|---|---|
| Languages | Python 3.9.7, Java 11, Go 1.18 |
| Compute | Ray 2.37.0, Apache Spark 3.5.1 |
| Web Frameworks | FastAPI, Spring Boot, Vert.x |
| Orchestration | Kubernetes (EKS/Kind), Helm 3, KubeRay Operator v1.1.0 |
| Databases | MySQL 8.0, Cassandra 5.0, OpenSearch 2.11 |
| Streaming | Apache Kafka 7.4.0 |
| Storage | S3 (AWS/LocalStack), FSx, EFS |
| Experiment Tracking | MLflow 2.12.2 |
| Monitoring | Prometheus, Grafana, Ray Dashboard |
| Container Registry | ECR, GCR, Local Docker Registry |
Darwin uses a declarative configuration approach:
Interactive wizard to select platform components:
```bash
./init.sh
# Prompts for enabling:
# - Applications (Compute, Feature Store, MLflow, etc.)
# - Datastores (MySQL, Cassandra, Kafka, etc.)
# - Ray images and Serve runtimes
```

Generates .setup/enabled-services.yaml with user selections.
Key configuration via config.env (auto-generated):
```bash
KUBECONFIG=./kind/config/kindkubeconfig.yaml
DOCKER_REGISTRY=127.0.0.1:32768
```

Customize deployments via helm/darwin/values.yaml:
```yaml
global:
  namespace: darwin
services:
  compute:
    enabled: true
    replicas: 2
datastores:
  mysql:
    enabled: true
  cassandra:
    enabled: true
```

**Local Development (Kind):**

- Single-node Kubernetes cluster
- Local Docker registry
- HostPath-based persistent storage
- Nginx Ingress at `localhost/*`
**Production (EKS):**

- Multi-AZ high availability
- Mixed spot/on-demand node groups
- Auto-scaling with Karpenter
- Network policies and security groups
- S3-backed artifact storage
- RDS for MySQL (optional)
- Multi-tenant namespace isolation
- Prometheus: Cluster resource utilization, service metrics
- Grafana: Pre-configured dashboards for compute, serving, features
- Ray Dashboard: Job execution, task profiling, resource usage
- Centralized logging via stdout/stderr
- Application logs in `/app/logs`
- Structured logging with context
- Chronos: Event-driven tracking of all platform operations
- Catalog: Data lineage via OpenLineage
- Elasticsearch-based search and analytics
- Slack integration for cost alerts
- Long-running cluster notifications
- Failed deployment alerts
- darwin-compute: Ray cluster management
- darwin_fs: Feature Store client
- darwin_mlflow: MLflow wrapper with auth
- darwin-workspace (internal): Workspace orchestration
All services expose FastAPI/Spring Boot REST APIs:
- Feature Store: `/feature-store/*`, `/feature-store-admin/*`
- Darwin Compute: `/cluster/*`, `/jupyter/*`
- ML Serve: `/api/v1/serve/*`, `/api/v1/artifact/*`
- Chronos: `/api/v1/event/*`, `/api/v1/sources/*`
- Catalog: `/v1/assets/*`, `/v1/lineage/*`
API documentation is available at `<service-url>/docs` (Swagger UI).
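Any of these endpoints can also be called directly; a minimal sketch polling cluster metadata, reusing the msd-user header from the stop-cluster example (whether GET endpoints require it is an assumption):

```python
import json
import requests

cluster_id = "abc123"  # from the create-cluster response
resp = requests.get(
    f"http://localhost/compute/cluster/{cluster_id}/metadata",
    headers={"msd-user": json.dumps({"email": "user@example.com"})},
)
print(resp.json().get("status"))  # field name assumed; poll until "active"
```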
Darwin provides pre-built Ray runtimes for cluster creation. The Runtime Name is what you pass to the API when creating clusters (e.g., "runtime": "0.1").
| Runtime Name | Image | Ray Version | Python | Class | Type |
|---|---|---|---|---|---|
| 0.0 | ray:2.37.0 | 2.37.0 | 3.10 | CPU | Ray Only |
| 0.1 | ray:2.53.0 | 2.53.0 | 3.10 | CPU | Ray Only |
Add new Ray runtimes by creating Dockerfiles in darwin-compute/runtimes/:
```dockerfile
# darwin-compute/runtimes/cpu/Ray2.37_Py3.11_CustomLibs/Dockerfile
FROM rayproject/ray:2.37.0-py311
RUN pip install jupyterlab==4.3.0
RUN pip install custom-library
```

Register in services.yaml:
```yaml
ray-images:
  - image-name: ray:2.37.0-py311-custom
    dockerfile-path: darwin-compute/runtimes/cpu/Ray2.37_Py3.11_CustomLibs
```

Create Python transformers for event processing:
```python
# Chronos transformer: maps a raw event to entities and relationships.
# Assumes the raw event carries user and cluster_id fields.
def transform(event):
    return {
        "event_type": "cluster_created",
        "entities": [{"type": "cluster", "id": event["cluster_id"]}],
        "relationships": [
            {"from": event["user"], "to": event["cluster_id"], "type": "owns"}
        ],
    }
```

See CONTRIBUTING.md for development setup, coding standards, and contribution guidelines.
[License information to be added]
For issues, questions, or feature requests, please open an issue in the repository or contact the platform team.
```
darwin-distro/
├── darwin-compute/          # Ray cluster management service
├── darwin-cluster-manager/  # Kubernetes orchestration (Go)
├── feature-store/           # Feature Store (Java)
├── mlflow/                  # MLflow experiment tracking
├── ml-serve-app/            # Model serving platform
├── artifact-builder/        # Docker image builder
├── chronos/                 # Event processing & metadata
├── workspace/               # Project & codespace management
├── darwin-catalog/          # Data catalog & lineage
├── hermes-cli/              # CLI tool for deployments
├── helm/                    # Helm charts for deployment
│   └── darwin/              # Umbrella chart
│       └── charts/
│           ├── datastores/  # MySQL, Cassandra, Kafka, etc.
│           └── services/    # Application services
├── deployer/                # Build scripts and base images
├── kind/                    # Local Kubernetes setup
├── examples/                # Example notebooks and configs
├── init.sh                  # Interactive configuration wizard
├── setup.sh                 # Build and cluster setup
├── start.sh                 # Deploy platform
└── services.yaml            # Service registry
```
Darwin ML Platform: unified, scalable, production-ready machine learning infrastructure.