This release includes a number of new features, improvements, and bug fixes.
New Features
- Automated Security Scanning: Introduced a Cloud Build workflow to automatically perform Helm scans on the ai-on-gke repository using the Shipshape validation service along with documentation to guide users on how to handle security violations identified by Shipshape. (#918, #920)
- 65k Node GKE Benchmark: Added a benchmark for GKE on a simulated AI workload using Terraform and ClusterLoader2, supporting clusters with 65,000 nodes. (#898)
Improvements
- DWS-Kueue Example: Bumped Kueue quotas to 1 billion per resource to provide almost unlimited admission in examples. (#911)
- Infrastructure Module Queued Provisioning: Added queued_provisioning setting on all node pools for compatibility with DWS. (#909)
- TCP Receive Buffer Limit Configuration: Added a daemonset to configure TCP receive buffer limits for improved DCN performance. (#906)
- SkyPilot Tutorial: Improved formatting and fixed typos in the SkyPilot tutorial. (#905, #907)
Documentation Clarity: Clarified instructions on obtaining billing account ID and folder ID. (#914)
Bug Fixes
- E2E Test Flakiness: Increased timeout to mitigate intermittent failures in E2E tests caused by delays in Ray pod startup. (#919)
- Slurm-on-GKE Container Image: Fixed shebang line position in Slurm-on-GKE container image. (#917)
- Directory Path: Fixed an incorrect directory path in the SLUM guide. (#912)
New Contributors
- @volatilemolotov made their first contribution in #905
- @besher-massri made their first contribution in #898
- @janetkuo made their first contribution in #914
Full Changelog: v1.8...v1.9