@guidebooks/store

guidebooks · 346 · Apache-2.0 · 8.0.1

The home for importable Guidebooks.

markdown, wizards, documentation

readme

The Guidebook Store

changelog

8.0.0 (2023-10-11)

Features

  • Update to mcad v1.34.1 support and torchx 0.6.0 (6e82995)

7.10.7 (2023-05-25)

Bug Fixes

  • more EOF protection fixes (8ad9bef)

7.10.6 (2023-05-13)

Bug Fixes

  • ray head init container should print a message when it is done waiting for workers (9e5b79b)

7.10.5 (2023-05-11)

Bug Fixes

  • cpu utilization information may be bogus; switch to cgroup-based stats (a409a7e)
  • increase max log requests for app logs (3d5ca5c)
  • ray head wait-for-workers initContainer should retry if wait fails (2ab27b6)

7.10.4 (2023-05-09)

Bug Fixes

  • multinic detection was broken; also was hard-wiring name of resource (84aacd6)

7.10.3 (2023-05-05)

Bug Fixes

  • custodian logs container fails due to unescaped $ in $TAIL (844de41)

7.10.2 (2023-05-03)

Bug Fixes

  • cache ray/torchx helm chart (eca853d)

7.10.1 (2023-05-03)

Bug Fixes

  • improve torchx support for running multiple gpus per pod (4ab703a)

7.10.0 (2023-05-02)

Bug Fixes

  • syntax error in multinic for torchx (010b5f6)

Features

7.9.17 (2023-05-02)

Bug Fixes

  • ray wait for workers initContainer not needed with 0 workers (4c3683b)

7.9.16 (2023-05-02)

Bug Fixes

  • use initContainer to wait for ray workers (905af3b)

7.9.15 (2023-05-01)

Bug Fixes

  • increase ray gcs rpc timeout to 30s (42ebe6e)

7.9.14 (2023-05-01)

Bug Fixes

  • more EOF resiliency fixes for ray and torchx (0c104cd)

7.9.13 (2023-05-01)

Bug Fixes

  • increase torchx log streaming resilience to network disconnects (3d0b0f3)

7.9.12 (2023-05-01)

Bug Fixes

  • wait for ray workers prior to server-side job submit (6f12b20)

7.9.11 (2023-04-28)

Bug Fixes

  • restore helm delete and increase resilience to network disconnects (c2513e6)

7.9.10 (2023-04-27)

Bug Fixes

  • avoid helm delete in custodian for now (d13558b)

Reverts

  • Revert "fix: avoid use of all-containers in ray log streamer" (4c9f176)

7.9.9 (2023-04-27)

Bug Fixes

  • avoid use of all-containers in ray log streamer (9639e25)

7.9.8 (2023-04-27)

Bug Fixes

  • increase memory for runtime-env custodian pod (1c0996c)

7.9.7 (2023-04-27)

Bug Fixes

  • increase memory for ray head logs container (87ae9cf)

7.9.6 (2023-04-27)

Bug Fixes

  • torchx volume mount paths have extra quotes (ac301ba)

7.9.5 (2023-04-27)

Bug Fixes

  • remove reliance on wget in ray head container (880acd9)

7.9.4 (2023-04-26)

Bug Fixes

  • ignore pycache when bundling up workdir (5e03801)
  • improve custodian memory requests for larger jobs (80d952d)

7.9.3 (2023-04-24)

Bug Fixes

  • improve support for pytorch lightning's fsspec[s3] support (91d2f11)

7.9.2 (2023-04-21)

Bug Fixes

  • add runtime-env-setup to custodian (ab0135a)
  • add worker-status to custodian (be2db0e)
  • do not create gpu custodian container for non-gpu runs (a753d8a)
  • lower memory requests for some of the custodian pods (48560f8)

7.9.1 (2023-04-20)

Bug Fixes

  • eliminate newlines from base64 (697b558)

7.9.0 (2023-04-20)

Bug Fixes

  • lower custodian logs container 100m/128Mi -> 50m/32Mi (18698ee)
  • use multi-line yaml to improve formatting of logs args (0a8de4b)

Features

  • add cpu utilization pod to custodian (be3f788)
  • add gpu utilization pod to custodian (5f4f74b)
  • add memory utilization pod to custodian (e0127d7)

7.8.5 (2023-04-19)

Bug Fixes

  • clean up custodian command, and rename container 'logs' (cb81a51)

7.8.4 (2023-04-18)

Bug Fixes

  • torchx cluster name may end with a dash (351df63)

7.8.3 (2023-04-18)

Bug Fixes

  • owner label default needs to be quoted (16b13d5)

7.8.2 (2023-04-18)

Bug Fixes

  • add app.kubernetes.io/owner label to pods (2872b1d)

7.8.1 (2023-04-17)

Bug Fixes

  • add 'app.kubernetes.io/managed-by: codeflare' label to custodian (701b8ab)

7.8.0 (2023-04-17)

Features

  • improve custodian support for torchx, use smaller base image (6bd38ef)

7.7.1 (2023-04-17)

Bug Fixes

  • logs custodian has errors with tee'ing to file (202a5a1)
  • logs custodian should pull from kubectl logs, not ray job logs (d866d45)

7.7.0 (2023-04-17)

Features

  • rename self-destruct to logs; and increase ttl timeout on its job (595014f)

7.6.3 (2023-04-13)

Bug Fixes

  • final Succeeded message not shown in ray jobs (f2280b8)

7.6.2 (2023-04-13)

Bug Fixes

  • further improvements to ray log streaming (11a7359)

7.6.1 (2023-04-13)

Bug Fixes

7.6.0 (2023-04-12)

Features

  • avoid websocat in ml/ray/run/logs (7c6bbb1)

7.5.19 (2023-04-12)

Bug Fixes

  • decrease epochs from 5 to 2 for getting started ray example (6c4ba16)
  • ray labels were using /name should use /instance (d7b7651)
  • websocat ray log streaming can be simplified (81349f7)

7.5.18 (2023-04-10)

Bug Fixes

  • vmstat data lacks pod/ prefix on pod name (c9a3350)

7.5.17 (2023-04-09)

Bug Fixes

  • ray jobs emit job env.json only after job is running (277108f)

7.5.16 (2023-04-07)

Bug Fixes

  • improve messaging of torchx wait-till-running (864d45c)
  • pod-memory stream lacked pod/ prefix for hostname (384058e)
  • torchx wait-till-running was not waiting till all workers were running (4b43524)

7.5.15 (2023-04-07)

Bug Fixes

  • torchx env isn't written out till the job is already running (3263606)

7.5.14 (2023-04-07)

Bug Fixes

  • capture job env vars for torchx runs (0cbe657)

7.5.13 (2023-04-07)

Bug Fixes

  • torchx captured logs may not include Succeeded/Failed events (e52abad)

7.5.12 (2023-04-07)

Bug Fixes

  • syntax error in code block in torchx status poller (02093d0)
  • torchx exit handlers were not right (559d2df)

7.5.11 (2023-04-06)

Bug Fixes

  • remove leftover 'set -x' from debugging (953d67c)
  • small refinements to torchx logs (3a21b49)
  • torchx job status file needs to use tee -a to append (b56decc)

7.5.10 (2023-04-06)

Bug Fixes

  • improve torchx status events to show Job status (e74480c)
  • improved event handling for torchx exit (fa75d8a)

7.5.9 (2023-04-06)

Bug Fixes

  • torchx jobs lacked kube event stream (7000d3b)

7.5.8 (2023-04-06)

Bug Fixes

  • torchx script logic fails if python prefix is not python3 (942304f)

7.5.7 (2023-04-05)

Bug Fixes

  • clean up content and coloring of helm install output (6b1b13d)

7.5.6 (2023-04-05)

Bug Fixes

  • torchx cli install fails on zsh (c027765)

7.5.5 (2023-04-05)

Bug Fixes

  • sed RE error can occur in torchx log streamer (fd9983d)

7.5.4 (2023-04-05)

Bug Fixes

  • pass through guidebook env vars to torchx (780dc56)

7.5.3 (2023-04-04)

Bug Fixes

  • ml/torchx/run may fail for users with long user names (a56b623)

7.5.2 (2023-04-04)

Bug Fixes

  • don't fail if we can't hack uid-range (d45caaf)
  • torchx log streamer would fail if lines contained control chars (2292c03)
  • update to official torchx 0.5.0 release (7be9130)

7.5.1 (2023-04-04)

Bug Fixes

  • in CI, don't try to use ssh git cloning for workdir (b0a69f8)

7.5.0 (2023-04-03)

Bug Fixes

  • ml/torchx/run fails if main python file is not 'main.py' (b36de03)

Features

  • add support for workdir being a github https:// url (e835081)

7.4.8 (2023-04-03)

Bug Fixes

  • another fix for relative workdir (3a8782f)

7.4.7 (2023-04-03)

Bug Fixes

  • further improvements to helm install with relative workdir (79bd176)
  • improved support for installing and running torchx on 3.9.6 on macOS (1bea28f)

7.4.6 (2023-04-03)

Bug Fixes

  • force vmstat timestamps to use UTC timezone (6393448)

7.4.5 (2023-04-01)

Bug Fixes

  • capture env.json in log aggregation (7045053)

7.4.4 (2023-03-31)

Bug Fixes

  • another fix to improve syntactic conformance of gpu utilization stream (501b0d8)

7.4.3 (2023-03-31)

Bug Fixes

  • gpu stream displays temps with % unit (cf877d8)

7.4.2 (2023-03-31)

Bug Fixes

  • update gpu utilization stream to conform to vmstat and events log structure (cb11acc)

7.4.1 (2023-03-31)

Bug Fixes

  • kubectl linux-arm64 installs arm32 binary (0889389)

7.4.0 (2023-03-31)

Bug Fixes

  • bump to madwizard@8 to adopt shell.stdin convention (e500089)

7.3.1 (2023-03-31)

Reverts

  • Revert "feat: ml/codeflare/training/byoc and ml/torchx/run should not kill on ctrl+c" (371f6bd)

7.3.0 (2023-03-31)

Bug Fixes

  • with MAX_WORKERS=0 ray worker job still starts with completions=1 (1ba6c3e)

Features

  • ml/codeflare/training/byoc and ml/torchx/run should not kill on ctrl+c (e5d95a3)

7.2.10 (2023-03-30)

Bug Fixes

  • dial down mcad resources for CI a bit more (0bbc31b)

7.2.9 (2023-03-30)

Bug Fixes

  • disable securityContext for ray (f3e0fe5)

7.2.8 (2023-03-30)

Bug Fixes

  • avoid the use of dim blue, as many terminal themes aren't happy with it (f7ccc58)

7.2.7 (2023-03-30)

Bug Fixes

  • use red for debug output from ray head and workers (5685108)

7.2.6 (2023-03-30)

Bug Fixes

  • bucket enumeration fails if access keys have trailing whitespace (f7b271d)
  • improve debuggability of ray head crashes (ff02be7)
  • kubernetes/kubectl support for linux/arm (2b9a8ee)
  • worker should retry a few times on failure (0ee8f0c)

7.2.5 (2023-03-30)

Bug Fixes

  • don't ask for bucket name for s3fs-only (7b4aff6)

7.2.4 (2023-03-30)

Bug Fixes

  • improve stability of ray port forward (8c9b8b6)
  • improve support for running locally on arm (b42d718)

7.2.3 (2023-03-29)

Bug Fixes

  • ml/ray/start/kubernetes/events uses incorrect awk comment character (b2352c4)

7.2.2 (2023-03-29)

Bug Fixes

  • add workaround for torchx dryrun regression (24d04dd)
  • improvements in error handling in ml/torchx/run/stop (2bfe483)

7.2.1 (2023-03-29)

Bug Fixes

7.2.0 (2023-03-28)

Bug Fixes

  • assign gpus to ray head, if we are assigning them to workers (9fa3307)
  • bump openshift/oc to 4.12.9 (55d0ff1)
  • kubernetes/choose/ns validator syntax fix (3373061)
  • log setup ends up waiting for job to be running (d6eaa49)
  • major cleanups to job log streaming (83771c6)
  • ml/ray/start/kubernetes/wait-for-head/workers cpu spin (0f85741)
  • ml/ray/stop/kubernetes may wait unnecessarily (289a749)
  • squash irrelevant errors in vmstat, etc. tracking (9c92291)
  • try to be more helpful on windows for kubectl and oc installation (cb6b9ac)

7.1.2 (2023-03-21)

Bug Fixes

  • improve handling of no aws profiles/no buckets situations (30be07b)
  • ml/codeflare/training/demos/getting-started/submit should offer submit-only (e2da046)

7.1.1 (2023-03-21)

Bug Fixes

  • if .aws/config does not define endpoint_url, bucket expansion fails (05140f0)

7.1.0 (2023-03-20)

Features

  • remove old roberta, bert, and glue logic (180a1ea)

7.0.1 (2023-03-20)

Bug Fixes

  • ml/torchx/run always calls set -x even when not dry-running (a0491cc)

7.0.0 (2023-03-19)

Features

  • support for selecting multiple s3 buckets (9ec8528)

BREAKING CHANGES

  • this updates s3/choose/bucket from single-select to multi-select, which may require re-selecting this choice

6.2.1 (2023-03-18)

Bug Fixes

  • ml/torchx/run fails with fractional mem e.g. 1.5Gi (2f03999)

6.2.0 (2023-03-18)

Features

  • consolidate ml/torchx resources and ml/ray resources (081d19e)

6.1.10 (2023-03-17)

Bug Fixes

  • gpu utilization does not stream out when late-attaching (be0559f)

6.1.9 (2023-03-16)

Bug Fixes

  • ml/torchx/run/stop fails to delete helm (1b68b35)

6.1.8 (2023-03-16)

Bug Fixes

  • ml/torchx/stop incorrectly attempted in dryrun (26f55e6)
  • pass through job priority to torchx run (5455f9c)
  • remove incorrect "Ray" in kubernetes/choose/ns title (16c39c9)

6.1.7 (2023-03-16)

Bug Fixes

  • simplify ray by avoiding ray workflows question for now (ba44bea)

6.1.6 (2023-03-16)

Bug Fixes

  • after switch from bzip2 to gzip, we need to rename workdir.tar.bz2 (19d6ae0)

6.1.5 (2023-03-16)

Bug Fixes

  • avoid use of bunzip2, in favor of gzip (cf25a2b)

6.1.4 (2023-03-14)

Bug Fixes

  • avoid use of -Winteractive on macos (6a89cad)
  • gpu utilization may not be streamed out (9925290)
  • when looping kubectl events, use watch-only after first iter (25a6511)

6.1.3 (2023-03-14)

Bug Fixes

  • ray self-destruct should establish security context (d6ed018)

6.1.2 (2023-03-14)

Bug Fixes

  • helm workdir fails for relative paths (1350e49)
  • we were looking for ERROR instead of FAILED for failed ray jobs (8a4d332)

6.1.1 (2023-03-14)

Bug Fixes

  • if ray process errors out, ray stop is never called (70cb403)
  • ray self-destruct may fail due to permissions (e0f41ab)
  • user may be prompted for run id, when it is already known (6a290e4)

6.1.0 (2023-03-14)

Bug Fixes

  • also give ray head a gpu (3a9536b)

Features

  • aws/init to guide the creation of an aws profile (9ab2c47)

6.0.11 (2023-03-13)

Bug Fixes

  • another fix for absent runtime-env (7797638)

6.0.10 (2023-03-13)

Bug Fixes

  • update to new torchx use of torchrun (cdf9e01)

6.0.9 (2023-03-13)

Bug Fixes

  • specify parallelism for ray worker Job (6a8e89d)

6.0.8 (2023-03-13)

Bug Fixes

  • improve support for milli cpus in ml/torchx/run (3469edc)

6.0.7 (2023-03-13)

Bug Fixes

  • kube event streamer should retry on 404 (b76dbb7)

6.0.6 (2023-03-13)

Bug Fixes

  • add kube event stream to ml/torchx/run (b78a035)

6.0.5 (2023-03-13)

Bug Fixes

  • add dashdash support to ml/torchx/run (ebf0a7e)
  • remove extra space in helm install output (f9133c2)
  • remove set -x (cb677f1)

6.0.4 (2023-03-12)

Bug Fixes

  • don't pass through volumes and image pull secret if not provided (9bf76d6)

6.0.3 (2023-03-12)

Bug Fixes

  • missing -E for torchx memMB; also add case-insensitivity (9fddc45)

6.0.2 (2023-03-12)

Bug Fixes

  • allow ml/torchx/run to choose pod scheduler (f3394d4)
  • ml/torchx/run misinterprets Mi memory units (00b2c77)
  • remove unnecessary mention of 'ray' in helm install output (a6925c9)

6.0.1 (2023-03-12)

Bug Fixes

  • use 0.5.0dev0 as the name for the torchx venv (b1b760b)

6.0.0 (2023-03-12)

Features

  • add torchx/run to ml/codeflare/run (92470f7)

BREAKING CHANGES

  • this changes the menu structure for ml/codeflare/run, which may require test updates.
  • this also removes the old "codeflare model architecture" option from ml/codeflare/run

5.7.0 (2023-03-11)

Bug Fixes

  • add worker index to log output (6d8b12a)
  • ml/torchx/run fails with nWorkers>1 (377d183)
  • shorten prefix of log lines (250f486)
  • small wordsmithing to title of ml/torchx/run/stop (ef16c61)

Features

  • add cpu/gpu/mem utilization to ml/torchx/run/logs (8a23455)
  • ml/torchx/run (9ed5a10)
  • restore torchx cli install (53d6ba5)

5.6.2 (2023-03-09)

Bug Fixes

  • ray would fail if workdir had no runtime-env.yaml nor requirements.txt (a985e0e)

5.6.1 (2023-03-07)

Bug Fixes

  • ugh, restore awk -Winteractive (3057937)

5.6.0 (2023-03-07)

Features

  • move cpu and memory utilization to be first column (ca00dc4)

5.5.6 (2023-03-07)

Bug Fixes

  • vmstats were using sed -l in a way that only made sense on BSD sed (c30da29)

5.5.5 (2023-03-07)

Bug Fixes

  • helm install was not passing through desired command line prefix (75026f1)

5.5.4 (2023-03-06)

Bug Fixes

  • avoid use of env vars for workdir tarball (6dd6c4b)

5.5.3 (2023-03-06)

Bug Fixes

  • correct base64 decoding from prior commit (035d604)
  • eliminate newlines from base64-encoded strings (56bf44e)

5.5.2 (2023-03-06)

Bug Fixes

  • helm install command line may be too long (3a03b7a)

5.5.1 (2023-03-06)

Bug Fixes

  • further improvements to streaming output of vmstat on linux (2a975ec)
  • improve visibility of RAY_ADDRESS port forward announcement (ec045a4)

5.5.0 (2023-03-05)

Features

  • allow BYOC to specify command line prefix (99d67d7)

5.4.10 (2023-03-05)

Bug Fixes

  • avoid use of grep in pod-vmstat and kube events to improve streaming on linux (9660ec1)

5.4.9 (2023-03-04)

Bug Fixes

  • kube events lines are way too wide (8ce7175)

5.4.8 (2023-03-03)

Bug Fixes

  • sed invalid reference errors (e6ae764)
  • simplify ray logs to focus on websocat (e81009c)

5.4.7 (2023-03-03)

Bug Fixes

  • allow workdirs to specify exclusion rules via .rayignore (f2a7fc3)
  • decrease the frequency of nvidia-smi from 2s to 10s (19bf7df)
  • improve messaging after ray job submit (58a6f76)
  • ml/ray/run/vmstat may try to exec into Successful pod (37e7b8e)

5.4.6 (2023-03-03)

Bug Fixes

  • linux base64 creates newlines (946fb9e)

5.4.5 (2023-03-03)

Bug Fixes

  • use set-file for job env in helm install (550778e)
  • use set-file for the base64-encoded workdir (18d5b48)

5.4.4 (2023-03-03)

Bug Fixes

  • remove unused HELM_DEBUG in helm install script (185a892)

5.4.3 (2023-03-03)

Bug Fixes

  • hmm, some of the tputs cause problems in github actions (615946a)

5.4.2 (2023-03-03)

Bug Fixes

  • improved error handling in helm install for lack of cluster name (94e2720)

5.4.1 (2023-03-03)

Bug Fixes

  • improved error handling when custom workdir not found (e850fce)

5.4.0 (2023-03-03)

Features

  • submit ray job in helm chart (c20de16)

5.3.4 (2023-03-02)

Bug Fixes

  • add a byoc/submitonly task (d49bcd1)

5.3.3 (2023-03-01)

Bug Fixes

  • improved error handling in log streaming for pods disappearing (5595911)

5.3.2 (2023-03-01)

Bug Fixes

  • make sure not to set priorityClassName if not using job priorities (3253b50)

5.3.1 (2023-03-01)

Bug Fixes

  • treat default scheduling priority as not enabling prioritization (f762ae8)

5.3.0 (2023-03-01)

Bug Fixes

  • gpu-utilization.md may fail on zsh if profile specifies no gpus (90a7ac9)

Features

  • allow selection of scheduling priority for MCAD (7a7eda9)

5.2.3 (2023-03-01)

Bug Fixes

  • remove use of GUIDEBOOK_DASHDASH (bd120f8)

5.2.2 (2023-03-01)

Bug Fixes

  • increase ray head+worker wait timeout (46e0b27)

5.2.1 (2023-03-01)

Bug Fixes

  • ray helm chart podgroup had a 10 second scheduling timeout (e916c1a)

5.2.0 (2023-02-28)

Features

  • allow use of vmstat and gpu utilization tracking outside of ray (2a4b7a4)

5.1.7 (2023-02-23)

Bug Fixes

  • utilization was flowing to both stderr and stdout (75c2cb2)

5.1.6 (2023-02-23)

Bug Fixes

  • ml/ray/run utilization info should flow to stderr (bbf2c4d)

5.1.5 (2023-02-23)

Bug Fixes

  • ml/ray/run/logs/via-websocat fails on zsh due to assignment to status (72fdec7)

5.1.4 (2023-02-22)

Bug Fixes

  • ray cleaner might stick around, and may consume lots of cpu/mem (0f88584)

5.1.3 (2023-02-22)

Bug Fixes

  • ray self-destruct should use image pull secret (f1ebf89)

5.1.2 (2023-02-22)

Bug Fixes

  • linux websocat installation does not create /usr/local/bin/websocat (f4b8ed4)

5.1.1 (2023-02-22)

Bug Fixes

  • improve ml/ray/run/job-definition error handling (040f228)
  • ml/ray/run/logs/via-websocat may prematurely exit and should mimic ray job submit success message (1bb3dbc)
  • ml/ray/run/vmstat fails due to missing line continuations (ffdfda8)
  • update ml/ray/run/vmstat to use consistent timestamp format and user's timezone (4269261)
  • util/websocat may fail on linux due to permissions (d6b2be2)

5.1.0 (2023-02-22)

Features

  • update ml/codeflare/training/byoc to use improved logs (02d1f29)

5.0.0 (2023-02-21)

Features

  • remove ray autoscaler option in ml/ray/cluster/kubernetes/choose-pod-scheduler (5db8f4c)

BREAKING CHANGES

  • this removes support for using the ray operator. There is some breaking skew in ray 2 that we have not accounted for in our fork of the ray helm chart.

4.0.5 (2023-02-21)

Bug Fixes

  • ray kube label selectors were not discriminating by job (ddad4d8)

4.0.4 (2023-02-21)

Bug Fixes

  • allow tests to specify the name of the ray-head service (9f16453)

4.0.3 (2023-02-20)

Bug Fixes

  • another fix for pod-vmstat colors (7801065)

4.0.2 (2023-02-20)

Bug Fixes

  • restore colors in pod-vmstat (b1d5e42)
  • some ray kubernetes resource names may exceed 63 chars (642c842)

4.0.1 (2023-02-20)

Bug Fixes

  • add missing free memory to pod-vmstat, and missing timestamp to pod-vmstat-memory (041215c)

4.0.0 (2023-02-20)

Features

  • improve console readability of ml/ray/run/pod-vmstat and memory stats (e5fd5a6)

BREAKING CHANGES

  • this breaks any clients that assumed the standard vmstat format for the cpu stats; the output is now limited to the cpu columns (i.e. us sy id wa st).

3.3.12 (2023-02-19)

Bug Fixes

3.3.11 (2023-02-19)

Bug Fixes

  • ml/ray/run/logs/via-websocat may fail with errno 22 (0739192)

3.3.10 (2023-02-19)

Bug Fixes

  • ml/ray/aggregator/with-jobid does not prepare ray cluster (6d0d8c4)

3.3.9 (2023-02-17)

Bug Fixes

  • ray-api port-forward should use a pod-running-timeout (7ae6e13)

3.3.8 (2023-02-17)

Bug Fixes

  • ray helm chart names may have invalid uppercase characters (16c6c3c)

3.3.7 (2023-02-17)

Bug Fixes

  • ray helm chart names may exceed 53 characters (41c878b)

3.3.6 (2023-02-16)

Bug Fixes

  • self-destruct rbacs should be versioned (1fa9e75)

3.3.5 (2023-02-16)

Bug Fixes

  • squash errors from helm auto-cleanup (4f0e16a)

3.3.4 (2023-02-16)

Bug Fixes

  • self-destruct does not properly clean up on job termination (ece4711)

3.3.3 (2023-02-16)

Bug Fixes

  • self-destruct cleanup failure should not cause overall failure (cb6399b)

3.3.2 (2023-02-16)

Bug Fixes

  • ray self-destruct fails with $USER is not defined (0aae213)
  • ray self-destruct should clean up after itself (0e4e625)

3.3.1 (2023-02-16)

Bug Fixes

  • restore finally in ml/codeflare/training/byoc (654ead7)

3.3.0 (2023-02-16)

Features

  • ray helm chart should tear itself down when job completes (3fa402b)

3.2.0 (2023-02-15)

Features

  • clean up ml/ray/start/kubernetes (91a1f9e)
  • discontinue import of ray cli for ml/ray/start (1acec02)

3.1.1 (2023-02-14)

Bug Fixes

  • ml/codeflare/tuning/glue/submit lacks closing triple-backtick (c0c6bf3)

3.1.0 (2023-02-14)

Bug Fixes

  • remove unnecessary import of ml/ray/cli/install (a474adb)

Features

  • remove funky ml/torchx/install/cli with its odd ray cli dependence (4693c31)
  • remove ml/ray/stop/local (be2039a)

3.0.0 (2023-02-14)

Bug Fixes

  • further porting to byoc-style (9082933)

Features

  • remove ancient ml/ray/run/.../examples and python/pip/... (b12dce7)

BREAKING CHANGES

  • removal of guidebooks

2.3.6 (2023-02-14)

Bug Fixes

  • port ml/codeflare/training/demos/getting-started to simpler byoc-style (d38c9bf)
  • ray service was not selective enough (92dd13c)

2.3.5 (2023-02-13)

Bug Fixes

  • add app.kubernetes.io/name label to ray helm chart worker and head (7135279)
  • we were hard-coding ray v2 for ml/ray/run/choose (68f30fc)

2.3.4 (2023-02-10)

Bug Fixes

  • ray server-side log aggregator emits debug/verbose output on client (81fb933)

2.3.3 (2023-02-10)

Bug Fixes

  • improve message announcing ray api port forward (e7e7f79)
  • ml/ray/run/choose should wait a bit for ray job to be active (01eba4a)

2.3.2 (2023-02-09)

Bug Fixes

  • ray in-cluster attach had bit rotted (f517379)

2.3.1 (2023-02-07)

Bug Fixes

  • update default ray cluster name to use $USER (ccc00f8)

2.3.0 (2023-02-02)

Features

  • allow ray jobs to specify ephemeral storage requirements (58c456a)

2.2.10 (2023-02-02)

Bug Fixes

  • avoid pvc name collisions e.g. pvc-0 (0d35514)
  • ray head node should also mount s3fs data (98f0df2)

2.2.9 (2023-02-01)

Bug Fixes

  • ml/ray/stop/kubernetes/with-known-cluster-name should check if it has a ray cluster name (bfc3810)

2.2.8 (2023-02-01)

Bug Fixes

  • kube events no longer stream (c68ffdf)

2.2.7 (2023-02-01)

Bug Fixes

  • inconsistent use of dots and dashes in pvc claim name (9c64c34)

2.2.6 (2023-02-01)

Bug Fixes

  • use echo -n when base64-encoding secrets (6885257)

2.2.5 (2023-02-01)

Bug Fixes

  • improved filtering of kubernetes events (60cee25)
  • kube volume name may exceed 63 chars (9b807b9)
  • ml/ray/start/kubernetes should only track events for my job (2beb22b)
  • s3fs secrets need to be base64 encoded (51ef88f)

2.2.4 (2023-02-01)

Bug Fixes

  • move s3fs pvc and secret to helm chart (d12fc5b)
  • pvc name should use jobid not cluster name (eed5f8e)

2.2.3 (2023-02-01)

Bug Fixes

  • pvc configuration does not handle spaces in mountPath (da1bb9a)
  • s3/choose/s3fs/kubernetes may fail to create pvc (c7afbba)
  • s3fs secret and claim name should hash in cluster name (8246d4f)

2.2.2 (2023-01-31)

Bug Fixes

  • add missing choice description for s3/choose/s3fs/storage-class (3a3c905)

2.2.1 (2023-01-31)

Bug Fixes

  • bump to madwizard ^4.5.0 to pick up :::: fix (34b0390)

2.2.0 (2023-01-31)

Features

  • allow users to choose storage class in s3/choose/s3fs/kubernetes (3bb90ce)

2.1.2 (2023-01-17)

Bug Fixes

  • ray helm chart syntax error for imagePullSecrets (56f59e7)

2.1.1 (2023-01-17)

Bug Fixes

  • helm uninstall (ray stop) issued immediately after helm install can leave dangling helm (a0a2712)
  • work around finally ordering issue (ac9ca4b)

2.1.0 (2023-01-17)

Bug Fixes

  • minor change in import kubectl.md -> kubectl (6c60ef7)

Features

2.0.3 (2023-01-13)

Bug Fixes

  • install-via-helm should allow user to request dry-run (c5cbecc)

2.0.2 (2023-01-12)

Bug Fixes

  • ml/codeflare/training/demos/getting-started doesn't tear down ray (4913b1a)
  • ml/ray/start helm chart schema violation for imagePullSecret (87daba9)

2.0.1 (2023-01-11)

Bug Fixes

  • kubernetes/choose/ns should also filter out calico-system and tigera-operator (d016a2d)

2.0.0 (2023-01-11)

Bug Fixes

  • kubernetes image-pull can hang if cluster is not reachable (76f9ecc)

BREAKING CHANGES

  • rename kubernetes/secrets/image-pull to kubernetes/choose/secret/image-pull

1.11.5 (2023-01-10)

Bug Fixes

  • ugh, bumper again messed up install-via-helm.sh (e995a28)

1.11.4 (2023-01-10)

Bug Fixes

  • add more missing description paragraphs (11c5991)

1.11.3 (2023-01-10)

Bug Fixes

  • improved description text for byoc and roberta (1e6a5d9)

1.11.2 (2023-01-10)

Bug Fixes

  • ml/ray/run/pod-vmstat-memory does not support cgroup v2 api (78cd603)

1.11.1 (2023-01-09)

Bug Fixes

  • ml/codeflare/training/byoc may not stop ray cluster automatically (944676a)
  • ray helm chart install was not passing through imagePullSecret (3694ce4)

1.11.0 (2023-01-09)

Bug Fixes

  • improve wording of cancel in ml/ray/stop/kubernetes (ad6ce31)

Features

  • initial support for image pull secrets (19626e8)

1.10.1 (2023-01-09)

Bug Fixes

  • s3fs guidebooks may result in asking the s3/choose/instance question twice (923b74f)

1.10.0 (2023-01-06)

Bug Fixes

  • minor tweak to demo guidebooks to leverage $choice variable (c77fe92)

Features

  • generalize s3fs support (682388d)

1.9.4 (2022-12-25)

Bug Fixes

  • flesh out the demo/multi descriptions and choices (3019689)

1.9.3 (2022-12-17)

Bug Fixes

  • ml/ray/install/cli assumed venv support was installed (3c98a48)

1.9.2 (2022-12-12)

Bug Fixes

  • add form and multiselect examples to demo guidebook (ffafa82)

1.9.1 (2022-12-12)

Bug Fixes

  • switch from madwizard-cli to madwizard-cli-core (2ff8caa)

1.9.0 (2022-12-08)

Features

  • switch from madwizard to madwizard-cli for madwizard cli (f747168)

1.8.0 (2022-12-08)

Bug Fixes

Features

  • add s3/minio to facilitate standing up minio in kubernetes (2baa4d2)

1.7.1 (2022-12-02)

Bug Fixes

  • add missing title to ml/ray/storage/s3/maybe (9109e1d)

1.7.0 (2022-12-02)

Bug Fixes

  • handle 0-worker case for ray (9b94f4e)

Features

  • add s3 secrets to ray head deployment so that --storage=s3:// works (355d230)

1.6.1 (2022-12-02)

Bug Fixes

  • leftover reference to using /tmp for ray storage (7aea102)
  • ml/ray/start should not need torchx (4f24298)

1.6.0 (2022-12-01)

Features

  • update ml/ray/cluster/choose to use madwizard env-keyed choice (cc87842)

1.5.0 (2022-12-01)

Features

  • install ray as a venv that is managed by us (7eba971)

1.4.3 (2022-12-01)

Bug Fixes

  • ml/ray/cluster/choose does not work against ray's helm chart (4ad60e4)
  • ray charts use fixed name for rbac resources (7dc8834)
  • use /dev/shm for ray --storage to allow sharing for workflows (eae870f)

1.4.2 (2022-12-01)

Bug Fixes

  • ml/ray/stop may not list ray clusters (9934933)

1.4.1 (2022-11-30)

Bug Fixes

  • ml/ray/cluster/choose was hard-wiring kube context and ns (ee28d56)

1.4.0 (2022-11-30)

Features

  • ray cluster name should vary by job (b2ac6cf)

1.3.0 (2022-11-30)

Features

  • allow use of ibmcloud codeengine for kubernetes context selection (235d006)

1.2.3 (2022-11-29)

Bug Fixes

  • improve description for kubernetes/choose/ns (6ff5c2c)
  • improve description for roberta choose-data (47fdb9c)
  • some optimizations for ibmcloud expansions (43a9d05)

1.2.2 (2022-11-29)

Bug Fixes

  • ml/ray/codeflare/training/roberta used base image lacking torch (f07c9a9)

1.2.1 (2022-11-28)

Bug Fixes

  • ml/codeflare/training/roberta was using non-gpu ray base image (312ad1f)

1.2.0 (2022-11-28)

Bug Fixes

  • sigh, release-it/bumper replaced '/1000' with '.1.0' (277cef1)

1.1.1 (2022-11-27)

Bug Fixes

  • force ml/codeflare/training/roberta to use ray v1 (17041e2)

1.1.0 (2022-11-23)

Bug Fixes

  • make ray operator base image consistent with main base image (0e67a5a)

Features

  • add ml/ray/v1 to allow use cases to insist on using ray v1 (236d493)

1.0.0 (2022-11-23)

Bug Fixes

  • set storage path in ray head start (43a7544)

Features

  • bump default ray image to 2.1 (from 1.13.0) (b2f8e11)

BREAKING CHANGES

  • this is a major update to the ray api

0.18.2 (2022-11-23)

Bug Fixes

  • back out adding RAY_ADDRESS to head pod (65dfb61)

0.18.1 (2022-11-22)

Bug Fixes

  • ml/ray/start/kubernetes/port-forward/ray causes cpu spinloop (59593af)

0.18.0 (2022-11-22)

Features

  • add mcad completionstatus field to ray worker and head jobs (7dc2874)
  • automatically shut down ray upon completion, for roberta and byoc (19c5ed9)
  • bump mcad to pick up completionstatus support (0d511b6)
  • switch from Deployment to Job for ray (425e5f2)

0.17.7 (2022-11-18)

Bug Fixes

  • refinements to title and description of ml/codeflare/training/byoc (9ef0ad6)

0.17.6 (2022-11-18)

Bug Fixes

  • bump to madwizard 1.8.x to pick up description fixes (d82538e)

0.17.5 (2022-11-18)

Bug Fixes

  • improved descriptions for ml/codeflare and ml/codeflare/run (c84e290)

0.17.4 (2022-11-11)

Bug Fixes

  • kubernetes/mcad/choose/scheduler should allow tests to set default scheduler (30edda8)

0.17.3 (2022-11-10)

Bug Fixes

  • ml/tensorboard/start/kubernetes/install.sh has typos w.r.t. branch checkout (0f68602)

0.17.2 (2022-11-10)

Bug Fixes

  • workaround for madwizard bug in ml/codeflare/training/roberta/choose-data-public (72a406f)

0.17.0 (2022-11-09)

Features

  • split out ray advanced topics from ml/codeflare (8c21d30)

0.16.0 (2022-11-09)

Bug Fixes

  • print out port or port forward (bbd218c)
  • remove question marks from ml/codeflare and ml/codeflare/run titles (ada0e0f)
  • update the terminology around choosing s3 credentials (d6abd74)
  • use coscheduler for preinstalled mcad option (2604681)

Features

  • add ml/torchx/install/cli (1e96ccd)
  • add RAY_ADDRESS env var to ray head (d6b0a6a)
  • add support for torchx run (e523eac)

0.15.2 (2022-10-26)

Bug Fixes

  • roberta s3fs did delete+create; replace with validator (4039e4a)

0.15.1 (2022-10-25)

Bug Fixes

  • due to madwizard bug, aws/auth does not work (c68be15)

0.15.0 (2022-10-25)

Bug Fixes

  • add validators to aws/install/cli (53114c0)
  • s3fs pvc used deprecated storageclass (f77bd0c)
  • util/spinner does not properly update and erase spinner (9482107)
  • util/spinner may not hide cursor in non-pty stdouts (c77fa4d)

Features

0.14.3 (2022-09-28)

Bug Fixes

  • improved error messages for docker/install (197ba62)
  • update kubernetes/kind to install kind if needed (6d296ee)

0.14.2 (2022-09-27)

Bug Fixes

  • missing title for ml/codeflare/training/demos (30cedd3)

0.14.1 (2022-09-27)

Bug Fixes

  • kubernetes/choose/ns shows up in profile as kubernetes/choose/ns-with-context (d910ced)

0.14.0 (2022-09-26)

Bug Fixes

  • ml/ray/cluster/kubernetes/choose-pod-scheduler was not properly importing mcad choice (3eb1192)

Features

  • remove untested ray local remnants (b1bf712)

0.13.1 (2022-09-25)

Bug Fixes

  • bump to madwizard 1.0.2 to pick up aprioris fixes (080737a)

0.13.0 (2022-09-24)

Bug Fixes

  • ml/ray/start/kubernetes expresses import dependencies that aren't needed (968dd9f)
  • ml/ray/start/kubernetes may leave "kubernetes.txt" in CWD (de3c273)
  • roberta with sample input should use /tmp for logdir (8523223)

Features

  • add kubernetes/choose/ns-with-context to allow choice of ns with an a priori context choice (f20a69e)
  • add util/envsubst and util/envsubst2 (7aed3f0)
  • bump to madwizard 1.0.0 which mandates use of imports versus inlining (1e1b943)

0.12.6 (2022-09-19)

Bug Fixes

  • pin openshift oc version to 4.10.33 (3a2a08d)

0.12.5 (2022-09-19)

Bug Fixes

  • ml/codeflare/training/roberta fails to fetch sample data in non-ibm clouds (d48bd70)
  • ml/codeflare/training/roberta sparse clone fails to checkout (88ae463)

0.12.4 (2022-09-16)

Bug Fixes

  • use pip3 rather than pip to install deps (3ca3fc6)

0.12.3 (2022-09-16)

Bug Fixes

  • sparse checkout branches need not be main/master (f1ef3d4)

0.12.2 (2022-09-16)

Bug Fixes

  • sparse checkout update uses wrong branch for mcad and coscheduler (63b89b2)

0.12.1 (2022-09-16)

Bug Fixes

  • improve git compatibility with sparse checkout (7c6cbdf)

0.12.0 (2022-09-16)

Features

  • discontinue support for ray local (ray kube only) (431dc11)

0.11.3 (2022-09-15)

Bug Fixes

  • enforce more restrictive securityContext for ray pods (c89073a)

0.11.2 (2022-09-15)

Bug Fixes

  • add more PATH discovery options to ml/ray/install/cli (a32299a)

0.11.1 (2022-09-15)

Bug Fixes

  • adjust default ray cluster resources to be less demanding (423b107)
  • another fix for re-installing the ray cli (66b52df)
  • if user selects "create namespace", then profile can get stuck (87f9048)
  • kubernetes/choose/ns does not set KUBE_NS_ARG for "create namespace" option (bb41e24)
  • on macOS, ray may still not be on PATH even after installation of the ray CLI (0fc24c2)

0.11.0 (2022-09-14)

Features

  • add ml/codeflare/training/demos/getting-started/s3 (199f402)

0.10.3 (2022-09-14)

Bug Fixes

  • kubectl context and namespace args have to be after verb (4deefc5)

0.10.2 (2022-09-02)

Bug Fixes

  • /ml/codeflare/training/roberta/choose-data-public sets wrong variable (1021cd4)
  • avoid detached HEAD warnings from ray helm git clone (9fc9b0f)

0.10.1 (2022-09-02)

Bug Fixes

  • pvc delete command missing $ in $ML_CODEFLARE_ROBERTA_S3FS_CLAIM (d9efd2b)

0.10.0 (2022-09-02)

Bug Fixes

  • allow ray workers to use as much cpu as they need, if available (6b96256)
  • back out prior removal of cpu limit (983bb5c)
  • ml/ray/start/kubernetes/install-via-helm.sh needs to pin a version for git clone (7cd557f)
  • ml/ray/start/kubernetes/port-forward/ray should retry (1744483)
  • ml/ray/stop should not require the ray cli (df28bfc)

Features

  • allow guidebooks/ml/codeflare/training/roberta/demo s3 defaults to be overridden (cfdd201)
  • support for s3fs in ml/codeflare/training/roberta (0fd23a0)

0.9.0 (2022-08-31)

Features

  • update ml/codeflare/training/roberta/clone.sh to support ssh keys from secrets (ae21a5b)

0.8.3 (2022-08-31)

Bug Fixes

  • ibmcloud detector was not detecting anything (72a172b)

0.8.2 (2022-08-31)

Bug Fixes

  • increase connection timeout in ibmcloud detector (a13afad)

0.8.1 (2022-08-31)

Bug Fixes

  • in ml/codeflare/training/roberta, lower b_size for sample input (4e2e6b2)

0.8.0 (2022-08-31)

Bug Fixes

  • add get verb to ray head node role (0094ee8)

Features

  • associate ray head pod with a serviceaccount (9872455)

0.7.3 (2022-08-31)

Bug Fixes

  • update ml/codeflare/training/roberta to work outside of ibmcloud (73e2ccf)

0.7.2 (2022-08-31)

Bug Fixes

  • ml/codeflare/training/roberta regression in job submission (d6ae0f6)

0.7.1 (2022-08-31)

Bug Fixes

  • add a marker to the end of the getting-started guidebook (1dbd721)

0.7.0 (2022-08-30)

Features

  • add ml/codeflare/training/demos/getting-started (857964f)

0.6.7 (2022-08-30)

Bug Fixes

  • improve intro paragraph to the roberta guidebook (5421213)

0.6.6 (2022-08-30)

0.6.5 (2022-08-30)

Bug Fixes

  • allow roberta github location defaults to be overridden (2b167c8)

0.6.4 (2022-08-29)

Bug Fixes

  • switch to a stability branch of the ml/codeflare/training/roberta code (9f4d0bb)

0.6.3 (2022-08-29)

Bug Fixes

  • ml/ray/cluster/kubernetes/is-ready can emit "workers 1/0" (6ebccce)

0.6.2 (2022-08-28)

Bug Fixes

  • update ml/codeflare/training/roberta to specify --no-input to ray-submit intrinsic (b955284)

0.6.1 (2022-08-27)

Bug Fixes

  • restore ray cli-based log streaming (0b89f4d)

0.6.0 (2022-08-26)

Bug Fixes

  • ray helm chart incorrectly always uses 1 worker replica (569c4ad)

Features

  • add ml/codeflare/training/roberta (f2e6a6b)

0.5.8 (2022-08-25)

Bug Fixes

  • change ^ to >= for peer dependence on madwizard (43ab1f6)

0.5.7 (2022-08-23)

Bug Fixes

  • a few minor refinements to log/debug output (40447b0)
  • ml/ray/cluster/head lacks KUBE_CONTEXT/NS in kubectl command (b6adff1)
  • remove old non-websocat way of fetching ray job logs (be6c829)

0.5.6 (2022-08-23)

Bug Fixes

  • server-side aggregator needs to wait for job to be actually running (0048c68)

0.5.5 (2022-08-23)

Bug Fixes

0.5.4 (2022-08-23)

Bug Fixes

  • two more fixes for log aggregator not waiting long enough (0853310)

0.5.3 (2022-08-23)

Bug Fixes

  • client-side log aggregator may fail to collect job logs (83be1f0)

0.5.2 (2022-08-23)

Bug Fixes

  • another fix for log aggregator server side ray head wait (a3ada23)

0.5.1 (2022-08-22)

Bug Fixes

  • improved logic for server-side ray head wait (4cc34ca)

0.5.0 (2022-08-22)

Bug Fixes

  • server-side log aggregator may try to query ray head before ready (6cda2ae)

Features

  • allow log aggregator image to be specified by env var (2ca584d)

0.4.8 (2022-08-22)

Bug Fixes

  • log aggregator deploy wait failure (c6c5305)

0.4.7 (2022-08-22)

Bug Fixes

  • log aggregator deploy should wait for it to be ready (bdc7a5c)

0.4.6 (2022-08-22)

Bug Fixes

  • disable node-stats.sh collection in log aggregator (8162626)

0.4.5 (2022-08-22)

Bug Fixes

  • ugh, ~/ -> /root for server-side log aggregator (7907824)

0.4.4 (2022-08-22)

Bug Fixes

  • parse error in ml/ray/run/logs/via-websocat (728b285)

0.4.3 (2022-08-22)

Bug Fixes

  • use ~/ instead of /home/node in log aggregator (efd950d)

0.4.2 (2022-08-22)

Bug Fixes

  • update log aggregator to use /home/node rather than /home/codeflare (a025f21)

0.4.1 (2022-08-22)

Bug Fixes

  • add nodes to the log aggregator Role (23c231d)
  • util/jobid should inherit JOB_ID from env (a45f054)

0.4.0 (2022-08-21)

Bug Fixes

  • increase cpu limit for ray head (303ee6d)

Features

  • remove untested knative guidebooks (71f0809)

0.3.3 (2022-08-19)

Bug Fixes

  • allow splicing in of namespace in ray helm chart (a22d837)

0.3.2 (2022-08-19)

Bug Fixes

  • allow in-cluster operation to specify a default context and namespace name (2de8f2e)

0.3.1 (2022-08-19)

Bug Fixes

  • kubernetes/context does not work when running in-cluster (6d78ad4)

0.3.0 (2022-08-19)

Features

  • support for in-kube-cluster start of ray cluster (8c32df5)

0.2.4 (2022-08-18)

Bug Fixes

  • add missing kubernetes/mcad/uninstall (7dc3858)
  • kubernetes/mcad/install is missing helm3 prereq (3cff3e3)
  • ml/ray/cluster/kubernetes/choose-pod-scheduler incorrectly sets up mcad preinstalled (4d8d88d)

0.2.3 (2022-08-17)

Bug Fixes

  • increase pod readiness wait timeout in ml/ray/start/kubernetes (c7ea666)

0.2.2 (2022-08-17)

Bug Fixes

  • ml/ray/start/kubernetes does not wait for ray head to be fully ready (ec10241)

0.2.1 (2022-08-17)

Bug Fixes

  • a few missing ${KUBE_CONTEXT_ARG} in mcad and the coscheduler validators (083d88c)
  • MCAD without coscheduler still installs the coscheduler (ad5cfa7)

0.2.0 (2022-08-17)

Features

  • allow mcad resource requirements to be dialed down when running CI (ceb0dd1)

0.1.1 (2022-08-17)

Bug Fixes

  • regression, ray operator is no longer namespaced (7bf4de4)

0.1.0 (2022-08-17)

Bug Fixes

  • another regression fix for helm chart (685864b)

Features

  • pin a branch for the default ray helm chart (7a4900d)

0.0.5 (2022-08-17)

Bug Fixes

  • typo from prior commit in helm chart (8b175e7)

0.0.4 (2022-08-17)

Bug Fixes

  • back out conditional installation of ray in chart (280d8fb)

0.0.3 (2022-08-17)

Bug Fixes

  • ml/ray/start/kubernetes helm chart does not set head cpu based on profile (c2a1507)

0.0.2 (2022-08-16)

Bug Fixes

  • regression, fractional cpus were ignored in ml/ray/start/kubernetes (612cc5b)