Commit fe01c25

Author: Omri Almog

SynapseAi 1.16.1 release
* Update dockerfiles with 1.16.1 content

1 parent 223a927 commit fe01c25

25 files changed: +309 -297 lines changed

README.md

Lines changed: 3 additions & 3 deletions
@@ -1,12 +1,12 @@
-# Gaudi Setup and Installation
+# Intel® Gaudi® Accelerator Setup and Installation

 <br />

 ---

 <br />

-By installing, copying, accessing, or using the software, you agree to be legally bound by the terms and conditions of the Habana software license agreement [defined here](https://habana.ai/habana-outbound-software-license-agreement/).
+By installing, copying, accessing, or using the software, you agree to be legally bound by the terms and conditions of the Intel Gaudi software license agreement [defined here](https://habana.ai/habana-outbound-software-license-agreement/).

 <br />

@@ -18,7 +18,7 @@ By installing, copying, accessing, or using the software, you agree to be legall

 Welcome to Setup and Installation GitHub Repository!

-The full installation documentation has been consolidated into the Installation Guide in our Habana Documentation. Please reference our [Habana docs](https://docs.habana.ai/en/latest/Installation_Guide/GAUDI_Installation_Guide.html) for the full installation guide.
+The full installation documentation has been consolidated into the Installation Guide in our Intel Gaudi Documentation. Please reference our [Intel Gaudi docs](https://docs.habana.ai/en/latest/Installation_Guide/GAUDI_Installation_Guide.html) for the full installation guide.

 This respository contains the following references:
 - dockerfiles -- Reference dockerfiles and build script to build Gaudi Docker images

dockerfiles/base/Dockerfile.rhel8.6

Lines changed: 3 additions & 3 deletions
@@ -18,13 +18,13 @@ RUN dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.n

 RUN echo "[appstream]" > /etc/yum.repos.d/CentOS-Linux-AppStream.repo && \
     echo "name=CentOS Linux 8 - AppStream" >> /etc/yum.repos.d/CentOS-Linux-AppStream.repo && \
-    echo "mirrorlist=http://mirrorlist.centos.org/?release=\$releasever-stream&arch=\$basearch&repo=AppStream&infra=\$infra" >> /etc/yum.repos.d/CentOS-Linux-AppStream.repo && \
+    echo "baseurl=https://vault.centos.org/8-stream/AppStream/x86_64/os" >> /etc/yum.repos.d/CentOS-Linux-AppStream.repo && \
     echo "gpgcheck=0" >> /etc/yum.repos.d/CentOS-Linux-AppStream.repo


 RUN echo "[BaseOS]" > /etc/yum.repos.d/CentOS-Linux-BaseOS.repo && \
     echo "name=CentOS Linux 8 - BaseOS" >> /etc/yum.repos.d/CentOS-Linux-BaseOS.repo && \
-    echo "mirrorlist=http://mirrorlist.centos.org/?release=\$releasever-stream&arch=\$basearch&repo=BaseOS&infra=\$infra" >> /etc/yum.repos.d/CentOS-Linux-BaseOS.repo && \
+    echo "baseurl=https://vault.centos.org/8-stream/BaseOS/x86_64/os" >> /etc/yum.repos.d/CentOS-Linux-BaseOS.repo && \
     echo "gpgcheck=0" >> /etc/yum.repos.d/CentOS-Linux-BaseOS.repo

 RUN dnf install -y \
@@ -77,7 +77,7 @@ RUN echo "[habanalabs]" > /etc/yum.repos.d/habanalabs.repo && \

 RUN echo "[powertools]" > /etc/yum.repos.d/powertools.repo && \
     echo "name=powertools" >> /etc/yum.repos.d/powertools.repo && \
-    echo "baseurl=http://mirror.centos.org/centos/8-stream/PowerTools/x86_64/os/" >> /etc/yum.repos.d/powertools.repo && \
+    echo "baseurl=https://vault.centos.org/8-stream/PowerTools/x86_64/os/" >> /etc/yum.repos.d/powertools.repo && \
     echo "gpgcheck=0" >> /etc/yum.repos.d/powertools.repo

 RUN dnf install -y habanalabs-rdma-core-"$VERSION"-"$REVISION".el8 \

dockerfiles/common.mk

Lines changed: 2 additions & 2 deletions
@@ -6,8 +6,8 @@ BUILD_DIR ?= $(CURDIR)/dockerbuild

 REPO_SERVER ?= vault.habana.ai
 PT_VERSION ?= 2.2.2
-RELEASE_VERSION ?= 1.16.0
-RELEASE_BUILD_ID ?= 526
+RELEASE_VERSION ?= 1.16.1
+RELEASE_BUILD_ID ?= 7

 BASE_IMAGE_URL ?= base-installer-$(BUILD_OS)
 IMAGE_URL = $(IMAGE_NAME):$(RELEASE_VERSION)-$(RELEASE_BUILD_ID)
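
The `IMAGE_URL` rule in `common.mk` builds the image tag from the release version and build ID, so this commit changes the tag suffix from `1.16.0-526` to `1.16.1-7`. A minimal Python sketch of that composition follows. The helper `image_url` and the image name `example-image` are illustrative assumptions; the diff does not show the value of `IMAGE_NAME`.

```python
def image_url(image_name: str, release_version: str, release_build_id: str) -> str:
    """Mirror the Makefile expression $(IMAGE_NAME):$(RELEASE_VERSION)-$(RELEASE_BUILD_ID)."""
    return f"{image_name}:{release_version}-{release_build_id}"


# "example-image" is a placeholder; the versions and build IDs come from the diff.
old_tag = image_url("example-image", "1.16.0", "526")
new_tag = image_url("example-image", "1.16.1", "7")
print(old_tag, "->", new_tag)
```

Because both variables are assigned with `?=`, they remain overridable from the environment or the `make` command line, and the resulting tag follows whatever values are supplied.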

utils/README.md

Lines changed: 30 additions & 30 deletions
@@ -1,6 +1,6 @@
 # Gaudi Utils

-By installing, copying, accessing, or using the software, you agree to be legally bound by the terms and conditions of the Habana software license agreement [defined here](https://habana.ai/habana-outbound-software-license-agreement/).
+By installing, copying, accessing, or using the software, you agree to be legally bound by the terms and conditions of the Intel Gaudi software license agreement [defined here](https://habana.ai/habana-outbound-software-license-agreement/).

 ## Table of Contents

@@ -14,100 +14,100 @@ By installing, copying, accessing, or using the software, you agree to be legall
 - [Status](#status)
 - [Set IP](#set-ip)
 - [Unset IP](#unset-ip)
-- [check\_habana\_framework\_env](#check_habana_framework_env)
-- [Habana Health Screen (HHS)](#habana-health-screen-hhs)
+- [check\_framework\_env](#check_framework_env)
+- [Intel Gaudi Health Screen (IGHS)](#intel-gaudi-health-screen-ighs)

 ## Overview

-Welcome to Gaudi's Util Scripts!
+Welcome to Intel Gaudi's Util Scripts!

-This folder contains some Gaudi utility scripts that users can access as reference.
+This folder contains some Intel Gaudi utility scripts that users can access as reference.

 ## manage_network_ifs

 Moved to habanalabs-qual Example: (/opt/habanalabs/qual/gaudi2/bin/manage_network_ifs.sh).

-This script can be used as reference to bring up, take down, set IPs, unset IPs and check for status of the Gaudi network interfaces.
+This script can be used as reference to bring up, take down, set IPs, unset IPs and check for status of the Intel Gaudi network interfaces.

 The following is the usage of the script:

 ```
 usage: ./manage_network_ifs.sh [options]

 options:
---up toggle up all Habana network interfaces
---down toggle down all Habana network interfaces
---status print status of all Habana network interfaces
---set-ip set IP for all internal Habana network interfaces
---unset-ip unset IP from all internal Habana network interfaces
+--up toggle up all Intel Gaudi network interfaces
+--down toggle down all Intel Gaudi network interfaces
+--status print status of all Intel Gaudi network interfaces
+--set-ip set IP for all internal Intel Gaudi network interfaces
+--unset-ip unset IP from all internal Intel Gaudi network interfaces
 -v, --verbose print more logs
 -h, --help print this help

 Note: Please run this script with one operation at a time
 ```
 ## Operations

-Before executing any operation, this script finds all the Habana network interfaces available on the system and stores the Habana interface information into a list.
-The list will be used for the operations. If no Habana network interface is found, the script will exit.
+Before executing any operation, this script finds all the Intel Gaudi network interfaces available on the system and stores the Intel Gaudi interface information into a list.
+The list will be used for the operations. If no Intel Gaudi network interface is found, the script will exit.

 ### Up

-Use the following command to bring all Habana network interfaces online:
+Use the following command to bring all Intel Gaudi network interfaces online:
 ```
 sudo manage_network_ifs.sh --up
 ```
-Once all the Habana interfaces are toggled up, IPs will be set by default. Please refer [Set Ip](#set-ip) for more detail. To unset IPs, run this script with '--unset-ip'
+Once all the Intel Gaudi interfaces are toggled up, IPs will be set by default. Please refer [Set Ip](#set-ip) for more detail. To unset IPs, run this script with '--unset-ip'
 ### Down

-Use the following command to bring all Habana network interfaces offline:
+Use the following command to bring all Intel Gaudi network interfaces offline:
 ```
 sudo manage_network_ifs.sh --down
 ```
 ### Status

-Print the current operational state of all Habana network interfaces such as how many ports are up/down:
+Print the current operational state of all Intel Gaudi network interfaces such as how many ports are up/down:
 ```
 sudo manage_network_ifs.sh --status
 ```
 ### Set IP

-Use the following command to assign a default IP for all Habana network interfaces:
+Use the following command to assign a default IP for all Intel Gaudi network interfaces:
 ```
 sudo manage_network_ifs.sh --set-ip
 ```
 Note: Default IPs are 192.168.100.1, 192.168.100.2, 192.168.100.3 and so on
 ### Unset IP

-Remove IP from all available Habana network interfaces by the following command:
+Remove IP from all available Intel Gaudi network interfaces by the following command:
 ```
 sudo manage_network_ifs.sh --unset-ip
 ```

-## check_habana_framework_env
+## check_framework_env

-This script can be used as reference to check the environment for running PyTorch on Habana.
+This script can be used as reference to check the environment for running PyTorch on Intel Gaudi.

 The following is the usage of the script:

 ```
-usage: check_habana_framework_env.py [-h] [--cards CARDS]
+usage: check_framework_env.py [-h] [--cards CARDS]

-Check health of HPUs for PyTorch
+Check health of Intel Gaudi for PyTorch

 optional arguments:
 -h, --help show this help message and exit
 --cards CARDS Set number of cards to test (default: 1)
 ```

-## Habana Health Screen (HHS)
+## Intel Gaudi Health Screen (IGHS)

-**Habana Health Screen** (HHS) tool has been developed to verify the cluster network health through a suite of diagnostic tests. The test
+**Intel Gaudi Health Screen** (IGHS) tool has been developed to verify the cluster network health through a suite of diagnostic tests. The test
 includes checking gaudi port status, running small workloads, and running standard collective operations arcoss multiple systems.

 ``` bash
 usage: screen.py [-h] [--initialize] [--screen] [--target-nodes TARGET_NODES]
 [--job-id JOB_ID] [--round ROUND] [--config CONFIG]
-[--hhs-check [{node,hccl-demo,none}]] [--node-write-report]
+[--ighs-check [{node,hccl-demo,none}]] [--node-write-report]
 [--node-name NODE_NAME] [--logs-dir LOGS_DIR]

 optional arguments:
@@ -119,18 +119,18 @@ optional arguments:
 --job-id JOB_ID Needed to identify hccl-demo running log
 --round ROUND Needed to identify hccl-demo running round log
 --config CONFIG Configuration file for Health Screener
---hhs-check [{node,hccl-demo,none}]
-Check HHS Status for Node (Ports status, Device Acquire Fail) or all_reduce
+--ighs-check [{node,hccl-demo,none}]
+Check IGHS Status for Node (Ports status, Device Acquire Fail, Device Temperature) or all_reduce
 (HCCL_DEMO between paris of nodes)
 --node-write-report Write Individual Node Health Report
 --node-name NODE_NAME Name of Node
 --logs-dir LOGS_DIR Output directory of health screen results
 ```

-To run a full HHS test, run the below command:
+To run a full IGHS test, run the below command:

 ``` bash
-# Creates HHS Report and screens clusters for any infected nodes.
+# Creates IGHS Report and screens clusters for any infected nodes.
 # Will check Level 1 and 2 by default
 python screen.py --initialize --screen
 ```
Lines changed: 7 additions & 7 deletions
@@ -15,7 +15,7 @@
 import concurrent.futures

 def parse_arguments():
-    parser = argparse.ArgumentParser(description="Check health of HPUs for PyTorch")
+    parser = argparse.ArgumentParser(description="Check health of Intel Gaudi for PyTorch")

     parser.add_argument("--cards",
                         default=1,
@@ -29,11 +29,11 @@ def parse_arguments():
     return args

 def pytorch_test(device_id=0):
-    """ Checks health of HPU through running a basic
-        PyTorch example on HPU
+    """ Checks health of Intel Gaudi through running a basic
+        PyTorch example on Intel Gaudi

     Args:
-        device_id (int, optional): ID of HPU. Defaults to 0.
+        device_id (int, optional): ID of Intel Gaudi. Defaults to 0.
     """

     os.environ["ID"] = str(device_id)
@@ -42,15 +42,15 @@ def pytorch_test(device_id=0):
         import torch
         import habana_frameworks.torch.core
     except Exception as e:
-        print(f"Card {device_id} Failed to initialize Habana PyTorch: {str(e)}")
+        print(f"Card {device_id} Failed to initialize Intel Gaudi PyTorch: {str(e)}")
         raise

     try:
         x = torch.tensor([2]).to('hpu')
         y = x + x

         assert y == 4, 'Sanity check failed: Wrong Add output'
-        assert 'hpu' in y.device.type.lower(), 'Sanity check failed: Operation not executed on Habana Device'
+        assert 'hpu' in y.device.type.lower(), 'Sanity check failed: Operation not executed on Intel Gaudi Card'
     except (RuntimeError, AssertionError) as e:
         print(f"Card {device_id} Failure: {e}")
         raise
@@ -64,7 +64,7 @@ def pytorch_test(device_id=0):
             for device_id, res in zip(range(args.cards), executor.map(pytorch_test, range(args.cards))):
                 print(f"Card {device_id} PASSED")
     except Exception as e:
-        print(f"Failed to initialize Habana, error: {str(e)}")
+        print(f"Failed to initialize on Intel Gaudi, error: {str(e)}")
         print(f"Check FAILED")
         exit(1)
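
The script fans `pytorch_test` out over the requested cards with `executor.map` and treats any raised exception as a failed check. The sketch below reproduces that control flow without Gaudi hardware. It rests on two assumptions: a thread pool is used because the diff does not show which executor the script creates, and `fake_card_test` and `run_checks` are hypothetical stand-ins rather than functions from the repository.

```python
import concurrent.futures


def fake_card_test(device_id=0):
    """Stand-in for pytorch_test: 'fails' on odd card IDs instead of touching hardware."""
    if device_id % 2 == 1:
        raise RuntimeError(f"simulated failure on card {device_id}")
    return device_id


def run_checks(cards, test_fn=fake_card_test):
    """Return True if every card passes, False as soon as one card's result raises."""
    try:
        with concurrent.futures.ThreadPoolExecutor() as executor:
            # executor.map yields results in submission order and re-raises
            # a worker's exception when that worker's result is reached.
            for device_id, _ in zip(range(cards), executor.map(test_fn, range(cards))):
                print(f"Card {device_id} PASSED")
    except Exception as e:
        print(f"Check FAILED: {e}")
        return False
    return True


print(run_checks(1))
print(run_checks(2))
```

Since results are consumed in order, a failure on one card stops the reporting loop at that card, even if later cards would have passed.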

utils/habana_health_screen/version.txt

Lines changed: 0 additions & 1 deletion
This file was deleted.
File renamed without changes.

utils/habana_health_screen/HabanaHealthReport.py renamed to utils/intel_gaudi_health_screen/HealthReport.py

Lines changed: 23 additions & 23 deletions
@@ -18,12 +18,12 @@

 import logging

-_logger = logging.getLogger("habana_health_screener")
+_logger = logging.getLogger("health_screener")

-class HabanaHealthReport():
+class HealthReport():

     def __init__(self, f_dir="tmp", report_name="health_report.csv"):
-        """ Initialize Habana Health Report Class
+        """ Initialize Health Report Class

         Args:
             f_dir (str, optional): File Directory to store Health Report logs and results. Defaults to "tmp".
@@ -83,8 +83,8 @@ def write_rows(self, cards=list(), node_id="", data=list(), level=1):
         """ Write health check results to Health Report CSV. Can write multiple rows at once

         Args:
-            cards ([HCard], optional): Level 1 HCards to report about. Defaults to list().
-            node_id (str, optional): Node ID of HCards. Defaults to "".
+            cards ([IGCard], optional): Level 1 IGCards to report about. Defaults to list().
+            node_id (str, optional): Node ID of IGCards. Defaults to "".
             data (_type_, optional): Health Report CSV Row data. Defaults to list().
             level (int, optional): Health Screen Level. Defaults to 1.
         """
@@ -118,12 +118,12 @@ def update_health_report(self, detected_nodes, infected_nodes, missing_nodes):
             infected_nodes (list[str]): List of infected node_ids
             missing_nodes (list[str]): List of missing node_ids
         """
-        tempfile = NamedTemporaryFile(mode='w', delete=False)
+        temp_file = NamedTemporaryFile(mode='w', delete=False)
         detected_nodes_cp = detected_nodes.copy()

-        with open(self.f_path, 'r', newline='') as csvfile, tempfile:
-            reader = csv.DictReader(csvfile)
-            writer = csv.DictWriter(tempfile, fieldnames=self.header)
+        with open(self.f_path, 'r', newline='') as csv_file, temp_file:
+            reader = csv.DictReader(csv_file)
+            writer = csv.DictWriter(temp_file, fieldnames=self.header)

             writer.writeheader()
             for row in reader:
@@ -148,22 +148,22 @@ def update_health_report(self, detected_nodes, infected_nodes, missing_nodes):
             for n in missing_nodes:
                 writer.writerow({"node_id": n, "multi_node_fail": True, "missing": True})

-        shutil.move(tempfile.name, self.f_path)
+        shutil.move(temp_file.name, self.f_path)

     def update_hccl_demo_health_report(self, round, all_node_pairs, multi_node_fail, qpc_fail, missing_nodes):
         """ Update health_report with hccl_demo results, based on infected_nodes.

         Args:
-            all_node_pairs (list[str]): List of all node pairs reported by Level 2 round
+            all_node_pairs (list[str]): List of all Node Pairs reported by Level 2 round
             multi_node_fail (list[str]): List of Node Pairs that failed HCCL_Demo Test
             qpc_fail (list[str]): List of Node Pairs that failed HCCL_Demo Test due to QPC error
             missing_nodes (list[str]): List of Node Pairs that couldn't run HCCL_Demo
         """
-        tempfile = NamedTemporaryFile(mode='w', delete=False)
+        temp_file = NamedTemporaryFile(mode='w', delete=False)

-        with open(self.f_path_hccl_demo, 'r', newline='') as csvfile, tempfile:
-            reader = csv.DictReader(csvfile)
-            writer = csv.DictWriter(tempfile, fieldnames=self.header_hccl_demo, extrasaction='ignore')
+        with open(self.f_path_hccl_demo, 'r', newline='') as csv_file, temp_file:
+            reader = csv.DictReader(csv_file)
+            writer = csv.DictWriter(temp_file, fieldnames=self.header_hccl_demo, extrasaction='ignore')

             writer.writeheader()
             for row in reader:
@@ -181,7 +181,7 @@ def update_hccl_demo_health_report(self, round, all_node_pairs, multi_node_fail,
             if len(all_node_pairs):
                 writer.writerows(list(all_node_pairs.values()))

-        shutil.move(tempfile.name, self.f_path_hccl_demo)
+        shutil.move(temp_file.name, self.f_path_hccl_demo)

     def check_screen_complete(self, num_nodes, hccl_demo=False, round=0):
         """ Check on status of Health Screen Check.
@@ -306,11 +306,11 @@ def gather_health_report(self, level, remote_path, hosts):
         """ Gathers Health Report from all hosts

         Args:
-            level (str): HHS Level
-            remote_path (str): Remote Destintation of HHS Report
-            hosts (list, optional): List of IP Addresses to gather HHS Reports
+            level (str): IGHS Level
+            remote_path (str): Remote Destintation of IGHS Report
+            hosts (list, optional): List of IP Addresses to gather IGHS Reports
         """
-        copy_files(src=f"{remote_path}/habana_health_screen/{self.f_dir}/L{level}",
+        copy_files(src=f"{remote_path}/intel_gaudi_health_screen/{self.f_dir}/L{level}",
                    dst=f"{self.f_dir}",
                    hosts=hosts,
                    to_remote=False)
@@ -319,16 +319,16 @@ def consolidate_health_report(self, level, report_dir):
         """ Consolidates the health_report_*.csv from worker pods into a single master csv file

         Args:
-            level (str): HHS Level
+            level (str): IGHS Level
             report_dir (str): Directory of CSV files to merge
         """
         data = list()
         path = f"{report_dir}/L{level}/health_report_*.csv"
         csv_files = glob.glob(path)

         for f in csv_files:
-            with open(f, 'r', newline='') as csvfile:
-                reader = csv.DictReader(csvfile)
+            with open(f, 'r', newline='') as csv_file:
+                reader = csv.DictReader(csv_file)
                 for row in reader:
                     data.append(row)
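
Two patterns recur in the `HealthReport.py` hunks: rewriting a CSV through a `NamedTemporaryFile` that is then moved over the original, and merging every `health_report_*.csv` under a level directory. The standalone sketch below illustrates both under stated assumptions. `mark_missing` and `consolidate` are hypothetical names, the two-column header is invented for the demo, and passing `newline=""` to the temporary file is a choice made here that the diff does not show.

```python
import csv
import glob
import os
import shutil
from tempfile import NamedTemporaryFile, TemporaryDirectory


def mark_missing(report_path, header, missing_nodes):
    """Rewrite report_path via a temp file, appending one row per missing node."""
    temp_file = NamedTemporaryFile(mode="w", delete=False, newline="")
    with open(report_path, "r", newline="") as csv_file, temp_file:
        reader = csv.DictReader(csv_file)
        writer = csv.DictWriter(temp_file, fieldnames=header)
        writer.writeheader()
        for row in reader:
            writer.writerow(row)
        for n in missing_nodes:
            writer.writerow({"node_id": n, "missing": True})
    # Replace the original only after the temp file is fully written and closed.
    shutil.move(temp_file.name, report_path)


def consolidate(report_dir, level):
    """Collect rows from every health_report_*.csv under report_dir/L<level>."""
    data = []
    for f in sorted(glob.glob(f"{report_dir}/L{level}/health_report_*.csv")):
        with open(f, "r", newline="") as csv_file:
            data.extend(csv.DictReader(csv_file))
    return data


header = ["node_id", "missing"]  # demo header, not the repository's real columns
with TemporaryDirectory() as tmp:
    os.makedirs(f"{tmp}/L1")
    for name in ("a", "b"):
        with open(f"{tmp}/L1/health_report_{name}.csv", "w", newline="") as fh:
            w = csv.DictWriter(fh, fieldnames=header)
            w.writeheader()
            w.writerow({"node_id": f"node-{name}", "missing": False})
    mark_missing(f"{tmp}/L1/health_report_a.csv", header, ["node-c"])
    rows = consolidate(tmp, 1)

print([r["node_id"] for r in rows])
```

Writing to a separate file and moving it into place means a reader never sees a half-written report. Note that `csv.DictReader` returns every value as a string, so flags such as `missing` come back as `"True"` or `"False"` rather than booleans.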
