Sciware

Flatiron Summer Workshops

https://sciware.flatironinstitute.org/40_SummerIntro

Today’s agenda

Cluster overview

<img height=80% width=80% margin="10px auto" class="plain" src="assets/cluster/overview_1.png">

Rusty

https://wiki.flatironinstitute.org/SCC/Hardware/Rusty

Popeye

https://wiki.flatironinstitute.org/SCC/Hardware/Popeye

Remote access

https://wiki.flatironinstitute.org/SCC/RemoteConnect

Modules & software

Overview

module avail: Core

------------- Core --------------
gcc/10.5.0                
gcc/11.4.0               (D)
gcc/12.2.0
openblas/single-0.3.26   (S,L)
openblas/threaded-0.3.26 (S,D)
python/3.10.13           (D)
python/3.11.7
...
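(The flags are Lmod's standard markers: D = default version, L = currently loaded, S = sticky)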

module load or ml
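Load the default version by name, request a specific version, or use the ml shorthand (module names from the listing above):

module load gcc python      # default versions (D)
module load python/3.11.7   # a specific version
ml gcc python               # "ml" is short for "module load"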

module show
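module show prints what loading a module actually does, i.e. the paths and variables it sets. For example:

module show openblas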

Other module commands
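A few other frequently used commands (all standard Lmod):

module list           # show currently loaded modules
module unload gcc     # unload one module
module reset          # go back to the default set
module spider hdf5    # search the full module tree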

module releases

modules/2.3-20240529 (S,L,D)
modules/2.4-beta2    (S)

Python environments

Python packages: venv

module load python
python3 -m venv --system-site-packages ~/myvenv
source ~/myvenv/bin/activate
pip install ...

Too much typing

Put common sets of modules in a script

# File: ~/mymods
module reset
module load gcc python hdf5 git
source ~/myvenv/bin/activate

And “source” it when needed:

source ~/mymods

Other software

If you need something not in the base system, modules, or pip:

Jupyter
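One way to use Jupyter on the cluster is an SSH tunnel. A minimal sketch (the port and the "rusty" host alias are placeholders; the wiki describes the recommended FI setup):

# on a cluster node:
jupyter notebook --no-browser --port=8888

# on your laptop, forward the port, then open http://localhost:8888
ssh -N -L 8888:localhost:8888 rusty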

VS code remote
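VS Code's Remote-SSH extension reuses your normal SSH configuration. A sketch of a ~/.ssh/config entry (host and user names are placeholders; see the RemoteConnect wiki page above for the actual setup):

Host rusty
    HostName rusty.flatironinstitute.org
    User yourusername
    ProxyJump gateway.flatironinstitute.org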

Break

Survey

Running Jobs on the FI Cluster

Slurm and Parallelism

How to run jobs efficiently on Flatiron’s clusters

Slurm


https://wiki.flatironinstitute.org/SCC/Software/Slurm

Batch file

Write a batch file called myjob.sbatch that specifies the resources needed.

```bash
#!/bin/bash
#SBATCH --partition=genx    # Non-exclusive partition
#SBATCH --ntasks=1          # Run one instance
#SBATCH --cpus-per-task=1   # Cores?
#SBATCH --mem=1G            # Memory?
#SBATCH --time=00:10:00     # Time? (10 minutes)

echo "Starting..."
hostname
sleep 1m
echo "Done!"
```

Submitting a job
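Submit the batch file with sbatch, then check on it with squeue (both standard Slurm commands):

sbatch myjob.sbatch
squeue --me           # list your queued and running jobs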

Where is my output?
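By default Slurm writes the job's stdout and stderr to slurm-<jobid>.out in the directory you submitted from. You can pick the file yourself:

#SBATCH --output=myjob_%j.log   # %j expands to the job ID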

Loading environments

Good practice is to load the modules you need in the script:

#!/bin/bash
#SBATCH ...

module reset
module load gcc python
source ~/myvenv/bin/activate

# (or: source ~/mymods)

python3 myscript.py

Running Jobs in Parallel

What about multiple things?

Let’s say we have 10 files, each using 1 GB and 1 CPU

```bash
#!/bin/bash
#SBATCH --mem=10G           # Request 10x the memory
#SBATCH --time=02:00:00     # 2 hours
#SBATCH --ntasks=10         # Run 10 tasks
#SBATCH --cpus-per-task=1   # Request 1 CPU
#SBATCH --partition=genx

# this would run 10 identical tasks:
#srun python3 myjob.py

# instead run different things:
srun -n1 python3 myjob.py data1.hdf5 &
srun -n1 python3 myjob.py data2.hdf5 &
srun -n1 python3 myjob.py data3.hdf5 &
...
wait  # wait for all background tasks to complete
```

Slurm Tip: Estimating Resource Requirements

  1. Guess based on your knowledge of the program: think about the sizes of big arrays and any files being read.
  2. Run a test job.
  3. While the job is running, check squeue, ssh to the node, and run htop (as sketched below).
  4. After the test job finishes, check its actual usage with:
    seff <jobid>
    • Job Wall-clock time: how long it took in “real world” time; corresponds to #SBATCH -t
    • Memory Utilized: maximum amount of memory used; corresponds to #SBATCH --mem
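Putting steps 2-4 together (a sketch; the node name and job ID are placeholders):

sbatch myjob.sbatch     # submit a test job
squeue --me             # note the job ID and the node it is running on
ssh worker1234          # hop onto that node...
htop                    # ...and watch CPU and memory use live
seff 1234567            # after the job finishes, check actual usage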

Slurm Tip: Choosing a Partition (CPUs)
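As a rule of thumb (partition names as of this writing; see the wiki for the current list): genx shares nodes, so it suits serial and small jobs like the examples above, while exclusive partitions such as gen hand out whole nodes and should only be used by jobs that can fill one.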

Running Jobs in Parallel

disBatch
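disBatch runs a file of independent shell commands as a single Slurm job, handing tasks out to the allocated cores as they free up. A minimal sketch (file name and task count are illustrative; see the disBatch documentation for details):

# File: Tasks (one independent shell command per line)
python3 myjob.py data1.hdf5
python3 myjob.py data2.hdf5
python3 myjob.py data3.hdf5

Then submit the whole task file as one job:

module load disBatch
sbatch -p genx -n 10 disBatch Tasks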


Each finished task appears as a line in the disBatch status file, recording (among other fields) the task index, the node it ran on, its exit code, and its elapsed time:

0	1	-1	worker032	8016	0	10.0486528873	1458660919.78	1458660929.83	0	""	0	""	'...'
1	2	-1	worker032	8017	0	10.0486528873	1458660919.78	1458660929.83	0	""	0	""	'...'

Slurm Tip: Tasks and threads
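Many programs are threaded (e.g. via OpenMP) rather than split into separate tasks. A common pattern (a sketch; the program name is a placeholder) is one task with several CPUs, with the thread count matched to the allocation:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_threaded_program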

GPUs
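GPUs must be requested explicitly. A minimal sketch (partition name and GPU count are illustrative; see the wiki for the current GPU nodes):

#SBATCH --partition=gpu
#SBATCH --gpus=1

nvidia-smi    # confirm a GPU is visible to the job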

Summary of Parallel Jobs


File Systems

https://wiki.flatironinstitute.org/SCC/Hardware/Storage

Home Directory

Your home directory is for code, notes, and documentation.

It is NOT for:

  1. Large data sets downloaded from other sites
  2. Intermediate files generated and then deleted during the course of a computation
  3. Large output files

You are limited to 900,000 files and 450 GB (if you exceed these limits you will not be able to log in)

Backups (aka snapshots)

If you accidentally delete some files, you can access backups through the .snapshots directory like this:
  
ls .snapshots
cp -a .snapshots/@GMT-2021.09.13-10.00.55/lost_file lost_file.restored
  
  • .snapshots is a special invisible directory and won't autocomplete
  • Snapshots happen once a day and are kept for 3-4 weeks
  • There are separate long-term backups of home if needed (years)

Ceph

* .snap is coming soon

Summary: Persistent storage

Monitoring Usage: /mnt/home

View a usage summary:


module load fi-utils
fi-quota

To track down large files or file counts use:


ncdu -x --show-itemcount ~

Monitoring Usage: /mnt/ceph

    
module load fi-utils
cephdu
    

Local Scratch
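Node-local disks are fast but are wiped when the job ends. A common staging pattern (a sketch, assuming the job provides a node-local temporary directory via $TMPDIR; check the storage wiki page above for the actual paths):

cp /mnt/ceph/users/$USER/big_input.hdf5 $TMPDIR/   # stage input to local disk
cd $TMPDIR
python3 myscript.py big_input.hdf5                 # do I/O-heavy work locally
cp results.hdf5 /mnt/ceph/users/$USER/             # copy results back before the job ends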

Summary: Temporary storage

Planned usage


module load fi-utils
fi-usage

Survey

Questions & Help

<img height=80% width=80% src="assets/cluster/help.gif">