
Companion Session to the Software Project - Tutorial

Disclaimer

This session is an unofficial offer. It is not related to any credit points, grading criteria, etc.

Date & Room

3rd May 2022, 13:15–14:45, SR24 INF325 (or CIP pool, if available)

Cluster Login

[INFO] for new cluster users
Please subscribe to the mailing list: https://lists.cl.uni-heidelberg.de/listinfo/cluster-users

[ATTENTION] Please check whether you can log in to the cluster:


      $ ssh <username>@cluster.cl.uni-heidelberg.de

If not, please send an email to Gruppe Technik (ohta@cl.uni-heidelberg.de) with the following information:

Your Name
Your ICL email address (@cl.uni-heidelberg.de)
Why you need cluster access (tell them you are participating in SWP SS22)

They will grant cluster access to the participants of the software project.
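
The login check above can also be scripted. The following is only a minimal sketch: it wraps the same ssh call in Python and asks the remote side to run the no-op command `true` instead of opening a shell. The function names and the user name "jdoe" are hypothetical. The sketch assumes key-based authentication is already set up; because `BatchMode=yes` disables password prompts, a failure may simply mean that no SSH key is installed, not that your account is missing.

```python
import shutil
import subprocess
from typing import List

# Host name taken from the ssh command in the section above.
CLUSTER_HOST = "cluster.cl.uni-heidelberg.de"


def build_login_check(username: str, timeout_s: int = 10) -> List[str]:
    """Return an ssh command that tests the login without opening a shell."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # fail instead of prompting for a password
        "-o", f"ConnectTimeout={timeout_s}",  # give up connecting after timeout_s seconds
        f"{username}@{CLUSTER_HOST}",
        "true",                               # remote no-op; exit status 0 means success
    ]


def can_login(username: str) -> bool:
    """Run the check. Assumes key-based authentication (an assumption, see above)."""
    if shutil.which("ssh") is None:
        return False  # no ssh client found on this machine
    result = subprocess.run(build_login_check(username), capture_output=True)
    return result.returncode == 0


if __name__ == "__main__":
    # "jdoe" is a placeholder user name, not a real account.
    print(" ".join(build_login_check("jdoe")))
```

Calling `can_login("<username>")` on your own machine returns True only if the non-interactive login succeeds.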

Preparation

The goal of the session is to share your experience with the other participants. Since peer-to-peer learning is one of the learning objectives of the software project, I decided to hold this session in "inverted classroom" style. I list several pre-class materials here, and in class I'll ask the questions listed below. Please prepare your own answers to these questions. I don't expect correct answers; an answer can also be something like: "I got this error message: XXX when I executed the command YYY. To solve this, I tried the option proposed on Stack Overflow..."

I look forward to your active participation!

Topic 1: Cluster

[Preparation Workload: 90 min.]

First, go through the following tutorials:

Slurm Tutorial
GPU How-to
Slides from Ressourcenvorkurs WS21 Day 4 (Lecturer: Hiko Schamoni)
* For virtual environments, though, I personally recommend conda rather than venv.

Questions

How can you log in to the Cluster? Which command do you use?
You have logged in to the Cluster. Are Slurm commands (srun, squeue, scancel, etc.) available in your environment? If not, what should you do to make them available?
Which partitions are there? Of these, which partitions are dedicated to `students` (not to `mitarb`)? Which node is currently occupied by whom?
How many jobs are currently running on the partition gpulong?
Allocate resources for an interactive job with the salloc command. Provide the following options:
- The task can run max. 10 min.
- The task needs 100 MB of memory.
- The task needs 4 CPUs.
Check whether your job appears in the job queue. Which node is assigned to you?
Call the srun hostname command. What does the returned value mean?
Call the hostname command without srun. Why is the result different from that of srun hostname?
Check which Python version is currently available on the allocated node.
Call the sacct command to monitor your job. Which ID is assigned to your job?
Revoke the allocated resources. Check that your job no longer appears in the job queue.
Write a bash script that prints the visible GPU devices of one gpushort node.
Execute the bash script: ~~first, reserve resources with salloc, then~~ call the script with sbatch and check the output logs. How do you know whether your jobs have completed or not?
How can you enter the console shell of a GPU node?
You entered the shell of a GPU node. Which CUDA version is currently available on the node?
You entered the shell of a GPU node. Install PyTorch in your virtual environment, and check whether `torch.cuda.is_available()` returns True in Python.
You entered the shell of a GPU node, and started a job that requires a GPU. How can you monitor the GPU memory usage?
You entered the shell of a GPU node, and started a job that takes 2 days. How can you keep the job running even after you have logged out of the cluster?
- Hint: terminal multiplexers, e.g. tmux, screen, or byobu, are globally installed on the Cluster!
How do you exit the console of the node and go back to the `login` node?
Install jupyter (jupyterlab) in your virtual environment. Start a jupyter notebook server on a cluster node. Open it in a web browser on your local computer. Import PyTorch and check its version number.
Shut down the jupyter server. Make sure the port you used to connect to jupyter is now free.
[EXTRA] You cannot find a free slot on the Cluster GPUs. Is there any other way to access GPUs?
- Hint: Google Colab (local runtime), Amazon SageMaker Studio Lab, bwUniCluster (the Dean of the Institute will grant access; please describe in the online application form which group/topic you are working on in Software Project SS22.)
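
For the GPU-related questions above, a small Python script can collect the relevant facts in one place. This is only a sketch under stated assumptions, not the official way to do it: the function names are hypothetical, `nvidia-smi -L` is assumed to be installed on the GPU nodes, and PyTorch is only queried if it is importable in your environment. Slurm usually sets the CUDA_VISIBLE_DEVICES variable for jobs that request GPUs, but that depends on the cluster configuration.

```python
import os
import shutil
import subprocess
from typing import Optional


def list_gpus_with_nvidia_smi() -> Optional[str]:
    """Return the output of `nvidia-smi -L`, or None if the tool is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.stdout.strip() if result.returncode == 0 else None


def torch_cuda_available() -> Optional[bool]:
    """Return torch.cuda.is_available(), or None if PyTorch is not installed."""
    try:
        import torch  # imported lazily so the script also runs without PyTorch
    except ImportError:
        return None
    return bool(torch.cuda.is_available())


def gpu_report() -> dict:
    """Collect what the current process can see of the GPUs."""
    return {
        # Unset (None) outside of a GPU job on most setups.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "nvidia_smi": list_gpus_with_nvidia_smi(),
        "torch_cuda_available": torch_cuda_available(),
    }


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

Saved under a name of your choice (for example gpu_report.py, a hypothetical file name), the script can be run on the login node and again inside a job via srun; comparing the two outputs shows what the allocation changes.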

Topic 2: Remote development / useful tools

[Preparation Workload: 60 min.]

Questions

Does your IDE support remote development? If so, set up the connection to the Cluster.
- Hint: VSCode, PyCharm (you may be able to apply for an educational license to get free access to professional features)
You have your own data on your laptop. Which command will transfer the data to the Cluster? Also try a file transfer tool with a GUI.
- Hint: FileZilla, WinSCP, or integrated in IDEs
How can you edit a file stored on the Cluster without explicitly downloading it to your local computer?
You keep getting a "Quota exceeded" error. What can you do to avoid it?
- Hint: Read our internal FAQ wiki. (Project directories on the Cluster: /scratch and /workspace)
Create a repository in the GitLab instance hosted by the institute (either an empty one or one imported from another source).
You don't want to create a repo under your own namespace, but you want to have a shared namespace for your group. What can you do?
- Hint: GitLab docs Groups.
Say you found a publicly available codebase somewhere (GitHub, Bitbucket, etc.). Clone your CL GitLab repo to your local machine, and add the public repo as a remote with the name "upstream". That is, you will have two remote repos, "origin" and "upstream".
- Hint: Managing remote repositories
During development, you've seen some changes in the "upstream". How can you bring these changes into your local repo without overwriting your own work?
Have you worked with the Python debugger? With Python unittest? *This will be skipped in the session...
- I recommend the "Entwicklertools" section from Ressourcenvorkurs WS21, which comes with great exercises.
Have you tried any linters or formatters, such as pylint, flake8, black, isort? *This will be skipped in the session...
[EXTRA] You want to install XXX on the cluster, and the instructions say you need `sudo` to install it. How can you avoid sudo? (e.g. sentencepiece)
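
The "origin"/"upstream" workflow from the questions above can be sketched as a short command sequence. The helper below only builds the git commands as lists and does not run anything. The function name, the example URL, and the default branch name "main" are assumptions for illustration. Merging is one option among several (rebasing is another); a merge keeps your local commits and reports conflicts instead of silently overwriting them.

```python
import shlex
from typing import List


def upstream_sync_commands(upstream_url: str, branch: str = "main") -> List[List[str]]:
    """Build the git commands to track a public repo as "upstream" and merge its changes.

    Run them inside a local clone whose "origin" is your own GitLab repo.
    """
    return [
        # Register the public repository as a second remote next to "origin".
        ["git", "remote", "add", "upstream", upstream_url],
        # Show both remotes to confirm the setup.
        ["git", "remote", "-v"],
        # Download the upstream history without touching local branches.
        ["git", "fetch", "upstream"],
        # Merge the upstream branch into the currently checked-out branch.
        ["git", "merge", f"upstream/{branch}"],
    ]


if __name__ == "__main__":
    # Placeholder URL, not a real repository.
    for command in upstream_sync_commands("https://example.com/public/repo.git"):
        print(shlex.join(command))
```

Printing the commands first, rather than executing them, lets you check the remote name and branch before anything changes in your repository.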

Topic 3: Frameworks

[Preparation Workload: 30 min.]

Choose one framework / package below, and run a quick-start tutorial on the Cluster. Most of them are provided in Jupyter Notebook format. You don't have to write any code! Just run the provided notebook as is. Do you encounter any errors?

PyTorch: seq2seq translation tutorial
PyTorch Lightning: Text Transformers
fairseq: Neural Machine Translation
- Choose one pre-trained model, and the corresponding dataset. Run inference and check if you can reproduce the score presented in the paper.
Hugging Face: Quicktour, Task summary
TensorFlow: Transformer for language understanding
fastai: Transfer learning in text
Any publicly available quick-start tutorial (potentially) related to your project

Imagine you are writing a quick-start tutorial for the code you develop in the software project. Do you think the tutorial you ran above would be a good template?

Topic 4: Work in a team

Documentation
- Take meeting notes every time you meet. Share them online.
- Set up one fixed access point for everyone.
- Use pictures, graphs, etc.
- Start writing TODAY!
Planning
- Plan iterative, incremental cycles. Linear project management often fails!
- Simple first, easy first.
- A working demo is more convincing than a fancy theory.
Coding
- Do code reviews.
- Keep a pull request small.
- Make a commit self-explanatory.
- cf.) the 15 min. rule of software development
- cf.) Rubber-duck debugging
Communication
- cf.) Psychological safety
- cf.) OKR: Objectives and Key Results

Good Luck!

Are there any other topics that should be covered in the session? Please write me an email! (ohta@cl.uni-heidelberg.de)