Accompanying Session for the Software Project - Tutorial
Disclaimer
This session is an unofficial offer. It is not related to any credit points, grading criteria, etc.
Date & Room
3rd May 2022, 13:15–14:45, SR24 INF325 (or CIP pool, if available)
Cluster Login
[INFO] for new cluster users
Please subscribe to the mailing list:
https://lists.cl.uni-heidelberg.de/listinfo/cluster-users
[ATTENTION] Please check if you can log in to the cluster:
$ ssh <username>@cluster.cl.uni-heidelberg.de
ohta@cl.uni-heidelberg.de
) with the following information:
- Your Name
- Your ICL email address (@cl.uni-heidelberg.de)
- What you need cluster access for (tell them you are participating in SWP SS22.)
Preparation
The goal of the session is to share your experience with other participants. Since peer-to-peer learning is part of the learning objectives of the software project, I decided to hold this session in "inverted classroom" style. I provide several pre-class materials here, and in class I'll ask the questions listed below. Please prepare your own answers to these questions. I don't expect a correct answer; something like this is perfectly fine: "I got this error message: XXX when I executed the command YYY. To solve this, I tried the option proposed on Stack Overflow..."
I look forward to your active participation!
Topic 1: Cluster
- Slurm Tutorial
- GPU How-to
- Slides
from Ressourcenvorkurs WS21 Day 4 (Lecturer: Hiko Schamoni)
* For virtual environments, I personally recommend conda, not venv, though.
Questions
- How can you log in to the Cluster? Which command?
- You logged in to the Cluster. Are Slurm commands (`srun`, `squeue`, `scancel`, etc.) available in your environment? If not, what should you do to make these available?
- Which partitions are there? Out of them, which partitions are dedicated to `students` (not for `mitarb`)? Which node is currently occupied by whom?
- How many jobs are currently running on the partition `gpulong`?
- Allocate resources for an interactive job with the `salloc` command. Provide the following options:
- The task can run max. 10 min.
- The task needs 100MB memory.
- The task needs 4 CPUs.
- Check if your job appears in the job queue. Which node is assigned to you?
- Call the `srun hostname` command. What does the returned value mean?
- Call the `hostname` command without `srun`. Why is it different from `srun hostname`?
- Confirm the python version currently available on the allocated node.
- Call the `sacct` command to monitor your job. Which ID is assigned to your job?
- Revoke the allocated resources. Check that your job no longer appears in the job queue.
- Write a bash script that prints the visible GPU devices of one `gpushort` node.
- Execute the bash script: first, reserve resources with `salloc`, then call the script with `sbatch`, check the output logs. How do you know whether your jobs are completed or not?
- How can you enter the console shell of a GPU node?
- You entered the shell of a GPU node. Which CUDA version is currently available on the node?
- You entered the shell of a GPU node. Install Pytorch in your virtual environment, and check if `torch.cuda.is_available()` returns True in Python.
- You entered the shell of a GPU node, and started a GPU-required job. How can you monitor the GPU memory usage?
- You entered the shell of a GPU node, and started a job which takes 2 days. How can you keep the job running even after you have logged out of the cluster?
- How do you exit the console of the node and go back to the `login` node?
- Install jupyter (jupyterlab) in your virtual environment. Start a jupyter notebook server on a cluster node. Open it in a web browser on your local computer. Import pytorch and check its version number.
- Shut down the jupyter server. Make sure the port you used to connect to jupyter is free now.
- [EXTRA] You can find no free slot on the Cluster GPUs. Is there any other possibility to access GPUs?
- Hint: Google Colab (local runtime), Amazon SageMaker Studio Lab, bwUniCluster (Dean of the Institute will grant access. Please describe in the online application form which group/topic you are working on in Software Project SS22.)
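For the batch-script question above, one possible starting point is sketched below. It is only a sketch under assumptions: the job name, the output file pattern, and the `--gres=gpu:1` request are illustrative and may not match how GPUs are configured on this cluster, so please compare with the Slurm tutorial and the GPU how-to before relying on it.

```shell
#!/bin/bash
# Hypothetical job name, chosen only for this example.
#SBATCH --job-name=gpu-check
# Partition named in the question above.
#SBATCH --partition=gpushort
# Assumption: GPUs are requested as the generic resource "gpu".
#SBATCH --gres=gpu:1
# Wall-clock limit of 10 minutes and 100 MB of memory.
#SBATCH --time=00:10:00
#SBATCH --mem=100M
# %j is replaced by the Slurm job ID; the file name itself is an assumption.
#SBATCH --output=gpu-check-%j.out

# Print the node name and the GPU devices visible to this job.
print_gpu_info() {
    echo "Node: $(hostname)"
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-not-set}"
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --list-gpus || echo "nvidia-smi could not query any GPU"
    else
        echo "nvidia-smi not found on this node"
    fi
}

print_gpu_info
```

Saved under a hypothetical name such as `gpu_check.sh`, the script could be submitted with `sbatch gpu_check.sh`. While the job is pending or running it should be listed by `squeue`, and the printed lines end up in the file given by `--output`.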
Topic 2: Remote development / useful tools
Questions
- Does your IDE support remote development? If yes, set up the connection to the Cluster.
- You have your own data on your laptop. Which command will transfer the data to the Cluster? Try a file transfer tool with GUI.
- How can you edit a file stored on the Cluster without explicitly downloading it to your local computer?
- You keep getting a "Quota exceeded" error. What can you do to avoid it?
- Hint: Read our internal FAQ wiki. (Project directories on the Cluster: `/scratch` and `/workspace`)
- Create a repository in the GitLab instance hosted by the institute (either an empty one or one imported from another source).
- You don't want to create a repo under your namespace, but you want to have a shared namespace for your group. What can you do?
- Hint: GitLab docs Groups.
- Say you found a publicly available codebase somewhere (github, bitbucket, etc.). Clone your CL gitlab repo to your local machine, and add the public repo under the name "upstream". That is, you will have two remote repos, "origin" and "upstream".
- During development, you've seen some changes in the "upstream". How can you take these changes into your local repo without overwriting your own work?
- Have you worked with the python debugger? With python unittest? *this will be skipped in the session...
- I recommend "Entwicklertools" section from Ressourcenvorkurs WS21, accompanied by great exercises.
- Have you tried any linter, such as pylint, flake8, black, isort? *this will be skipped in the session...
- [EXTRA] You want to install XXX on the cluster, and the instructions say you need `sudo` to install it. How can you avoid sudo? (e.g. sentencepiece)
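The two-remote setup from the questions above ("origin" plus "upstream") can be tried out safely without any server. The sketch below uses throwaway local repositories as stand-ins: the directory names `public`, `gitlab.git` and `local`, the helper `g`, and the branch name `main` are all hypothetical. Fetch plus merge is only one way to take over upstream changes; rebasing is another.

```shell
#!/bin/sh
set -e

work="$(mktemp -d)"

# Helper so commits work even without a global git identity (demo values).
g() { git -c user.name="Demo" -c user.email="demo@example.org" "$@"; }

# 1) Stand-in for the public codebase (github, bitbucket, ...).
g init --quiet "$work/public"
g -C "$work/public" symbolic-ref HEAD refs/heads/main
echo "v1" > "$work/public/README"
g -C "$work/public" add README
g -C "$work/public" commit --quiet -m "Initial public release"

# 2) Stand-in for your CL gitlab repo, imported from the public one.
g clone --quiet --bare "$work/public" "$work/gitlab.git"

# 3) Clone "your" repo; the clone source automatically becomes "origin".
g clone --quiet "$work/gitlab.git" "$work/local"
g -C "$work/local" remote add upstream "$work/public"

# 4) Your own work and new upstream work happen in parallel.
echo "notes" > "$work/local/NOTES"
g -C "$work/local" add NOTES
g -C "$work/local" commit --quiet -m "Add project notes"
echo "v2" > "$work/public/README"
g -C "$work/public" commit --quiet -am "Update README upstream"

# 5) fetch only downloads; merge then combines both histories.
g -C "$work/local" fetch --quiet upstream
g -C "$work/local" merge --quiet --no-edit upstream/main

# Show the configured remotes and the resulting history.
g -C "$work/local" remote
g -C "$work/local" log --oneline --graph
```

With a real project, the two local paths would be replaced by the URL of your CL gitlab repo and the URL of the public repository. Because `git fetch` never touches your working files, you can inspect `upstream/main` first and decide afterwards how to integrate it.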
Topic 3: Frameworks
Choose one framework / package below, and run a quick-start tutorial on the Cluster. Most of them are provided in Jupyter Notebook format. You don't have to write any code! Just run the provided notebook as is. Do you encounter any error?
- Pytorch: seq2seq translation tutorial
- Pytorch Lightning: Text Transformers
- fairseq: Neural Machine Translation
- Choose one pre-trained model, and the corresponding dataset. Run inference and check if you can reproduce the score presented in the paper.
- Huggingface: Quicktour, Task summary
- Tensorflow: Transformer for language understanding
- fastai: Transfer learning in text
- Any publicly available quick-start tutorial (potentially) related to your project
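Before opening one of the notebooks listed above, it may help to check which frameworks the currently active environment can actually import. The following is a minimal sketch: `check_module` is a hypothetical helper, the module names are the usual import names of the listed packages, and it assumes `python3` refers to the interpreter of your activated virtual environment.

```shell
#!/bin/sh
# Hypothetical helper: report whether a Python module can be imported
# with the python3 found first on the PATH.
check_module() {
    module="$1"
    if python3 -c "import ${module}" >/dev/null 2>&1; then
        echo "${module}: available"
    else
        echo "${module}: NOT available"
    fi
}

# Import names of the frameworks listed above.
for m in torch pytorch_lightning fairseq transformers tensorflow fastai; do
    check_module "$m"
done
```

A module reported as not available usually means that the package is not installed in this environment or that a different environment is active. The import error itself is hidden here, so run the import by hand to see the full message.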
Topic 4: Work in a team
- Documentation
- Take meeting notes every time you meet. Share them online.
- Set one static access point for all.
- Use pictures, graphs, etc.
- Start writing TODAY!
- Planning
- Plan iterative, incremental cycles. Linear project management often fails!
- Simple first, easy first.
- Working demo is more convincing than a fancy theory.
- Coding
- Do code reviews.
- Keep a pull request small.
- Make a commit self-explanatory.
- cf.) the 15 min. rule of software development
- cf.) Rubber-duck debugging
- Communication
Do you have any other topics that should be covered in the session? Please write me an email! (ohta@cl.uni-heidelberg.de)