<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Computing | Weecology Wiki</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/</link><atom:link href="https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/index.xml" rel="self" type="application/rss+xml"/><description>Computing</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><image><url>https://deploy-preview-83--weecology-wiki.netlify.app/media/icon_hu2654a0fcc87c65a864822ac27b001d3b_700_512x512_fill_lanczos_center_3.png</url><title>Computing</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/</link></image><item><title/><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/open-drone-map/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/open-drone-map/</guid><description>&lt;h1 id="running-open-drone-map-on-hipergator">Running Open Drone Map on HiPerGator&lt;/h1>
&lt;p>Following the instructions in &lt;a href="containers">containers&lt;/a>:&lt;/p>
&lt;h2 id="pull-the-container">Pull the container&lt;/h2>
&lt;p>Here, we&amp;rsquo;ve created a folder (on blue) to store the container image:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">module load aptainer
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">srun -t 6:00:00 apptainer pull /blue/ewhite/your_user_id/odm/odm.sif docker://opendronemap/odm
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="setup-directories">Setup directories&lt;/h2>
&lt;p>ODM needs the working folder &lt;code>code&lt;/code> to exist, with a subfolder called &lt;code>images&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>your_user_id@login12 odm&lt;span class="o">]&lt;/span>$ tree
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">├── code
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">│ ├── images
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">│ │ ├── DSC00001.JPG
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">│ │ └── DSC00002.JPG
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">└── run.slurm
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We&amp;rsquo;ll create &lt;code>run.slurm&lt;/code> next:&lt;/p>
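This layout can be created in one step (a sketch, run from inside the odm folder; the slurm_logs folder is included because the SLURM script below writes its logs there):

```shell
# Create the working folder ODM expects, the images subfolder,
# and the slurm_logs folder the SBATCH output/error paths point at.
mkdir -p code/images slurm_logs
```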
&lt;h2 id="setup-slurm">Setup SLURM&lt;/h2>
&lt;p>We need to bind a working directory where the outputs + images will go. You&amp;rsquo;d think you could just bind to &lt;code>/code&lt;/code>, which is what ODM would like, but that doesn&amp;rsquo;t work because the EntryPoint of the container is &lt;code>/code/run.py&lt;/code>. So we need to set our working directory somewhere else. This is analogous to &lt;code>docker -v&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="cp">#!/bin/bash
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="cp">&lt;/span>&lt;span class="c1">#SBATCH --job-name=odm-node&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --nodes=1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --partition=hpg-turin&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --cpus-per-task=16&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --mem=64GB&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --time=12:00:00&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --gpus=1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --output=./slurm_logs/%A.out&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --error=./slurm_logs/%A.err&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">printenv &lt;span class="p">|&lt;/span> grep -i slurm &lt;span class="p">|&lt;/span> sort
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">srun apptainer run --bind &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">PWD&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">:/project&amp;#34;&lt;/span> /blue/ewhite/your_user_id/odm/odm.sif --project-path /project --max-concurrency &lt;span class="m">16&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Save this as &lt;code>run.slurm&lt;/code> and make sure you&amp;rsquo;ve added any other &lt;code>#SBATCH&lt;/code> arguments, such as the correct account and QOS for your group.&lt;/p>
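On HiPerGator that usually means setting the account and QOS explicitly; a sketch of the extra directives (the group name here is a placeholder, substitute your own):

```bash
#SBATCH --account=ewhite   # your group's allocation (placeholder)
#SBATCH --qos=ewhite       # matching QOS (placeholder)
```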
&lt;h2 id="copy-data">Copy data&lt;/h2>
&lt;p>Copy some drone images and go for a walk while rsync runs:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">rsync -avz --progress /data/everglades/Flight_1/DCIM/ hpg:/home/your_user_id/code/odm/code/images
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>For larger jobs, you should use some scratch storage space on blue. You can delete the images from the working directory afterwards.&lt;/p>
&lt;h2 id="run">Run&lt;/h2>
&lt;p>Launch and check the job was submitted:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">sbatch run.slurm
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">squeuemine
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="check-logs">Check logs&lt;/h2>
&lt;p>Then tail the log while it runs to confirm it&amp;rsquo;s doing something sensible:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>your_user_id@login12 odm&lt;span class="o">]&lt;/span>$ tail -f ./slurm_logs/20526676_4294967294.out
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">2025-12-08 21:50:58,774 DEBUG: Found &lt;span class="m">10000&lt;/span> points in 4.388810873031616s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">2025-12-08 21:50:59,008 INFO: Extracting ROOT_DSPSIFT features &lt;span class="k">for&lt;/span> image DSC00077.JPG
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">2025-12-08 21:50:59,219 INFO: Extracting ROOT_DSPSIFT features &lt;span class="k">for&lt;/span> image DSC00098.JPG
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">2025-12-08 21:51:04,036 DEBUG: Found &lt;span class="m">10000&lt;/span> points in 4.76168966293335s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">2025-12-08 21:51:04,185 DEBUG: Found &lt;span class="m">9406&lt;/span> points in 5.121237516403198s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">2025-12-08 21:51:04,482 INFO: Extracting ROOT_DSPSIFT features &lt;span class="k">for&lt;/span> image DSC00108.JPG
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">2025-12-08 21:51:04,600 INFO: Extracting ROOT_DSPSIFT features &lt;span class="k">for&lt;/span> image DSC00003.JPG
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">2025-12-08 21:51:08,902 DEBUG: Found &lt;span class="m">10000&lt;/span> points in 4.365730285644531s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">2025-12-08 21:51:09,338 INFO: Extracting ROOT_DSPSIFT features &lt;span class="k">for&lt;/span> image DSC00092.JPG
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">2025-12-08 21:51:09,627 DEBUG: Found &lt;span class="m">9079&lt;/span> points in 4.97150993347168s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">2025-12-08 21:51:10,010 INFO: Extracting ROOT_DSPSIFT features &lt;span class="k">for&lt;/span> image DSC00141.JPG
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">....
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> No more stages to run
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMMMMMMNNNMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMNNNMMMMMMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMdo:..---../sNMMMMMMMMMMMMMMMMMMMMMMMMMMNs/..---..:odMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMy-.odNMMMMMNy/&lt;span class="sb">`&lt;/span>/mMMMMMMMMMMMMMMMMMMMMMMm/&lt;span class="sb">`&lt;/span>/hNMMMMMNdo.-yMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMN/&lt;span class="sb">`&lt;/span>sMMMMMMMMMNNMm/&lt;span class="sb">`&lt;/span>yMMMMMMMMMMMMMMMMMMMMy&lt;span class="sb">`&lt;/span>/mMNNMMMMMMMMNs&lt;span class="sb">`&lt;/span>/MMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MM/ hMMMMMMMMNs.+MMM/ dMMMMMMMMMMMMMMMMMMh +MMM+.sNMMMMMMMMh +MM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MN /MMMMMMNo/./mMMMMN :MMMMMMMMMMMMMMMMMM: NMMMMm/./oNMMMMMM: NM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> Mm +MMMMMN+ &lt;span class="sb">`&lt;/span>/MMMMMMM&lt;span class="sb">`&lt;/span>-MMMMMMMMMMMMMMMMMM-&lt;span class="sb">`&lt;/span>MMMMMMM:&lt;span class="sb">`&lt;/span> oNMMMMM+ mM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MM..NMMNs./mNMMMMMMMy sMMMMMMMMMMMMMMMMMMo hMMMMMMMNm/.sNMMN&lt;span class="sb">`&lt;/span>-MM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMd&lt;span class="sb">`&lt;/span>:mMNomMMMMMMMMMy&lt;span class="sb">`&lt;/span>:MMMMMMMNmmmmNMMMMMMN:&lt;span class="sb">`&lt;/span>hMMMMMMMMMdoNMm-&lt;span class="sb">`&lt;/span>dMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMm:.omMMMMMMMMNh/ sdmmho/.&lt;span class="sb">`&lt;/span>..&lt;span class="sb">`&lt;/span>-&lt;span class="sb">``&lt;/span>-/sddh+ /hNMMMMMMMMdo.:mMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMd+--/osss+:-:/&lt;span class="sb">`&lt;/span> &lt;span class="sb">```&lt;/span>:- .ym+ hmo&lt;span class="sb">``&lt;/span>:-&lt;span class="sb">`&lt;/span> &lt;span class="sb">`&lt;/span>+:-:ossso/-:+dMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMMNmhysosydmNMo /ds&lt;span class="sb">`&lt;/span>/NMM+ hMMd..dh. sMNmdysosyhmNMMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMMMMMMMMMMMMMMs .:-:&lt;span class="sb">``&lt;/span>hmmN+ yNmds -:.:&lt;span class="sb">`&lt;/span>-NMMMMMMMMMMMMMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMMMMMMMMMMMMMN.-mNm- //:::. -:://: +mMd&lt;span class="sb">`&lt;/span>-NMMMMMMMMMMMMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMMMMMMMMMMMMM+ dMMN -MMNNN+ yNNNMN :MMMs sMMMMMMMMMMMMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMMMMMMMMMMMMM&lt;span class="sb">`&lt;/span>.mmmy /mmmmm/ smmmmm&lt;span class="sb">``&lt;/span>mmmh :MMMMMMMMMMMMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMMMMMMMMMMMMM&lt;span class="sb">``&lt;/span>:::- ./////. -:::::&lt;span class="sb">`&lt;/span> :::: -MMMMMMMMMMMMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMMMMMMMMMMMMM:&lt;span class="sb">`&lt;/span>mNNd /NNNNN+ hNNNNN .NNNy +MMMMMMMMMMMMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMMMMMMMMMMMMMd&lt;span class="sb">`&lt;/span>/MMM.&lt;span class="sb">`&lt;/span>ys+//. -/+oso +MMN.&lt;span class="sb">`&lt;/span>mMMMMMMMMMMMMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMMMMMMMMMMMMMMy /o:- &lt;span class="sb">`&lt;/span>oyhd/ shys+ &lt;span class="sb">`&lt;/span>-:s-&lt;span class="sb">`&lt;/span>hMMMMMMMMMMMMMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMMMNmdhhhdmNMMM&lt;span class="sb">`&lt;/span> +d+ sMMM+ hMMN:&lt;span class="sb">`&lt;/span>hh- sMMNmdhhhdmNMMMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMms:::/++//::+ho .+- /dM+ hNh- +/&lt;span class="sb">`&lt;/span> -h+:://++/::/smMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMN+./hmMMMMMMNds- ./oso:.&lt;span class="sb">``&lt;/span>:. :-&lt;span class="sb">``&lt;/span>.:os+- -sdNMMMMMMmy:.oNMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMm-.hMNhNMMMMMMMMNo&lt;span class="sb">`&lt;/span>/MMMMMNdhyyyyhhdNMMMM+&lt;span class="sb">`&lt;/span>oNMMMMMMMMNhNMh.-mMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MM:&lt;span class="sb">`&lt;/span>mMMN/-sNNMMMMMMMo yMMMMMMMMMMMMMMMMMMy sMMMMMMMNNs-/NMMm&lt;span class="sb">`&lt;/span>:MM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> Mm /MMMMMd/.-oMMMMMMN :MMMMMMMMMMMMMMMMMM-&lt;span class="sb">`&lt;/span>MMMMMMMo-./dMMMMM/ NM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> Mm /MMMMMMm:-&lt;span class="sb">`&lt;/span>sNMMMMN :MMMMMMMMMMMMMMMMMM-&lt;span class="sb">`&lt;/span>MMMMMNs&lt;span class="sb">`&lt;/span>-/NMMMMMM/ NM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MM:&lt;span class="sb">`&lt;/span>mMMMMMMMMd/-sMMMo yMMMMMMMMMMMMMMMMMMy sMMMs-/dMMMMMMMMd&lt;span class="sb">`&lt;/span>:MM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMm-.hMMMMMMMMMdhMNo&lt;span class="sb">`&lt;/span>+MMMMMMMMMMMMMMMMMMMM+&lt;span class="sb">`&lt;/span>oNMhdMMMMMMMMMh.-mMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMNo./hmNMMMMMNms--yMMMMMMMMMMMMMMMMMMMMMMy--smNMMMMMNmy/.oNMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMms:-:/+++/:-+hMMMMMMMMMMMMMMMMMMMMMMMMMNh+-:/+++/:-:smMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMMMNdhhyhdmMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMmdhyhhmNMMMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMMMMMMMMMMNNNNNMMMMMMNNNNNNMMMMMMMMNNMMMMMMMNNMMMMMMMMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMMMMMMMMh/-...-+dMMMm......:+hMMMMs../MMMMMo..sMMMMMMMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMMMMMMM/ /yhy- sMMm -hhy/ :NMM+ oMMMy /MMMMMMMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMMMMMMy /MMMMN&lt;span class="sb">`&lt;/span> NMm /MMMMo +MM: .&lt;span class="sb">`&lt;/span> yMd&lt;span class="sb">```&lt;/span> :MMMMMMMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMMMMMM+ sMMMMM: hMm /MMMMd -MM- /s &lt;span class="sb">`&lt;/span>h.&lt;span class="sb">`&lt;/span>d- -MMMMMMMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMMMMMMs +MMMMM. mMm /MMMMy /MM. +M/ yM: &lt;span class="sb">`&lt;/span>MMMMMMMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMMMMMMN- smNm/ +MMm :NNdo&lt;span class="sb">`&lt;/span> .mMM&lt;span class="sb">`&lt;/span> oMM+/yMM/ MMMMMMMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMMMMMMMNo- &lt;span class="sb">`&lt;/span>:yMMMm &lt;span class="sb">`&lt;/span>:sNMMM&lt;span class="sb">`&lt;/span> sMMMMMMM+ NMMMMMMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> MMMMMMMMMMMMMMMNmmNMMMMMMMNmmmmNMMMMMMMNNMMMMMMMMMNNMMMMMMMMMMMM
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> ODM app finished - Mon Dec &lt;span class="m">08&lt;/span> 22:13:26 &lt;span class="m">2025&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">.100 - &lt;span class="k">done&lt;/span>.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">(&lt;/span>ASCII art is fun&lt;span class="o">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The output will be in &lt;code>odm_orthophoto/odm_orthophoto.tif&lt;/code>&lt;/p>
&lt;h2 id="useful-flags-and-other-run-options">Useful flags and other run options&lt;/h2>
&lt;ul>
&lt;li>Store high-precision geotags from PPK in &lt;code>geo.txt&lt;/code> in the project folder and ODM will pick them up automatically (see &lt;a href="https://docs.opendronemap.org/geo/" target="_blank" rel="noopener">here&lt;/a>)&lt;/li>
&lt;/ul>
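For reference, a minimal geo.txt sketch (filenames and coordinates are made up): the first line is the CRS, then one image per line in the format filename lon lat alt yaw pitch roll h_acc v_acc:

```text
EPSG:4326
DSC00001.JPG -80.1234 25.5678 42.0 90.0 -89.9 0.0 0.02 0.05
DSC00002.JPG -80.1236 25.5679 42.1 90.0 -89.9 0.0 0.02 0.05
```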
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;Convert WISPR PPK coords to OpenDroneMap&amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">argparse&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">csv&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pathlib&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">Path&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">parser&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">argparse&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ArgumentParser&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">description&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;Convert WISPR CSV to ODM geo.txt&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">parser&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">add_argument&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;root_dir&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">help&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;Root directory to search for exif_image_list.csv&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">parser&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">add_argument&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;output_txt&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">help&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;Output geo.txt file for ODM&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">args&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">parser&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">parse_args&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">csv_files&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">list&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">Path&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">args&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">root_dir&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rglob&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;exif_image_list.csv&amp;#34;&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">if&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">csv_files&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">raise&lt;/span> &lt;span class="ne">SystemExit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Error: No exif_image_list.csv found&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">if&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">csv_files&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">&amp;gt;&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">raise&lt;/span> &lt;span class="ne">SystemExit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;Error: Found multiple exif_image_list.csv files:&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">f&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">f&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">csv_files&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">with&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">csv_files&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">])&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">infile&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">args&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">output_txt&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;w&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">outfile&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">reader&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">csv&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">DictReader&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">infile&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">outfile&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">write&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;EPSG:4326&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">row&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">reader&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># ODM format: filename lon lat alt yaw pitch roll [h_acc v_acc]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">outfile&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">write&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">row&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;image_name&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">row&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;longitude&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">row&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;latitude&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">row&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;altitude&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">row&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;gimbal_yaw&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">row&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;gimbal_pitch&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">row&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;gimbal_roll&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">row&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;x_accuracy&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">row&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;z_accuracy&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;Created &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">args&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">output_txt&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>Set &lt;code>--max-concurrency&lt;/code> to the number of CPUs assigned to the job&lt;/li>
&lt;li>&lt;code>--optimize-disk-space&lt;/code> to clean up intermediate files at the cost of needing to re-run the job if it crashes (no resume)&lt;/li>
&lt;li>&lt;code>--orthophoto-resolution 2&lt;/code> sets the ortho resolution in centimeters per pixel. The lower this value, the larger the output ortho and the slower the processing. The default is &lt;code>5&lt;/code>, which runs fairly quickly and is a good sanity check.&lt;/li>
&lt;li>Use &lt;a href="https://docs.opendronemap.org/gcp/" target="_blank" rel="noopener">GCPs&lt;/a>&lt;/li>
&lt;/ul>
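To get a feel for how the resolution flag affects output size, note that pixel count grows with the inverse square of the resolution. A quick sketch for a hypothetical 500 m × 500 m survey:

```python
# Orthophoto pixel dimensions for a hypothetical 500 m x 500 m survey
# at different --orthophoto-resolution values (cm per pixel).
extent_m = 500
for res_cm in (5, 2):
    px = extent_m * 100 // res_cm  # pixels per side
    print(f"{res_cm} cm/px -> {px} x {px} ({px * px / 1e6:.0f} MP)")
```

Going from the default 5 cm to 2 cm multiplies the pixel count by 6.25.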
&lt;h2 id="array-processing">Array processing.&lt;/h2>
&lt;p>Here is an example job file that can run on an array of surveys. The list of input folders is provided in &lt;code>folder_list.txt&lt;/code> (newline-delimited).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="cp">#!/bin/bash
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="cp">&lt;/span>&lt;span class="c1">#SBATCH --job-name=odm-array&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --nodes=1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --partition=hpg-turin&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --cpus-per-task=16&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --mem=64GB&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --time=12:00:00&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --gpus=1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --array=1-N%4&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --output=./slurm_logs/%A_%a.out&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --error=./slurm_logs/%A_%a.err&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Configuration&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">INPUT_FILE&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;folder_list.txt&amp;#34;&lt;/span> &lt;span class="c1"># Text file with one absolute path per line&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">WORKING_DIR&lt;/span>&lt;span class="o">=&lt;/span> &lt;span class="c1"># Where you want to create the output folders&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">ODM_SIF&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="c1"># Path to the SIF file&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Get the source folder for this array task&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">SOURCE_FOLDER&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="k">$(&lt;/span>sed -n &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">SLURM_ARRAY_TASK_ID&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">p&amp;#34;&lt;/span> &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$INPUT_FILE&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="k">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Get the basename&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">BASENAME&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="k">$(&lt;/span>basename &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$SOURCE_FOLDER&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="k">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Create target directory structure&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">TARGET_DIR&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">WORKING_DIR&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">/&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">BASENAME&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Print environment for debugging&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">printenv &lt;span class="p">|&lt;/span> grep -i slurm &lt;span class="p">|&lt;/span> sort
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">mkdir -p &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">TARGET_DIR&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">/code/images&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Copy JPG files&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">echo&lt;/span> &lt;span class="s2">&amp;#34;Copying images from &lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">SOURCE_FOLDER&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> to &lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">TARGET_DIR&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">/code/images&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">rsync -av --include&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;*.JPG&amp;#39;&lt;/span> --include&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;*.jpg&amp;#39;&lt;/span> --exclude&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;*&amp;#39;&lt;/span> &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">SOURCE_FOLDER&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">/&amp;#34;&lt;/span> &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">TARGET_DIR&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">/code/images/&amp;#34;&lt;/span> &lt;span class="o">||&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nb">echo&lt;/span> &lt;span class="s2">&amp;#34;Warning: No JPG files found in &lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">SOURCE_FOLDER&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Perform PPK geotagging&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">python3 wispr_to_odm_ppk.py &lt;span class="si">${&lt;/span>&lt;span class="nv">SOURCE_FOLDER&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="si">${&lt;/span>&lt;span class="nv">TARGET_DIR&lt;/span>&lt;span class="si">}&lt;/span>/geo.txt &lt;span class="o">||&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nb">echo&lt;/span> &lt;span class="s2">&amp;#34;Failed to find a PPK coordinate file. Processing will use EXIF GPS data only.&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">module load cuda
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Run ODM with the target directory as project path&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">echo&lt;/span> &lt;span class="s2">&amp;#34;Running ODM on &lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">TARGET_DIR&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">srun apptainer run --nv --bind &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">TARGET_DIR&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">:/project&amp;#34;&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$ODM_SIF&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --project-path /project &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --max-concurrency &lt;span class="m">16&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --orthophoto-resolution &lt;span class="m">2&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --optimize-disk-space
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Clean up the images in the target folder&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">echo&lt;/span> &lt;span class="s2">&amp;#34;Removing image folder &lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">TARGET_DIR&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">/code/images&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">rm -rf &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">TARGET_DIR&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">/code/images&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">echo&lt;/span> &lt;span class="s2">&amp;#34;Setting permissions on target foler&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">bash group-permissions-update.sh &lt;span class="si">${&lt;/span>&lt;span class="nv">TARGET_DIR&lt;/span>&lt;span class="si">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">echo&lt;/span> &lt;span class="s2">&amp;#34;Completed processing &lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">BASENAME&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>The &lt;code>#SBATCH --array=1-N%4&lt;/code> directive starts an array job of N tasks with at most 4 running concurrently.&lt;/li>
&lt;li>Note that &lt;code>module load cuda&lt;/code> is needed to make the GPU available to the job&lt;/li>
&lt;/ul>
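&lt;p>To see how each array task picks its input, here is the indexing logic from the job file in isolation, with made-up paths; &lt;code>SLURM_ARRAY_TASK_ID&lt;/code> is set by Slurm in a real job:&lt;/p>

```shell
# Two fake survey folders, one absolute path per line (paths are made up)
printf '%s\n' /blue/ewhite/demo/site_a /blue/ewhite/demo/site_b > folder_list.txt
SLURM_ARRAY_TASK_ID=2             # set automatically by Slurm for each task
SOURCE_FOLDER=$(sed -n "${SLURM_ARRAY_TASK_ID}p" folder_list.txt)
BASENAME=$(basename "$SOURCE_FOLDER")
echo "$BASENAME"                  # site_b
```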
&lt;p>Submission script:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="cp">#!/bin/bash
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="cp">&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">INPUT_FILE&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;folder_list.txt&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">N&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="k">$(&lt;/span>wc -l &amp;lt; &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$INPUT_FILE&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="k">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">sbatch --array&lt;span class="o">=&lt;/span>1-&lt;span class="nv">$N&lt;/span> job_array.slurm
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div></description></item><item><title>Accessing the T-drive</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/t-drive/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/t-drive/</guid><description>&lt;ul>
&lt;li>If off campus, turn on the virtual private network (VPN) before connecting using the instructions here: &lt;a href="https://vpn.ufl.edu/" target="_blank" rel="noopener">https://vpn.ufl.edu/&lt;/a>&lt;/li>
&lt;li>Follow the instructions for: 1) &lt;a href="https://wec.ifas.ufl.edu/resources/it--computer-support/mapping-your-t-and-u-drives/" target="_blank" rel="noopener">Windows&lt;/a> or 2) &lt;a href="https://wec.ifas.ufl.edu/resources/it--computer-support/mapping-your-t-and-u-drives-mac/" target="_blank" rel="noopener">macOS&lt;/a>&lt;/li>
&lt;li>On Linux use the same basic approach described in the Windows/macOS instructions, but enter &lt;code>UFAD&lt;/code> for the &lt;code>WORKGROUP&lt;/code>&lt;/li>
&lt;li>Lab materials are in the &lt;code>lab-white-ernest&lt;/code> folder&lt;/li>
&lt;li>Our public file serving folder is &lt;code>Weecology&lt;/code>&lt;/li>
&lt;/ul>
&lt;h2 id="linux">Linux&lt;/h2>
&lt;p>Assuming you&amp;rsquo;re running Ubuntu: install the &lt;code>cifs-utils&lt;/code> package, create a folder called &lt;code>/media/T&lt;/code> to use as a mount point, then run the mount command, replacing &lt;code>&amp;lt;your-gatorlink&amp;gt;&lt;/code> with your UF gatorlink (the part of your UF email address before the &lt;code>@&lt;/code>) and &lt;code>&amp;lt;your-local-username&amp;gt;&lt;/code> with your username on the Linux machine you are mounting from (you can check this with &lt;code>whoami&lt;/code>). If you haven&amp;rsquo;t run &lt;code>sudo&lt;/code> recently you will be prompted for your local password, and you will then be prompted for your UF password.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl"># 1) Install CIFS support
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">sudo apt-get install -y cifs-utils
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># 2) Create the mount point (if it doesn&amp;#39;t exist)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">sudo mkdir -p /media/T
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># 3) Unmount if already mounted (lazy to avoid &amp;#34;busy&amp;#34; errors)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">sudo umount -l /media/T 2&amp;gt;/dev/null || true
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># 4) Mount with your user/group mapping and permissions. Customize permissions to your need
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># - file_mode=0660 -&amp;gt; files: rw-rw----
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># - dir_mode=0770 -&amp;gt; folders: rwxrwx---
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># - Replace &amp;lt;your-gatorlink&amp;gt; and &amp;lt;preferred-local-group&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">sudo mount -t cifs LINK_FROM_INSTRUCTIONS_ABOVE_WITHOUT_SMB_PART /media/T \
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> -o username=&amp;lt;your-gatorlink&amp;gt;,uid=$(id -u),gid=$(getent group &amp;lt;preferred-local-group&amp;gt; | cut -d: -f3),file_mode=XXXX,dir_mode=XXXX
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The &lt;code>,gid=&amp;lt;preferred-local-group&amp;gt;&lt;/code> can be left out if no group is needed, but on shared systems (like Serenity) we typically want to include a group so everyone can access the share.
The full command, including the &lt;code>LINK_FROM_INSTRUCTIONS_ABOVE_WITHOUT_SMB_PART&lt;/code>, is available for weecology folks as a pinned post on the Serenity and HiPerGator Slack channels.&lt;/p></description></item><item><title>Lab Style Guide for Code</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/lab-code-style-guide/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/lab-code-style-guide/</guid><description>&lt;h2 id="guiding-principles">Guiding Principles&lt;/h2>
&lt;p>This document provides a guide to code structure and formatting across languages used within the Weecology projects. Links to language-specific guides are provided below.&lt;/p>
&lt;p>Generally, this guide follows the principles outlined &lt;a href="http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745#s2" target="_blank" rel="noopener">here&lt;/a>. In particular:&lt;/p>
&lt;ol>
&lt;li>Write and style your code for human readers&lt;/li>
&lt;li>Minimize the number of facts a reader is expected to hold at one time&lt;/li>
&lt;li>Use consistent, distinct, and meaningful names&lt;/li>
&lt;li>Employ consistent style and formatting&lt;/li>
&lt;/ol>
&lt;h2 id="structure">Structure&lt;/h2>
&lt;h3 id="modularize">Modularize&lt;/h3>
&lt;ul>
&lt;li>Break code into chunks corresponding to contained tasks&lt;/li>
&lt;li>Whenever possible write code into functions, even if the function isn&amp;rsquo;t called repeatedly&lt;/li>
&lt;/ul>
&lt;h3 id="loops">Loops&lt;/h3>
&lt;ul>
&lt;li>Loops should be used for repeated tasks; iterate over the items themselves unless you actually need the indices&lt;/li>
&lt;li>If the language allows, use vectorized functions in place of loops to speed computation and reduce code volume&lt;/li>
&lt;/ul>
&lt;h2 id="style">Style&lt;/h2>
&lt;h3 id="naming">Naming&lt;/h3>
&lt;ul>
&lt;li>Be concise and meaningful; when the two conflict, conveying meaning matters more than brevity
&lt;ul>
&lt;li>Document abbreviations if they are not common or immediately intuitive&lt;/li>
&lt;li>Functions are verbs, variables are nouns&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Use snake_case for variables and functions (e.g., &lt;code>portal_data&lt;/code>)
&lt;ul>
&lt;li>Exceptions:
&lt;ul>
&lt;li>established prefixes (e.g., &lt;code>n&lt;/code> in &lt;code>nobs&lt;/code> to indicate the number of observations)&lt;/li>
&lt;li>established suffixes (e.g., &lt;code>i&lt;/code> in &lt;code>obsi&lt;/code> to indicate the specific observation in a for loop)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Use UpperCamelCase for class names for object oriented programming (primarily in Python)&lt;/li>
&lt;li>Do not use &lt;code>.&lt;/code> in names (&lt;a href="http://adv-r.had.co.nz/Style.html" target="_blank" rel="noopener">particularly in R&lt;/a>)&lt;/li>
&lt;li>Do not use single-letter names
&lt;ul>
&lt;li>Exceptions:
&lt;ul>
&lt;li>representing a term in an equation (e.g., &lt;code>y&lt;/code> in &lt;code>y = m * x + b&lt;/code>)&lt;/li>
&lt;li>using an established name in a language (e.g., &lt;code>n&lt;/code> references the number of draws from a random variable in R)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Constants, and only constants, should be in all caps&lt;/li>
&lt;/ul>
&lt;h3 id="white-space">White space&lt;/h3>
&lt;ul>
&lt;li>spaces after commas&lt;/li>
&lt;li>spaces around operators (unless inside the argument definitions in Python)&lt;/li>
&lt;li>no spaces around parentheses&lt;/li>
&lt;/ul>
&lt;h3 id="line-length">Line length&lt;/h3>
&lt;ul>
&lt;li>Lines &amp;lt;= 80 characters
&lt;ul>
&lt;li>But a few extra characters can be better than confusing contortions just to stay under the limit&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Parentheses/brackets/braces with breaks after commas are typically better than line break characters (but not always)&lt;/li>
&lt;/ul>
&lt;h3 id="indentation">Indentation&lt;/h3>
&lt;ul>
&lt;li>Always indent to indicate that code is inside a function, loop, etc. (Python makes you do this. Thanks Python!)&lt;/li>
&lt;li>Use spaces, not tabs (but it&amp;rsquo;s fine for your IDE to turn the tab key into spaces)&lt;/li>
&lt;li>Follow language convention for number of spaces
&lt;ul>
&lt;li>R: 2&lt;/li>
&lt;li>Python: 4&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>When breaking lines with parentheses (mostly in function calls/definitions) align with the leading character after the opening parenthesis
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">(stuff, things,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> more_things)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;/ul>
&lt;h3 id="references">References&lt;/h3>
&lt;ul>
&lt;li>Magic numbers (numeric references to elements, columns, rows, etc.) should be avoided.&lt;/li>
&lt;li>References should be made by name or, in the case of loops, by position.&lt;/li>
&lt;/ul>
&lt;h2 id="documentation">Documentation&lt;/h2>
&lt;h3 id="in-line-commenting">In-line commenting&lt;/h3>
&lt;ul>
&lt;li>&lt;a href="http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745#s8" target="_blank" rel="noopener">Document what the code does and how to use it, not how it does it&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="external-documentation">External documentation&lt;/h3>
&lt;ul>
&lt;li>Use standard documentation comment styles
&lt;ul>
&lt;li>R: &lt;a href="http://r-pkgs.had.co.nz/man.html" target="_blank" rel="noopener">roxygen&lt;/a>&lt;/li>
&lt;li>Python: &lt;a href="https://www.python.org/dev/peps/pep-0257/" target="_blank" rel="noopener">docstrings&lt;/a>&lt;/li>
&lt;li>These can create formatted documentation, but they are useful visual indicators even if you don&amp;rsquo;t do this&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="language-specific-style-guides">Language specific style guides&lt;/h2>
&lt;ul>
&lt;li>Follow official language style guides (within reason). This helps make your code broadly readable and makes external contributions more likely.
&lt;ul>
&lt;li>Python: &lt;a href="https://www.python.org/dev/peps/pep-0008/" target="_blank" rel="noopener">Official Python Style Guide (PEP8)&lt;/a>&lt;/li>
&lt;li>Julia: &lt;a href="https://docs.julialang.org/en/v1/manual/style-guide/" target="_blank" rel="noopener">Official Julia Style Guide&lt;/a>&lt;/li>
&lt;li>R: &lt;a href="http://adv-r.had.co.nz/Style.html" target="_blank" rel="noopener">Hadley Wickham&amp;rsquo;s style guide&lt;/a>. This isn&amp;rsquo;t official, or broadly agreed on, but it serves as the base (or at least justification) for a lot of what we do&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Computer Setup - Mac</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/computer-setup-mac/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/computer-setup-mac/</guid><description>&lt;h2 id="terminal">Terminal&lt;/h2>
&lt;p>This is an application on Macs. It is opened by double-clicking on the icon, which should bring up a small window with a black background. The Terminal is used to move around in the file directory, rearrange files, and use git.&lt;/p>
&lt;h2 id="python">Python&lt;/h2>
&lt;p>&lt;em>IDE&lt;/em>&lt;/p>
&lt;p>&lt;em>Packages&lt;/em>&lt;/p>
&lt;p>&lt;em>Projects&lt;/em>&lt;/p>
&lt;p>&lt;em>Links to get started coding in Python&lt;/em>&lt;/p>
&lt;p>&lt;em>Using git/GitHub&lt;/em>&lt;/p>
&lt;h2 id="r">R&lt;/h2>
&lt;p>&lt;em>Installing R&lt;/em>&lt;/p>
&lt;p>&lt;em>Installing RStudio&lt;/em>&lt;/p>
&lt;p>&lt;em>Packages&lt;/em>&lt;/p>
&lt;p>&lt;em>Projects&lt;/em>&lt;/p>
&lt;p>&lt;em>Using git/GitHub&lt;/em>&lt;/p>
&lt;h2 id="dependencies">Dependencies&lt;/h2>
&lt;h2 id="gitgithub">Git/GitHub&lt;/h2>
&lt;p>Resources: &lt;a href="http://swcarpentry.github.io/git-novice/" target="_blank" rel="noopener">Software Carpentry&lt;/a>, &lt;a href="https://happygitwithr.com/" target="_blank" rel="noopener">Happy Git and GitHub for the useR&lt;/a>, &lt;a href="http://rogerdudler.github.io/git-guide/" target="_blank" rel="noopener">Roger Dudler&amp;rsquo;s git cheatsheet&lt;/a>&lt;/p>
&lt;h3 id="installing-git">Installing git&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>Open up the Terminal, type in &amp;ldquo;git&amp;rdquo; and press enter.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>This should cause a pop-up window to appear. It will have several options; click on &amp;ldquo;Install&amp;rdquo; (not &amp;ldquo;Get Xcode&amp;rdquo;, see &amp;ldquo;Installing Xcode&amp;rdquo; for that).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Click &amp;ldquo;Agree&amp;rdquo;.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>When the install is finished, click &amp;ldquo;Done&amp;rdquo;.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>To make sure this worked, type in &amp;ldquo;git&amp;rdquo; in the Terminal and press enter. Some information will come up, including a list of common commands.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h3 id="configuring-git">Configuring Git&lt;/h3>
&lt;p>There&amp;rsquo;s some basic info on Git setup from &lt;a href="https://swcarpentry.github.io/git-novice/#installing-git" target="_blank" rel="noopener">software carpentry&lt;/a>. If you are also setting up a GitHub account, be sure to use the same email address, so that when you use Git on your computer and &lt;em>push&lt;/em> the changes to GitHub, it identifies you correctly.&lt;/p>
&lt;p>On a mac, browsing folders in finder also tends to generate &lt;code>.DS_Store&lt;/code> files. You generally don&amp;rsquo;t want to include those in your repositories, so here are &lt;a href="https://www.jeffgeerling.com/blogs/jeff-geerling/stop-letting-dsstore-slow-you" target="_blank" rel="noopener">some instructions&lt;/a> to ignore such files globally.&lt;/p>
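&lt;p>The gist of those instructions is a two-line setup. A sketch (the &lt;code>~/.gitignore_global&lt;/code> filename is a common convention, not a requirement):&lt;/p>

```shell
# Add .DS_Store to a machine-wide excludes file and point git at it.
# The filename ~/.gitignore_global is a convention, not a requirement.
echo ".DS_Store" >> "$HOME/.gitignore_global"
git config --global core.excludesfile "$HOME/.gitignore_global"
git config --global core.excludesfile
```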
&lt;h3 id="installing-xcode">Installing Xcode&lt;/h3>
&lt;h3 id="github">GitHub&lt;/h3>
&lt;p>Create a GitHub account by going to &lt;a href="https://github.com/" target="_blank" rel="noopener">https://github.com/&lt;/a>&lt;/p>
&lt;p>This is a service that allows you to store your code, project management materials, etc., online, and lets other people look at your work (if the repo is public). It also saves every version of your code as you change it, which is referred to as version control. This eliminates the need to create multiple copies of code as you change it (e.g., a folder with files called &amp;ldquo;data_analysis&amp;rdquo;, &amp;ldquo;data_analysis_3&amp;rdquo;, &amp;ldquo;data_analysis_45&amp;rdquo;, &amp;ldquo;data_analysis_final&amp;rdquo;, and &amp;ldquo;data_analysis_final_really&amp;rdquo;), and you can use the online GitHub interface to easily look back at previous versions of your code and see what was changed.&lt;/p>
&lt;h3 id="repositories">Repositories&lt;/h3>
&lt;p>A repository is where you put all of the materials related to a single project. One repository per project.&lt;/p>
&lt;p>Creating a GitHub repository:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Open up a browser, go to the GitHub website, and sign into your GitHub account. Navigate to your profile page.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Click on the &amp;ldquo;Repositories&amp;rdquo; tab in the top middle of the page.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>In the upper right hand corner, click on the green &amp;ldquo;New&amp;rdquo; button.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>On this page, name the repo, preferably something short and succinct that uniquely describes the project. There should be no spaces in repository names; if the name consists of multiple words, separate them with underscores, dashes, or camel case, e.g., mammal_community_dynamics, mammal-community-dynamics, MammalCommunityDynamics.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Select whether the repo will be public or private. Most of your repos will likely be public; having your research publicly available also makes for better science.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Check &amp;ldquo;Initialize this repository with a README&amp;rdquo;.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Leave both &amp;ldquo;Add .gitignore&amp;rdquo; and &amp;ldquo;Add a license&amp;rdquo; as &amp;ldquo;None&amp;rdquo;.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Click green &amp;ldquo;Create repository&amp;rdquo; button. You&amp;rsquo;ve created your new repository, congrats!&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>Cloning GitHub repository:&lt;/p>
&lt;p>The repository that was created above is the remote repository. This can be accessed from any computer using the browser. You also need to create a copy of the remote repository called the local repository. This copy of the repo can only be accessed from the computer it is created on. You will do work (e.g., changing code) on the local repo.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>In the browser, navigate to the main page of the repository you want to clone.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Click the green &amp;ldquo;Code&amp;rdquo; button above the list of files. A dropdown will appear containing a URL. There are two possible options for this URL, either HTTPS or SSH; you can switch between them using the tabs above the URL.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Select the HTTPS tab. Either HTTPS or SSH can be used, but it is easier to start with HTTPS. The difference between them is how the local repo authenticates to the remote, but that difference is not important right now.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Copy the HTTPS URL.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Open up the Terminal and navigate to the location on your computer where you want the local repo to be located. You can navigate around using the commands &amp;ldquo;ls&amp;rdquo; (this displays all the folders and files in the current directory) and &amp;ldquo;cd&amp;rdquo;. This latter command changes the directory, so you will type in the path for the directory you want to go to. For example, if I want to put the repo in the folder Projects, which is within the folder Documents, I would type the following into the Terminal: &amp;ldquo;cd Documents/Projects&amp;rdquo;. Then hit enter.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Once you&amp;rsquo;re in the directory where you want the repo to be located, type in the command &amp;ldquo;git clone&amp;rdquo;, a space, and then paste in the HTTPS URL.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Hit enter. This should create a new folder in the chosen directory that has the same name as the remote repo. This is your local repository.&lt;/p>
&lt;/li>
&lt;/ol>
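&lt;p>The clone steps look like this in the Terminal. This sketch uses a local path in place of the copied HTTPS URL so it can run offline; with a real repo you would paste the GitHub URL instead:&lt;/p>

```shell
# Create a stand-in "remote" repo locally (a GitHub HTTPS URL works the same)
git init -q --bare /tmp/demo_remote.git
cd /tmp
# "git clone <url>" creates a new folder named after the repo (the trailing
# .git is stripped), here /tmp/demo_remote -- this is your local repository
git clone -q /tmp/demo_remote.git
```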
&lt;p>Adding files and commits to local repository:&lt;/p>
&lt;p>An important aspect of git/GitHub is version control. As you change scripts, you can use git to save all the different versions of scripts and then use GitHub later on to easily look at each of these versions. Then, if you mess up your code or don&amp;rsquo;t like the direction it&amp;rsquo;s heading in, you can access and use a previous version of your script very easily.&lt;/p>
&lt;p>The way that you save these versions using git is by doing something called a commit. Each commit represents a different version of a script. Because you choose when to do a commit, you get to choose how different all of the versions of the script are. You should definitely make a commit for every major change in the code that you make, but you can never commit too often. When in doubt, commit.&lt;/p>
&lt;p>Another important, and confusing, step in this process is adding the script. Before you can commit the newest version of a script, you have to add that script to the stage. This means that if you&amp;rsquo;ve changed several of the scripts within one local repository, you can add all of them to the stage and commit them together, or you can add and commit them one at a time.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Create a new script (in Python, R, or whatever language) and save the script in the folder for the local repository you&amp;rsquo;ve just created. Similar to the names of repositories, there should be no spaces in script names. See the fourth bullet point in the &amp;ldquo;Creating a GitHub Repository&amp;rdquo; section for naming conventions.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If you are not already there, open up the Terminal and navigate to the local repository folder using the &amp;ldquo;cd&amp;rdquo; command.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>To add the script, type in &amp;ldquo;git add &amp;quot; and then the name of the script, and then hit enter. (There should be a space between the add command and the script name). The script is now on the stage. Optional: You can repeat this multiple times in a row with different script names if you&amp;rsquo;re adding multiple scripts to the stage at the same time.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>To make sure that the script has been added, type in &amp;ldquo;git status&amp;rdquo; and hit Enter. This will bring up information about the repo that you are currently looking at. If you&amp;rsquo;ve correctly added the script, under the &amp;ldquo;Changes to be committed:&amp;rdquo; header there should be an indented line formatted as &amp;ldquo;new file: &amp;rdquo; followed by the name of the script you&amp;rsquo;ve added (for a previously committed script that you changed, it will say &amp;ldquo;modified: &amp;rdquo; instead).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Now you want to commit this version of the script. In the Terminal, type in &amp;ldquo;git commit -m &amp;ldquo;&lt;em>message&lt;/em>&amp;rdquo;&amp;rdquo; and hit Enter. The &lt;em>message&lt;/em> is where you insert a succinct, informative description of what changed between the last version and this newest version of the script. Writing good commit messages is a bit of an art, but there is some good advice on commit messages &lt;a href="http://chris.beams.io/posts/git-commit/" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>You can use git status again to check that the commit worked. Type in &amp;ldquo;git status&amp;rdquo; and hit Enter. Now the entire &amp;ldquo;Changes to be committed:&amp;rdquo; section should be gone, because there should no longer be any changes that haven&amp;rsquo;t been committed in this repo.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>You can look at this commit, and all previous commits, by typing in &amp;ldquo;git log&amp;rdquo; and hitting Enter. This will bring up a list of the commits, with the most recent commit at the top. The information about each commit includes the author, date, and message of the commit. At the top of each commit, there is a long string of letters and numbers. This is the hash, or unique identifier, for each commit.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>You will do this add and commit workflow (steps 2-7) each time you make a substantial change to this script, or when you want to add another script.&lt;/p>
&lt;/li>
&lt;/ol>
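&lt;p>The add/commit workflow above can be sketched as a terminal session. The repository and script names here are just examples; substitute your own:&lt;/p>

```shell
# Set up a local repository with one new script (names are examples)
mkdir my-local-repo && cd my-local-repo
git init

# Tell git who you are (once per machine, or per repo as here)
git config user.name "Your Name"
git config user.email "you@example.com"

# Create a script, then stage it (repeat "git add" for more files)
echo 'print("hello")' > analysis_script.py
git add analysis_script.py

# Confirm it appears under "Changes to be committed:"
git status

# Commit the staged change with a succinct message
git commit -m "Add initial analysis script"

# Review the history: hash, author, date, and message for each commit
git log
```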
&lt;p>Pushing and pulling:&lt;/p>
&lt;p>Commits save versions of your work locally. To get those changes up to the GitHub repository you &lt;em>push&lt;/em> them, and to bring down changes that collaborators have made you &lt;em>pull&lt;/em>.&lt;/p></description></item><item><title>Containers</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/containers/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/containers/</guid><description>&lt;p>Containers are a tool for running code in self-contained environments.&lt;/p>
&lt;h2 id="debugging-r-devel-using-docker">Debugging r-devel using Docker&lt;/h2>
&lt;p>First &lt;a href="https://www.docker.com/get-started" target="_blank" rel="noopener">install docker&lt;/a> for your operating system.&lt;/p>
&lt;p>Then get an r-devel container:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">sudo docker pull rocker/drd
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Then run docker interactively while mounting your working directory:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">sudo docker run -v WORKING/DIRECTORY:/mnt/WORKDIR -it rocker/drd:latest
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Replace &lt;code>WORKING/DIRECTORY&lt;/code> with the path to the directory whose files you want to access and &lt;code>WORKDIR&lt;/code> with whatever you want it to be named inside the &lt;code>/mnt/&lt;/code> directory in the container.&lt;/p>
&lt;p>This will open an interactive R console running r-devel.&lt;/p>
&lt;h2 id="using-containers-in-vs-code">Using containers in VS Code&lt;/h2>
&lt;p>If you use VS Code as your IDE you can develop inside a container.
To set this up, follow &lt;a href="https://code.visualstudio.com/docs/devcontainers/containers" target="_blank" rel="noopener">the official instructions&lt;/a>, which can be summarized as:&lt;/p>
&lt;ol>
&lt;li>Install docker for your OS with non-sudo access (on Linux add your user to the &lt;code>docker&lt;/code> group and log out)&lt;/li>
&lt;li>Install the Dev Containers extension&lt;/li>
&lt;li>Create a &lt;code>.devcontainer/devcontainer.json&lt;/code> file in your project directory indicating which container to use, e.g.,&lt;/li>
&lt;/ol>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;r-devel&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;image&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;rocker/drd:latest&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This can be a little tricky to set up (don&amp;rsquo;t be afraid to ask for help), but when you open the project in VS Code you&amp;rsquo;ll be automatically working in the designated container.&lt;/p>
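&lt;p>For reference, a slightly fuller &lt;code>devcontainer.json&lt;/code> can also preinstall editor extensions inside the container via the &lt;code>customizations&lt;/code> field. The extension ID below is illustrative (the R extension), not required:&lt;/p>

```json
{
  "name": "r-devel",
  "image": "rocker/drd:latest",
  "customizations": {
    "vscode": {
      "extensions": ["REditorSupport.r"]
    }
  }
}
```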
&lt;h2 id="using-containers-on-hipergator">Using containers on HiPerGator&lt;/h2>
&lt;p>While many developers are familiar with Docker, HiPerGator uses a slightly different container system called &lt;a href="https://docs.rc.ufl.edu/software/apps/apptainer/" target="_blank" rel="noopener">apptainer&lt;/a>. Apptainer can run Docker containers by first converting them to a Singularity image (.sif), which is similar to an ISO if you&amp;rsquo;ve ever burned a disc or created a bootable USB drive. One convenience is that the container image is a single portable file on disk, while with Docker it&amp;rsquo;s not always obvious where images are stored. Interacting with this image file is very similar to working with Docker. You can read more about Apptainer &lt;a href="https://apptainer.org/docs/user/latest/quick_start.html#interacting-with-images" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;h2 id="example-open-drone-map">Example: Open Drone Map&lt;/h2>
&lt;p>In this example we&amp;rsquo;ll use Open Drone Map (&lt;a href="https://github.com/OpenDroneMap/ODM" target="_blank" rel="noopener">ODM&lt;/a>). ODM can be run via a docker container to avoid a fairly complex installation process.&lt;/p>
&lt;p>First, we need to &lt;a href="https://apptainer.org/docs/user/latest/docker_and_oci.html" target="_blank" rel="noopener">pull an image&lt;/a>. On the cluster, we need to load the &lt;code>apptainer&lt;/code> module first. Then we have to &lt;code>pull&lt;/code> the image. This may take a while, so use &lt;code>srun&lt;/code> or &lt;code>sbatch&lt;/code> with a job file. The first argument is the (optional) path to the created image, the second argument is the docker repo ID:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-gdscript3" data-lang="gdscript3">&lt;span class="line">&lt;span class="cl">&lt;span class="n">module&lt;/span> &lt;span class="nb">load&lt;/span> &lt;span class="n">aptainer&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">srun&lt;/span> &lt;span class="o">-&lt;/span>&lt;span class="n">t&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">00&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">00&lt;/span> &lt;span class="n">apptainer&lt;/span> &lt;span class="n">pull&lt;/span> &lt;span class="o">/&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">to&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">odm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sif&lt;/span> &lt;span class="n">docker&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="o">//&lt;/span>&lt;span class="n">opendronemap&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">odm&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="launchihng-the-container">Launchihng the container&lt;/h3>
&lt;p>To run a container, call &lt;code>srun apptainer run&lt;/code> which should run the &lt;a href="https://docs.docker.com/reference/dockerfile/#entrypoint" target="_blank" rel="noopener">entrypoint&lt;/a> in the container. Like Docker, you can also launch a shell or execute any program that&amp;rsquo;s installed in the container (&lt;code>docker exec&lt;/code> -&amp;gt; &lt;code>apptainer exec&lt;/code>).&lt;/p>
&lt;h3 id="mounting-local-directories">Mounting local directories&lt;/h3>
&lt;p>The most common argument you&amp;rsquo;ll want is &lt;code>--bind&lt;/code>, which is similar to &lt;code>-v&lt;/code> in Docker and lets you mount a local filesystem in the container. In our example, ODM will look for input files in the &lt;code>/project&lt;/code> folder. If we create a folder called &lt;code>data/working_dir&lt;/code>, we can then &lt;em>bind&lt;/em> this directory to &lt;code>/project&lt;/code> with &lt;code>--bind &amp;quot;data/working_dir:/project&amp;quot;&lt;/code> (the pattern is &lt;code>--bind &amp;quot;host_path:container_path&amp;quot;&lt;/code>).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="cp">#!/bin/bash
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="cp">&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">... &lt;span class="c1"># Other SBATCH options&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --job-name=odm-node&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --nodes=1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --partition=hpg-turin&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --cpus-per-task=16&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --mem=64GB&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --time=12:00:00&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --gpus=1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">module load apptainer
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">srun apptainer run --bind &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">PROJECT_FOLDER&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">:/project&amp;#34;&lt;/span> odm.sif --project-path /project --max-concurrency &lt;span class="m">16&lt;/span> --fast-orthophoto
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Don&amp;rsquo;t forget to add any other &lt;code>SBATCH&lt;/code> flags you need.&lt;/p>
&lt;h3 id="nvidia-inside-containers">NVIDIA inside containers&lt;/h3>
&lt;p>Use the &lt;code>--nv&lt;/code> flag to &lt;a href="https://apptainer.org/docs/user/latest/gpu.html" target="_blank" rel="noopener">pass NVIDIA GPUs&lt;/a> to the container. This requires CUDA to be installed on the host system (i.e., &lt;code>module load cuda&lt;/code> first).&lt;/p></description></item><item><title>Create an orthomosaic in Agisoft</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/create-orthomosaic-agisoft/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/create-orthomosaic-agisoft/</guid><description>&lt;h1 id="orthomosiac-guide">Orthomosaic Guide&lt;/h1>
&lt;p>Written by Ben Weinstein, September 13, 2022&lt;/p>
&lt;p>The goal of this wiki is to document the steps to create an orthomosaic from a set of raw UAV images.
This relates to the Everglades project, which uses both the Inspire quadcopter and Wingtra drones, but it can serve as a general guide to the steps for creating a georeferenced image.&lt;/p>
&lt;h1 id="agisoft">Agisoft&lt;/h1>
&lt;p>This tutorial uses Agisoft Metashape Pro. Future work needs to determine whether the standard version will suffice.
We are following the manual here: &lt;a href="https://www.agisoft.com/pdf/metashape-pro_1_7_en.pdf" target="_blank" rel="noopener">https://www.agisoft.com/pdf/metashape-pro_1_7_en.pdf&lt;/a>
This tutorial is quick and dirty in the sense that we are choosing low quality outputs to speed up the workflow.&lt;/p>
&lt;h2 id="load-images">Load Images&lt;/h2>
&lt;p>Raw images look like:
&lt;img width="1702" alt="Screen Shot 2022-09-13 at 11 19 35 AM" src="https://user-images.githubusercontent.com/1208492/189980044-8d168d05-5207-4880-9c4c-0773f91b6d70.png">&lt;/p>
&lt;p>Workflow -&amp;gt; Add Folder
&lt;img width="1630" alt="Screen Shot 2022-09-13 at 11 21 12 AM" src="https://user-images.githubusercontent.com/1208492/189980448-df8fabaa-e583-4b49-ac24-aa560fb6c8a8.png">&lt;/p>
&lt;h2 id="align-photos">Align Photos&lt;/h2>
&lt;p>Align Photos places the images from a single physical camera into a common reference system. It creates a sparse point cloud from the overlap among images to estimate roughly where the images are in space.&lt;/p>
&lt;img width="1765" alt="Screen Shot 2022-09-13 at 11 27 44 AM" src="https://user-images.githubusercontent.com/1208492/189981585-9d22162a-3090-46f6-8312-1aa294a1d4cc.png">
&lt;p>To view the coordinate system and alignment, view the &amp;lsquo;reference&amp;rsquo; tab in the bottom left.
&lt;img width="1486" alt="Screen Shot 2022-09-13 at 11 29 45 AM" src="https://user-images.githubusercontent.com/1208492/189982143-23d88091-a64b-4eed-a4df-1cd94040a3f9.png">&lt;/p>
&lt;h3 id="set-coordinate-reference-system">Set Coordinate Reference System&lt;/h3>
&lt;p>We are confident that the Inspire knows its geospatial position to within about 10 m, and its pitch to within about 1 degree.
&lt;img width="1476" alt="Screen Shot 2022-09-13 at 11 38 42 AM" src="https://user-images.githubusercontent.com/1208492/189983775-5c3d4efa-d6a2-4c92-b3fc-4492810a44f7.png">&lt;/p>
&lt;h2 id="set-marker-points">Set Marker Points&lt;/h2>
&lt;p>If we don&amp;rsquo;t have ground control points we can skip this step.&lt;/p>
&lt;h2 id="build-dense-cloud">Build Dense Cloud&lt;/h2>
&lt;p>Workflow -&amp;gt; Build Dense Cloud&lt;/p>
&lt;p>The dense point cloud is required to build an elevation model to convert the 2D images into a 3D surface.&lt;/p>
&lt;h2 id="build-digital-elevation-model">Build Digital Elevation Model&lt;/h2>
&lt;p>The elevation model projects the images into 3D space and is saved as a separate .tif file.&lt;/p>
&lt;p>Workflow -&amp;gt; Build DEM&lt;/p>
&lt;h2 id="build-orthomosaic">Build Orthomosaic&lt;/h2>
&lt;p>Use the digital elevation model and the dense point cloud to create a single stitched model of the entire colony.&lt;/p>
&lt;img width="1651" alt="image" src="https://user-images.githubusercontent.com/1208492/189990313-2fd7e259-e314-46b6-b811-e14a8e7795b2.png"></description></item><item><title>File Compression Notes</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/file-compression/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/file-compression/</guid><description>&lt;h2 id="efficient-large-volume-decompression">Efficient large volume (de)compression&lt;/h2>
&lt;p>When archiving large volumes of data, using parallel and highly efficient compression algorithms can be useful.
We most commonly do this when archiving old projects on the HPC.&lt;/p>
&lt;p>On Linux (and our HPC) one of the easy ways to do this is with tar with zstd compression.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">tar --use-compress-program&lt;span class="o">=&lt;/span>zstd -cvf my_archive.tar.zst /path/to/archive
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If you need to pass arguments to zstd, they can be included in quotes, e.g., &lt;code>--use-compress-program='zstd -v'&lt;/code>.&lt;/p>
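&lt;p>As a quick sanity check of the &lt;code>--use-compress-program&lt;/code> pattern, here is a round trip using gzip as a stand-in compressor (the same flag works with zstd or any compressor on the PATH; the file and directory names are examples):&lt;/p>

```shell
# Create a small directory to archive
mkdir -p demo_dir
echo "hello" > demo_dir/file.txt

# Archive with an external compressor (gzip standing in for zstd)
tar --use-compress-program=gzip -cf demo.tar.gz demo_dir

# Extract it again; tar invokes the program with -d to decompress
mkdir -p extracted
tar --use-compress-program=gzip -xf demo.tar.gz -C extracted

cat extracted/demo_dir/file.txt
```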
&lt;p>To uncompress these archives:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">tar --use-compress-program&lt;span class="o">=&lt;/span>unzstd -xvf my_archive.tar.zst
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="ignoring-failed-reads-using-tar">Ignoring failed reads using tar&lt;/h2>
&lt;p>When archiving files with tar, the archive will fail if any file cannot be read by the account doing the archiving.
This is a common occurrence when archiving on the HPC, and the unreadable files are often (but not always) hidden files that don&amp;rsquo;t need to be archived (but definitely check to make sure).
You can ignore these failed reads using the &lt;code>--ignore-failed-read&lt;/code> flag.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">tar --ignore-failed-read -cvf my_archive.tar.zst /path/to/archive
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="fixing-a-corrupted-zip-file">Fixing a corrupted zip file&lt;/h2>
&lt;h3 id="using-zip">Using zip&lt;/h3>
&lt;p>If you try to open a zip file and it won&amp;rsquo;t unzip you can often fix it by rezipping the file (&lt;a href="https://superuser.com/questions/23290/terminal-tool-linux-for-repair-corrupted-zip-files" target="_blank" rel="noopener">source&lt;/a>).&lt;/p>
&lt;p>First, try:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">zip -F corrupted.zip --out fixed.zip
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If that doesn&amp;rsquo;t work try:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">zip -FF corrupted.zip --out fixed.zip
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If you receive an error message like:&lt;/p>
&lt;blockquote>
&lt;p>zip error: Entry too big to split, read, or write (Poor compression resulted in unexpectedly large entry - try -fz)&lt;/p>
&lt;/blockquote>
&lt;p>then:&lt;/p>
&lt;ol>
&lt;li>Make sure you have at least version 3.0 of &lt;code>zip&lt;/code>&lt;/li>
&lt;li>Try adding &lt;code>-fz&lt;/code> to the command&lt;/li>
&lt;/ol>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">zip -FF -fz corrupted.zip --out fixed.zip
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="using-p7zip">Using p7zip&lt;/h3>
&lt;p>If none of this works try &lt;a href="https://7-zip.org/" target="_blank" rel="noopener">p7zip&lt;/a>, which can be installed using conda.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">conda create -n p7zip &lt;span class="nv">python&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="m">3&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">conda activate p7zip
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">conda install -c bioconda p7zip
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This version is fairly out of date, but much less so than the one in the HiPerGator module system.
The one in the HiPerGator module is too old to solve the problems we&amp;rsquo;ve seen.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">7za x corrupted.zip
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Note that this will decompress into the current working directory, not into &lt;code>./corrupted/&lt;/code>.&lt;/p>
&lt;h2 id="increasing-compression">Increasing compression&lt;/h2>
&lt;p>There is a tradeoff between how long it takes to compress something and how much smaller it gets.
When using &lt;code>zip&lt;/code> this is controlled by a numeric argument ranging from 1 (faster) to 9 (smaller).
So, if you&amp;rsquo;re archiving large objects, try using &lt;code>zip -9&lt;/code>.&lt;/p>
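&lt;p>The recompression loop below leans on bash parameter expansion to strip file extensions. As a minimal illustration of the two forms it uses (the filename is an example):&lt;/p>

```shell
f="results.zip"

# ${f%.zip} removes a trailing ".zip"
echo "${f%.zip}"   # results

# ${f%.*} removes the last extension, whatever it is
echo "${f%.*}"     # results
```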
&lt;p>If you have a bunch of already zipped files you can recompress them using the following bash loop:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> f in *.zip
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">do&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> mkdir &lt;span class="si">${&lt;/span>&lt;span class="nv">f&lt;/span>&lt;span class="p">%.*&lt;/span>&lt;span class="si">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> unzip -d &lt;span class="si">${&lt;/span>&lt;span class="nv">f&lt;/span>&lt;span class="p">%.zip&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="nv">$f&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> rm &lt;span class="nv">$f&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> rm &lt;span class="si">${&lt;/span>&lt;span class="nv">f&lt;/span>&lt;span class="p">%.*&lt;/span>&lt;span class="si">}&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">f&lt;/span>&lt;span class="p">%.*&lt;/span>&lt;span class="si">}&lt;/span>.csv
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> zip -r -9 &lt;span class="nv">$f&lt;/span> &lt;span class="si">${&lt;/span>&lt;span class="nv">f&lt;/span>&lt;span class="p">%.zip&lt;/span>&lt;span class="si">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">done&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div></description></item><item><title>Geospatial Computing from the Command Line</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/geospatial-computing/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/geospatial-computing/</guid><description>&lt;h2 id="installation">Installation&lt;/h2>
&lt;p>The commands on this page use &lt;code>gdal&lt;/code> and &lt;code>jq&lt;/code>.&lt;/p>
&lt;h3 id="conda">Conda&lt;/h3>
&lt;p>You can install these packages on any operating system using conda/mamba.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">mamba create -n my-gdal-env &lt;span class="nv">python&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="m">3&lt;/span> gdal jq
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Then activate the environment every time you want to work with gdal.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">mamba activate my-gdal-env
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="linux">Linux&lt;/h3>
&lt;p>On Ubuntu these can be installed using:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">sudo apt install gdal-bin jq
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="get-information-about-a-raster">Get information about a raster&lt;/h2>
&lt;p>&lt;code>gdalinfo&lt;/code> provides information about raster files.&lt;/p>
&lt;p>&lt;code>gdalinfo myraster.tif&lt;/code> will produce a basic readable output to the screen.&lt;/p>
&lt;p>This output can also be written to JSON&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">gdalinfo -json myraster.tif
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Writing to JSON makes it easy to use individual pieces, e.g., to look up the dimensions of the raster (you&amp;rsquo;ll need to install &lt;code>jq&lt;/code> to do this).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">gdalinfo -json myraster.tif &lt;span class="p">|&lt;/span> jq -r .size
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>or just the width of the raster&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">gdalinfo -json myraster.tif &lt;span class="p">|&lt;/span> jq -r .size&lt;span class="o">[&lt;/span>0&lt;span class="o">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="splitting-rasters-using-gdal">Splitting rasters using gdal&lt;/h2>
&lt;p>One way to split a raster into pieces is to use the &lt;code>gdal_retile.py&lt;/code> Python script bundled with &lt;code>gdal&lt;/code>.&lt;/p>
&lt;p>The following command will split &lt;code>myraster.tif&lt;/code> into 1500x1500 pixel rasters stored in &lt;code>outputdir&lt;/code>.
The first number is the width (in pixels) and the second is the height (in pixels) of each chunk.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">gdal_retile.py -ps &lt;span class="m">1500&lt;/span> &lt;span class="m">1500&lt;/span> -targetDir outputdir myraster.tif
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Files will be labeled with &lt;code>_row_col&lt;/code> and so if the original image was 4500x1500 then the above command would produce three output files:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">myraster_1_1.tif
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">myraster_2_1.tif
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">myraster_3_1.tif
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Representing the top of the original raster (&lt;code>_1_1&lt;/code>), the middle of the original raster (&lt;code>_2_1&lt;/code>), and the bottom of the original raster (&lt;code>_3_1&lt;/code>).&lt;/p>
&lt;h3 id="split-raster-into-horizontal-strips">Split raster into horizontal strips&lt;/h3>
&lt;p>Our most common usage is to split large rasters into horizontal strips with manageable file sizes (&amp;lt; 3 GB). This can be automated by changing &lt;code>myraster.tif&lt;/code> to the location of your raster and &lt;code>outputdir&lt;/code> to the directory you want the split raster pieces stored in and running the code below:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="nv">RASTER&lt;/span>&lt;span class="o">=&lt;/span>myraster.tif
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">OUTPUTDIR&lt;/span>&lt;span class="o">=&lt;/span>outputdir
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">WIDTH&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="k">$(&lt;/span>gdalinfo -json &lt;span class="nv">$RASTER&lt;/span> &lt;span class="p">|&lt;/span> jq -r .size&lt;span class="o">[&lt;/span>0&lt;span class="o">]&lt;/span>&lt;span class="k">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">WIDTHPAD&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="k">$((&lt;/span>WIDTH &lt;span class="o">+&lt;/span> &lt;span class="m">10&lt;/span>&lt;span class="k">))&lt;/span> &lt;span class="c1"># Padding prevents periodic inclusion of single pixel strip &lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">HEIGHT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="k">$(&lt;/span>expr &lt;span class="m">1000000000&lt;/span> / &lt;span class="nv">$WIDTH&lt;/span>&lt;span class="k">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">gdal_retile.py -ps &lt;span class="nv">$WIDTHPAD&lt;/span> &lt;span class="nv">$HEIGHT&lt;/span> -targetDir &lt;span class="nv">$OUTPUTDIR&lt;/span> &lt;span class="nv">$RASTER&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="combiningmerging-rasters-using-gdal">Combining/merging rasters using gdal&lt;/h2>
&lt;h3 id="merging-rasters">Merging rasters&lt;/h3>
&lt;p>One way to combine rasters is to use the &lt;code>gdal_merge.py&lt;/code> Python script bundled with &lt;code>gdal&lt;/code>.
We use LZW compression to reduce file sizes while ensuring that the resulting GeoTIFF can be used in all geospatial computing systems.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">gdal_merge.py -o output_file.tif input_file_1.tif input_file_2.tif input_file_3.tif
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">gdal_translate -co &lt;span class="nv">COMPRESS&lt;/span>&lt;span class="o">=&lt;/span>LZW -co &lt;span class="nv">PREDICTOR&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="m">2&lt;/span> -co &lt;span class="nv">BIGTIFF&lt;/span>&lt;span class="o">=&lt;/span>YES output_file.tif compressed_file.tif
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We use this two-command approach, instead of including compression in the merge command, because &lt;code>gdal_merge.py&lt;/code> doesn&amp;rsquo;t currently support BigTIFF creation correctly and many of our combined files are greater than the 4GB maximum for regular TIFFs.
&lt;code>PREDICTOR=2&lt;/code> produces more efficient compression in the presence of spatial autocorrelation, which we often have.&lt;/p>
&lt;h3 id="virtually-combining-rasters">Virtually combining rasters&lt;/h3>
&lt;p>Instead of actually merging the rasters you can create a virtual raster in a vrt file.
This file includes metadata on the positions of all of the rasters, which can be loaded into a GIS and viewed like a single raster.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">gdalbuildvrt virtual_combined_raster.vrt *.tif
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="removing-alpha-channels">Removing alpha channels&lt;/h2>
&lt;p>All of our code works with 3 band RGB rasters.
Occasionally we accidentally produce a raster that contains a 4th alpha channel.
This can be removed using GDAL.&lt;/p>
&lt;p>First check to make sure the bands you want are the first three bands (they pretty much always are):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">gdalinfo four_band_ortho.tif
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This should show something like the following info about channels:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">Band &lt;span class="m">1&lt;/span> &lt;span class="nv">Block&lt;/span>&lt;span class="o">=&lt;/span>1501x1 &lt;span class="nv">Type&lt;/span>&lt;span class="o">=&lt;/span>Byte, &lt;span class="nv">ColorInterp&lt;/span>&lt;span class="o">=&lt;/span>Red
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Band &lt;span class="m">2&lt;/span> &lt;span class="nv">Block&lt;/span>&lt;span class="o">=&lt;/span>1501x1 &lt;span class="nv">Type&lt;/span>&lt;span class="o">=&lt;/span>Byte, &lt;span class="nv">ColorInterp&lt;/span>&lt;span class="o">=&lt;/span>Green
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Band &lt;span class="m">3&lt;/span> &lt;span class="nv">Block&lt;/span>&lt;span class="o">=&lt;/span>1501x1 &lt;span class="nv">Type&lt;/span>&lt;span class="o">=&lt;/span>Byte, &lt;span class="nv">ColorInterp&lt;/span>&lt;span class="o">=&lt;/span>Blue
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Band &lt;span class="m">4&lt;/span> &lt;span class="nv">Block&lt;/span>&lt;span class="o">=&lt;/span>1501x1 &lt;span class="nv">Type&lt;/span>&lt;span class="o">=&lt;/span>Byte, &lt;span class="nv">ColorInterp&lt;/span>&lt;span class="o">=&lt;/span>Alpha
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Then use &lt;code>gdal_translate&lt;/code> to keep just the first three bands:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">gdal_translate -b &lt;span class="m">1&lt;/span> -b &lt;span class="m">2&lt;/span> -b &lt;span class="m">3&lt;/span> four_band_ortho.tif three_band_ortho.tif
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;a href="https://support.skycatch.com/hc/en-us/articles/219178537-Removing-the-Alpha-Channel-from-Your-Orthotiff" target="_blank" rel="noopener">Original source&lt;/a>&lt;/p></description></item><item><title>Collaborating Using Git &amp; GitHub</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/git-collaboration/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/git-collaboration/</guid><description>&lt;h2 id="intro">Intro&lt;/h2>
&lt;p>This is intended to be a default set of procedures for Weecologists to collaborate using a Git/GitHub repository. For projects that are primarily being worked on by one person, this is probably unnecessary, but you may want to follow it anyway to ingrain the workflow practices.&lt;/p>
&lt;h2 id="setup">Setup&lt;/h2>
&lt;p>&lt;em>If you haven&amp;rsquo;t done so already, please check out the onboarding &lt;a href="https://deploy-preview-83--weecology-wiki.netlify.app/docs/getting-started/new-member-onboarding/#git-and-github">section with links to Git and GitHub resources&lt;/a>.&lt;/em>&lt;/p>
&lt;p>In this guide, we presume that there is a single repo on GitHub and multiple users, who work on clones of that repo (on their local machines), and interface through GitHub.&lt;/p>
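The setup above can be sketched from the command line. This sketch uses a throwaway local bare repository to stand in for GitHub (the <code>hub.git</code> and <code>portalr</code> names here are made up for illustration); for a real project you would clone the repository's GitHub URL instead:

```shell
# Create a scratch directory with a bare repo standing in for GitHub
tmp=$(mktemp -d)
cd "$tmp"
git init --bare hub.git

# Each collaborator makes a local clone to work in
git clone hub.git portalr
cd portalr

# The clone automatically gets a remote named "origin" pointing at the source
git remote -v
```

Cloning a real GitHub repo works the same way: `git clone https://github.com/weecology/portalr.git` creates the local copy and names the GitHub repo `origin`.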
&lt;h2 id="branching">Branching&lt;/h2>
&lt;p>One way of thinking about git branches is that each branch represents a &amp;ldquo;lineage&amp;rdquo; of commits in a repo. By default, git repos have a &lt;code>master&lt;/code> branch (newer versions of git default to &lt;code>main&lt;/code>), and adding commits to a new repo will create iterative versions of the project, all considered to be part of the &lt;code>master&lt;/code> branch.&lt;/p>
&lt;p>You can see the branches in your project using &lt;code>git branch&lt;/code> from the command line while in the folder with a git repo. This will list the branches in the repo:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">~/projects/portalr &amp;gt; git branch
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl"> biomass-function
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> hao-data-vignette
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">* master
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Here, &lt;code>master&lt;/code> is marked with an asterisk (and possibly a different color) to indicate that it is the &amp;ldquo;active&amp;rdquo; branch. What this means is that new commits added to the repo will be derived from the end of the master branch and included as part of that branch.&lt;/p>
&lt;h3 id="making-new-branches">Making New Branches&lt;/h3>
&lt;p>We can create new branches by specifying a new branch name when using the &lt;code>git branch&lt;/code> command. This allows us to start a new &amp;ldquo;lineage&amp;rdquo; of commits from the current state of the repo.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">~/projects/portalr &amp;gt; git branch hao-test-branch
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>When we look at the branches, we now see:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">~/projects/portalr &amp;gt; git branch
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl"> biomass-function
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> hao-data-vignette
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> hao-test-branch
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">* master
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Notice that the active branch is still &amp;ldquo;master&amp;rdquo;.&lt;/p>
&lt;h3 id="switching-branches">Switching Branches&lt;/h3>
&lt;p>To change the active branch, we use the &lt;code>git checkout&lt;/code> command:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">~/projects/portalr &amp;gt; git checkout hao-test-branch
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">Switched to branch &amp;#39;hao-test-branch&amp;#39;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This is what it looks like when we run &lt;code>git branch&lt;/code> afterward:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">~/projects/portalr &amp;gt; git branch
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl"> biomass-function
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> hao-data-vignette
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">* hao-test-branch
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> master
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="pushing-to-github">Pushing to GitHub&lt;/h3>
&lt;p>After we have created a branch on our local clone of the repo, and made some commits, we might want to push those commits to GitHub. The first time we do so, however, we encounter an error:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">~/projects/portalr &amp;gt; git push
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">fatal: The current branch hao-test-branch has no upstream branch.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">To push the current branch and set the remote as upstream, use
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> git push --set-upstream origin hao-test-branch
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The reason for this error is that the repo on GitHub does not have the branch &lt;code>hao-test-branch&lt;/code>, and commits have to be assigned to a branch. The suggested command does several things at once:&lt;/p>
&lt;ol>
&lt;li>create a branch called &lt;code>hao-test-branch&lt;/code> on the GitHub repo (which has the remote name &lt;code>origin&lt;/code>)&lt;/li>
&lt;li>establish a link between the local branch called &lt;code>hao-test-branch&lt;/code> and the GitHub branch called &lt;code>hao-test-branch&lt;/code>&lt;/li>
&lt;li>push the local commits on &lt;code>hao-test-branch&lt;/code> to GitHub.&lt;/li>
&lt;/ol>
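Here is a self-contained sketch of that first push, again using a throwaway local bare repo in place of GitHub (names like <code>hub.git</code> and <code>hao-test-branch</code> are illustrative):

```shell
# Set up a throwaway "GitHub" remote and a working clone
tmp=$(mktemp -d)
cd "$tmp"
git init --bare hub.git
git clone hub.git work
cd work
git config user.email "you@example.com"
git config user.name "Your Name"
git commit --allow-empty -m "initial commit"

# Create and switch to a new branch; the remote does not know about it yet
git branch hao-test-branch
git checkout hao-test-branch

# First push: create the branch on origin and link it to the local branch
git push --set-upstream origin hao-test-branch

# The local branch now tracks origin/hao-test-branch
git branch -vv
```

After this, a plain `git push` or `git pull` on that branch knows where to go.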
&lt;h3 id="pulling-from-github">Pulling from GitHub&lt;/h3>
&lt;p>Suppose someone has started making an update, pushed it to GitHub, and wants your help before merging it into the master branch. How do you download that new branch?&lt;/p>
&lt;p>First, make sure we get all the information from the GitHub repo. This assumes that the GitHub repo is named as the &amp;ldquo;origin&amp;rdquo; remote (which is the default).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">~/projects/portalr &amp;gt; git fetch origin
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We can then view the possible branches using&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">~/projects/portalr &amp;gt; git branch -r
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-gdscript3" data-lang="gdscript3">&lt;span class="line">&lt;span class="cl"> &lt;span class="n">origin&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">biomass&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">function&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">origin&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">fix&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">test&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">origin&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">hao&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">data&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">vignette&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">origin&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">hao&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="k">export&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">obs&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="k">func&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">origin&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">hao&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">loadData&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">update&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">origin&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">hao&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">reorder&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">args&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">remove&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">incomplete&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">censuses&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">origin&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">master&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">origin&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">namespace_issue&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">origin&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">namespaceissues&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">origin&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">standardize_column_names&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We want to create a local branch to mirror the &amp;ldquo;fix-test&amp;rdquo; branch:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">~/projects/portalr &amp;gt; git checkout -b fix-test origin/fix-test
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">Branch fix-test set up to track remote branch fix-test from origin.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Switched to a new branch &amp;#39;fix-test&amp;#39;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This has done several things: it retrieved the branch from GitHub to our local machine, set up tracking, and changed the current active branch. Now, if we make new commits to the local copy of the branch, we are able to push directly to that corresponding branch on GitHub.&lt;/p>
&lt;h2 id="pull-requests">Pull Requests&lt;/h2>
&lt;p>The preference is to use GitHub to merge the updates on a new branch back into &lt;code>master&lt;/code>. We can do this by going to the &amp;ldquo;Pull requests&amp;rdquo; tab on the GitHub repo page and creating a &amp;ldquo;New pull request&amp;rdquo;.
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/github_PR_tab.png" alt="" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Suppose we want to merge from &lt;code>hao-test-branch&lt;/code> into &lt;code>master&lt;/code>. Then we select &lt;code>master&lt;/code> as the &amp;ldquo;base:&amp;rdquo; branch, and &lt;code>hao-test-branch&lt;/code> as the &amp;ldquo;compare:&amp;rdquo; branch. We can then write some comments for our new pull request before clicking on &amp;ldquo;Create new pull request&amp;rdquo;.&lt;/p>
&lt;p>&lt;em>If the pull request fixes an issue, you can include keywords to &lt;a href="https://help.github.com/articles/closing-issues-using-keywords/" target="_blank" rel="noopener">automagically close&lt;/a> the issue when the pull request is merged.&lt;/em>&lt;/p>
&lt;h3 id="updating-pull-requests">Updating Pull Requests&lt;/h3>
&lt;p>At this point, other people can comment on the pull request itself in GitHub, if discussion regarding the changes needs to occur.&lt;/p>
&lt;p>Additionally, assuming that the pull request has not yet been merged, further commits &lt;em>to that branch on GitHub&lt;/em> are automatically included with the pull request. Thus, if you later find a bug, you can make further changes and not have to submit a new pull request.&lt;/p>
&lt;h3 id="merging-pull-requests">Merging Pull Requests&lt;/h3>
&lt;p>In general, check with one of the repo maintainers about merging pull requests. This ensures that the &lt;code>master&lt;/code> branch doesn&amp;rsquo;t break (too often) and that everyone is informed about changes.&lt;/p>
&lt;h2 id="summary-example">Summary Example&lt;/h2>
&lt;p>Objective: I want to fix issue #1 in the &lt;a href="https://github.com/weecology/portalr" target="_blank" rel="noopener">https://github.com/weecology/portalr&lt;/a> repo.&lt;/p>
&lt;ol>
&lt;li>Download the repo from GitHub and onto my local machine. [&lt;code>git clone&lt;/code>]&lt;/li>
&lt;li>In my local machine, create a new branch (e.g. &lt;code>hao-add-biomass-function&lt;/code>; prefacing the branch name with your name helps prevent branch name collisions). [&lt;code>git branch&lt;/code>]&lt;/li>
&lt;li>Switch to the new branch. [&lt;code>git checkout&lt;/code>]&lt;/li>
&lt;li>Make the updates on my local machine. [&lt;code>git commit&lt;/code>]&lt;/li>
&lt;li>Push the updates to GitHub. [&lt;code>git push&lt;/code>]&lt;/li>
&lt;li>Create the pull request on GitHub. [GitHub web interface]&lt;/li>
&lt;li>Merge the pull request on GitHub. [GitHub web interface]&lt;/li>
&lt;li>On my local machine, switch back to the master branch. [&lt;code>git checkout&lt;/code>]&lt;/li>
&lt;li>Get the updates to the master branch. [&lt;code>git pull&lt;/code>]&lt;/li>
&lt;li>(optionally) Delete the branch on GitHub. [GitHub web interface, &amp;ldquo;Code&amp;rdquo; tab, &amp;ldquo;## branches&amp;rdquo;]&lt;/li>
&lt;li>(optionally) Delete the branch on my local machine. [&lt;code>git branch -d hao-add-biomass-function&lt;/code>]&lt;/li>
&lt;/ol></description></item><item><title>Git Tips</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/git-tips/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/git-tips/</guid><description>&lt;h2 id="deleting-all-merged-branches-locally">Deleting all merged branches (locally)&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">git switch main
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">git branch --merged &lt;span class="p">|&lt;/span> egrep -v &lt;span class="s2">&amp;#34;(^\*|master|main|dev)&amp;#34;&lt;/span> &lt;span class="p">|&lt;/span> xargs git branch -d
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Source: &lt;a href="https://stackoverflow.com/a/6127884/133513" target="_blank" rel="noopener">https://stackoverflow.com/a/6127884/133513&lt;/a>&lt;/p>
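To see what the `egrep` filter in the pipeline above actually keeps, you can feed it some fake `git branch --merged` output (the branch names here are made up):

```shell
# Lines matching the pattern (the current branch, master, main, dev) are dropped;
# everything that survives would be passed to `git branch -d`
printf '  biomass-function\n* main\n  master\n  old-feature\n' \
  | egrep -v "(^\*|master|main|dev)"
# prints:
#   biomass-function
#   old-feature
```

Note that the pattern matches substrings, so a merged branch named, say, `remaining-fixes` would also be excluded from deletion because it contains `main`; that errs on the safe side (nothing extra gets deleted).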
&lt;h2 id="pretty-git-log">Pretty git Log&lt;/h2>
&lt;p>(Hao) I have this in my &lt;code>~/.gitconfig&lt;/code> file to enable the &lt;code>git lg&lt;/code> alias on the command line.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">[alias]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">lg1 = log --graph --abbrev-commit --decorate --format=format:&amp;#39;%C(bold blue)%h%C(reset) - %C(bold green)(%ar)%C(reset) %C(white)%&amp;lt;(40,trunc)%s%C(reset) %C(reverse white)- %an%C(reset)%C(bold yellow)%d%C(reset)&amp;#39; --all
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">lg2 = log --graph --abbrev-commit --decorate --format=format:&amp;#39;%C(bold blue)%h%C(reset) - %C(bold cyan)%aD%C(reset) %C(bold green)(%ar)%C(reset)%C(bold yellow)%d%C(reset)%n&amp;#39;&amp;#39; %C(white)%s%C(reset) %C(dim white)- %an%C(reset)&amp;#39; --all
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">lg = !&amp;#34;git lg1&amp;#34;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div></description></item><item><title>GitHub Actions</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/github-actions/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/github-actions/</guid><description>&lt;h2 id="generic-tutorial">Generic Tutorial&lt;/h2>
&lt;p>This is an example tutorial on how to set up a GitHub Actions workflow.
If the project already has a Travis CI file, most of the work is copying and pasting the commands into the different stages of the online template provided by GitHub.&lt;/p>
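For reference, workflows live as YAML files under `.github/workflows/` in the repo. This is a minimal, hypothetical example (the file name, triggers, and test command are placeholders to adapt):

```yaml
# .github/workflows/tests.yml (hypothetical example)
name: tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: echo "replace with your test command"
```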
&lt;p>&lt;a href="https://www.youtube.com/watch?v=F3wZTDmHCFA" target="_blank" rel="noopener">https://www.youtube.com/watch?v=F3wZTDmHCFA&lt;/a>&lt;/p></description></item><item><title>Globus for file transfer</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/globus/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/globus/</guid><description>&lt;p>&lt;a href="https://app.globus.org" target="_blank" rel="noopener">Globus&lt;/a> is a useful file manager for transferring files between your computer and the HiPerGator, or from place to place on the HiPerGator.&lt;/p>
&lt;p>&lt;a href="https://help.rc.ufl.edu/doc/Globus" target="_blank" rel="noopener">Here is the HiPerGator guide to Globus.&lt;/a> It may be more up to date.&lt;/p>
&lt;p>Briefly, to use Globus, you need to create an account with your UF login. You may need to request authorization for UF Research Computing access. Once you have access, you can use the online Globus interface to set up file transfers between different locations (called &amp;ldquo;endpoints&amp;rdquo;) on the HiPerGator.&lt;/p>
&lt;p>To transfer files to and from your computer, you need to install the Globus Connect Personal client on your computer and set it up as an &amp;ldquo;endpoint&amp;rdquo;. Then you can set up transfers to and from your computer as well.&lt;/p></description></item><item><title>HiPerGator Intro Guide</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/hipergator-intro-guide/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/hipergator-intro-guide/</guid><description>&lt;h1 id="so-you-want-to-run-your-r-or-python-script-on-the-hipergator">So you want to run your R or python script on the HiPerGator&lt;/h1>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>This guide gives a high level overview of how one goes about running R or python scripts on a high performance cluster (HPC). There are no coding examples here; instead it is designed to give you a frame of reference for how to approach things, and where other more detailed tutorials fit in the larger picture. The expected user is someone who is comfortable doing analysis and writing scripts in R using RStudio, or python with various IDEs, and has no HPC experience.&lt;/p>
&lt;p>This is written for users of the &lt;a href="https://www.rc.ufl.edu/get-started/hipergator/" target="_blank" rel="noopener">UFL HiPerGator&lt;/a> who code in R or python, but most of the information will apply to any HPC system and any scripting language.&lt;/p>
&lt;h2 id="hpc-use-cases">HPC Use cases&lt;/h2>
&lt;p>There are two scenarios where you may want to run your analysis script on the HPC.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Your code takes a very long time to run&lt;/strong>&lt;br>
If your code takes several hours, days, or more to run on your personal computer, then using an HPC is likely a good option for two reasons. First, the servers on an HPC have more powerful processors than desktops and laptops, so with minimal changes your code will run significantly faster. Second, scripts on an HPC run independently of your personal computer, so you can shut down your personal computer while the script runs overnight or over the weekend. HPC systems can have a time limit of several weeks to a month for any single job.&lt;br>
If your script takes so long that it seems like it will never finish then an HPC can be especially beneficial. Say you do a test run on 1% of your data and it takes 2 days to run. Theoretically it will then take 200 days to run on your full dataset. In this case the benefit of the HPC is its parallel processing power.&lt;br>
By default, R and python scripts run on a single processor. Most computers today have 4-8 processors, though, and HPC servers have upwards of 64. If you spread the work out to multiple processors you can significantly decrease the run time. For example: a script that takes 1 hour to run can potentially take 30 minutes with 2 processors, 15 minutes with 4 processors, 7.5 minutes with 8 processors, and so on. There is computational overhead with parallel processing, though, so halving the time by doubling the number of processors is only a rough rule. Making your code parallel is not a straightforward change and it will take time. See below about making scripts parallel and whether it’s even worth it.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Your script fills up your computer&amp;rsquo;s memory and crashes when it runs.&lt;/strong>&lt;br>
When you have large datasets, such as raster files, it’s easy to use up all the memory and freeze your computer. As opposed to just waiting on long running scripts, in this case it makes it nearly impossible to do analysis. Just like HPC servers have powerful processors, they also have extremely large amounts of memory. Usually greater than 100GB. There is a good chance that they can handle whatever large datasets you throw at them.&lt;/p>
&lt;/li>
&lt;/ol>
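The parallel speedup described above can be illustrated even at the shell level: the same idea (split the work into chunks and hand them to several processors at once) is what R's `parallel` package or python's `multiprocessing` do internally. A toy sketch with `xargs` (the chunk numbers and echo command are stand-ins for real work):

```shell
# Run 8 independent "chunks" of work, at most 4 at a time (-P 4)
seq 1 8 | xargs -P 4 -I{} sh -c 'echo "processed chunk {}"'
```

In real use, each chunk would be a call to your script with different inputs (e.g. a hypothetical `Rscript analysis.R chunk_3`), and the output lines may appear in any order because the chunks run concurrently.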
&lt;h2 id="should-i-bother-with-an-hpc">Should I bother with an HPC?&lt;/h2>
&lt;p>Your analysis and data can be of &lt;em>any&lt;/em> size. There is no minimum computational requirement to use an HPC. But understand there is a time cost involved with learning how to interact with an HPC and also optimizing your code so it runs most efficiently. Therefore, in some cases it isn’t worth porting scripts to an HPC system.&lt;/p>
&lt;p>Consider an example where you have a script that takes 1 hour on your laptop, and you must run it once a month. It’s likely reasonable to just keep that workflow. But if a script takes 10 hours and you must run it once a week, then it’s worth considering doing it on an HPC. Especially since that will decrease the wear and tear on your laptop and free up that 10 hours for other uses.&lt;/p>
&lt;p>Whether it’s worth it or not is unique to every situation. Also remember that once you learn all the HPC basics the first time, then that time cost isn&amp;rsquo;t needed for your next project.&lt;/p>
&lt;p>Also consider that the two use cases described above might also be solvable by code optimization. If you can find a section of code which is slow and make it run fast enough to meet your needs, that is preferable over running the code on an HPC. There is no one solution to this, but a good starting point is Hadley Wickham’s Advanced R tutorial on &lt;a href="http://adv-r.had.co.nz/Performance.html" target="_blank" rel="noopener">Performance and Profiling&lt;/a>. This &lt;a href="https://youtu.be/K_90QGUPYCA" target="_blank" rel="noopener">45 minute video&lt;/a> also gives a great overview of profiling, optimization, parallel processing, and the implications in R.&lt;/p>
&lt;h2 id="what-exactly-is-an-hpc">What exactly is an HPC?&lt;/h2>
&lt;p>A high performance cluster (HPC) is primarily two things.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>It’s hundreds of individual servers in a data center. Each server is a computer just like your personal computer, but has more powerful components, and does not have a graphical user interface or even a monitor. You interact with the servers via the command line. If you’ve never used the command line, imagine that the RStudio console or a python prompt was your &lt;em>only&lt;/em> way to interact with a computer. More on this below.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>It’s a system for scheduling, prioritizing, and running scripts from hundreds of users. This is how the hundreds of servers can be used as “one”. Access to them is controlled by scheduling programs, which put your scripts in a queue to be run when resources are free. Slurm is probably the most popular scheduler (and the one used on the HiPerGator) but some HPC systems may use other ones like &lt;a href="https://en.wikipedia.org/wiki/Job_scheduler#Batch_queuing_for_HPC_clusters" target="_blank" rel="noopener">PBS or MOAB&lt;/a>.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="primary-steps-to-running-your-code-on-an-hpc">Primary Steps to running your code on an HPC.&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>You need an account&lt;/strong>.&lt;br>
Sign up for a HiPerGator account &lt;a href="https://www.rc.ufl.edu/access/account-request/" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>You must be able to log in to the HPC via SSH and use the command line.&lt;/strong>&lt;br>
The command line can also be referred to as the “unix shell”. With this you use text commands (just like the RStudio console) to copy files, edit text files, interact with the scheduler, view job status, etc. See the &lt;a href="https://help.rc.ufl.edu/doc/Getting_Started#Connecting_to_HiPerGator" target="_blank" rel="noopener">HiPerGator connection guide&lt;/a>.&lt;/p>
&lt;p>Some unix tutorials:&lt;/p>
&lt;ul>
&lt;li>SCINet &lt;a href="https://geospatial.101workbook.org/IntroductionToCommandLine/Unix/unix-basics-1.html" target="_blank" rel="noopener">Geospatial Unix Intro&lt;/a>&lt;/li>
&lt;li>Data Carpentry &lt;a href="https://swcarpentry.github.io/shell-novice/" target="_blank" rel="noopener">tutorial on the unix shell&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>If you work on a Mac computer you have a full unix shell available already called the Terminal. For Windows users there are several options available. See the bottom of the Setup page for the Data Carpentry unix shell tutorial.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>You must optimize your code to run on the HPC.&lt;/strong>&lt;br>
This is potentially the trickiest part. At a minimum your code must be able to run independently without any interaction from you. Do you have one large (or even several) analysis script where you highlight different parts to run in the correct order? Or to check output before moving on? That will not work on an HPC. A single R or python file (aka a script) must run from start to finish and write some results to a file to be able to be useful on the HPC.&lt;/p>
&lt;p>For R, a good test for this is the Jobs tab in RStudio (next to the Console and Terminal tabs; not available in older versions of RStudio). This is &lt;em>very&lt;/em> analogous to running a script in an HPC environment. If your script can run as an RStudio Job &lt;em>without&lt;/em> copying the local environment or copying the job results anywhere (your script should write results to some file) then it should be able to run on the HPC.&lt;/p>
&lt;p>For python a good test is being able to run the script via the command line (ie. &lt;code>python my_analysis.py&lt;/code>). If you are using an IDE (like spyder or pycharm) then running a full script from start to finish using the “Run” option should also be sufficient.&lt;/p>
&lt;p>Having a script run without any interaction does not necessarily mean it needs to have parallel processing. See Should I make my code run parallel? below.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>You have to get your code and data onto the HPC.&lt;/strong>&lt;br>
You’ll need to use special programs to transfer files (both data and scripts) from your local computer to the HPC. For Windows this will be the WinSCP program, which uses the same username and password as logging into the command line. For Mac or Linux users you can use the Terminal to transfer files via the command line using the &lt;code>scp&lt;/code> command. Read more about the &lt;a href="https://scinet.usda.gov/guides/data/datatransfer#small-data-transfer-using-scp-and-rsync" target="_blank" rel="noopener">scp command here&lt;/a>. More on data transfer for the HiPerGator can be found in its documentation. Something you’ll see mentioned a lot is Globus, which is a useful (but not strictly required) tool when you need to transfer 100GB+ of data.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>You need to ensure you have the correct packages.&lt;/strong>&lt;br>
Most HPC systems will have common packages installed and ready to use. If not you’ll have to install them yourself. If you do this then the latest versions will be installed on the HPC, so it’s good practice to make sure all packages on your personal computer are up to date to so they match (in RStudio use Tools-&amp;gt; Check for package updates, in python use conda or pip to update all package to the latest version).
For python packages for your projects you’ll want to use environments with either conda or python virtual environments.&lt;/p>
&lt;p>If you run into errors installing R or python packages you&amp;rsquo;ll likely need to contact HPC support for help, especially if the errors involve missing system libraries. If you successfully install your own packages they will only be available to you, and not to anyone else using the HPC.&lt;/p>
&lt;p>Also take note of the &lt;a href="https://help.rc.ufl.edu/doc/Modules_Basic_Usage" target="_blank" rel="noopener">module&lt;/a> command on HiPerGator. This is used to load pre-installed software, including R and python themselves. It is covered in most tutorials about batch scripts (see next section).&lt;/p>
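&lt;p>One quick way to check that the two machines match is to print the versions of your key packages on both and compare. A minimal python sketch (the package list here is hypothetical; substitute your own dependencies):&lt;/p>

```python
import importlib.metadata

def report_versions(packages):
    """Return {package: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            versions[name] = None  # needs installing on this machine
    return versions

# Hypothetical dependency list; run the same check locally and on the HPC.
print(report_versions(["pip", "numpy"]))
```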
&lt;/li>
&lt;li>
&lt;p>&lt;strong>You can now submit your scripts to the scheduler.&lt;/strong>&lt;br>
Once your data and scripts are on the HPC system you can submit them to the scheduler to run. This involves defining “jobs” where you tell the scheduler what you need. Specifically: the location of your script, the resources needed (cpus and memory), the time needed to run your script, the location for putting script output and logs, etc.&lt;/p>
&lt;p>Jobs are defined via batch scripts which have a line for each piece of information.&lt;/p>
&lt;p>Some examples:&lt;/p>
&lt;ul>
&lt;li>A &lt;a href="https://geospatial.101workbook.org/Workshops/2-Session2-intro-to-ceres.html#batch-computing-on-ceres" target="_blank" rel="noopener">USDA tutorial on batch scripts&lt;/a>.&lt;/li>
&lt;li>A &lt;a href="https://help.rc.ufl.edu/doc/Sample_SLURM_Scripts" target="_blank" rel="noopener">sample of Hipergator batch scripts&lt;/a>.&lt;/li>
&lt;li>An &lt;a href="https://help.rc.ufl.edu/doc/Annotated_SLURM_Script" target="_blank" rel="noopener">annotated Hipergator batch script&lt;/a>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>You might need to debug your script if it doesn’t run correctly.&lt;/strong>&lt;br>
It’s very common for scripts not to run the first time because they were written on a personal computer, where things like directory paths and package versions may differ. In this case it’s useful to debug your script on the HPC directly.
A good place to do this is an “interactive node” or “interactive session”. Here, instead of submitting a job to the queue, you request a new Unix shell with a small amount of resources attached. You can then run your scripts via the Rscript or python command, see the output directly, and make adjustments until they run successfully.&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://help.rc.ufl.edu/doc/Development_and_Testing" target="_blank" rel="noopener">HiperGator guide on interactive sessions&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://help.rc.ufl.edu/doc/GPU_Access#Interactive_Access" target="_blank" rel="noopener">HiperGator guide on interactive sessions when using GPUs&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>You can now get your results back.&lt;/strong>&lt;br>
This is the same process as putting scripts and data onto the HPC but in reverse.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="should-i-make-my-code-run-on-parallel-processes">Should I make my code run on parallel processes?&lt;/h2>
&lt;p>Before you dive into making your script parallel, do a quick cost/benefit analysis. It may take a full day or more to redo your code to take advantage of parallel processing, but the benefits can be extremely large. On the other hand, if your code already runs in a relatively short time (say, a few hours on your laptop and less than an hour on the HPC without any modification) and you&amp;rsquo;re happy with that, then making it parallel might not be worth it.&lt;/p>
&lt;p>If you do not use parallel processing then your jobs will always request just a single processor. This is perfectly fine as there is no minimum requirement for using an HPC.&lt;/p>
&lt;h2 id="make-your-scripts-use-parallel-processing">Make your scripts use parallel processing&lt;/h2>
&lt;p>By default R and python run on a single processor, while most computers today have 4-8. If you spread the work across multiple processors you can significantly decrease the time it takes to run.
For example: a script that takes 1 hour to run can potentially take 30 minutes with 2 processors, or 15 minutes with 4. To make your scripts run across multiple processors, you&amp;rsquo;ll have to make some adjustments to your code.&lt;/p>
&lt;p>For R users, if your code uses lapply to apply your main function to many items (e.g. fitting the same model to many species), you can swap it for mclapply from the parallel package without making any substantial changes. For more details and advanced uses, here are some short tutorials:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://beckmw.wordpress.com/2014/01/21/a-brief-foray-into-parallel-processing-with-r/" target="_blank" rel="noopener">A brief foray into parallel processing with R&lt;/a>&lt;/li>
&lt;li>&lt;a href="http://resbaz.github.io/r-intermediate-gapminder/19-foreach.html" target="_blank" rel="noopener">Software Carpentry Parallel Processing in R&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://swcarpentry.github.io/python-intermediate-mosquitoes/04-multiprocessing.html" target="_blank" rel="noopener">Software Carpentry Parallel Processing in python&lt;/a>&lt;/li>
&lt;/ul>
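&lt;p>The same swap-a-serial-map-for-a-parallel-one pattern is available in python through the standard library&amp;rsquo;s &lt;code>multiprocessing.Pool&lt;/code>. A minimal sketch (the &lt;code>slow_square&lt;/code> function is a hypothetical stand-in for your real per-item analysis):&lt;/p>

```python
from multiprocessing import Pool

def slow_square(x):
    # Hypothetical stand-in for an expensive per-item task,
    # e.g. fitting the same model to one species.
    return x * x

if __name__ == "__main__":
    items = list(range(10))

    # Serial version.
    serial = [slow_square(x) for x in items]

    # Parallel version: same call shape, spread over 2 worker processes.
    with Pool(processes=2) as pool:
        parallel = pool.map(slow_square, items)

    assert serial == parallel
```

On the HPC you would raise &lt;code>processes&lt;/code> to match the number of CPUs requested in your batch script.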
&lt;p>Some notes:&lt;/p>
&lt;ul>
&lt;li>If your code already uses functions and for loops, it should be straightforward to make it parallel, unless each pass through the loop depends on the outcome from previous passes.&lt;/li>
&lt;li>On your own computer, never set the number of processors used to the maximum available. This would take away the processing power needed to run the operating system, browser, and other programs, and could potentially crash your computer. To test out parallel code on my computer I set the number of processors to 2 (out of 8 available), then increase it once the scripts are moved to the HPC.&lt;/li>
&lt;/ul>
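&lt;p>One way to follow the second note above is to compute the worker count from the environment, so the same script behaves sensibly on a laptop and under SLURM. A sketch (the reserve of 2 cores mirrors the convention described above):&lt;/p>

```python
import os

def safe_worker_count(reserve=2):
    """Pick a worker-process count that won't starve the rest of the system."""
    # Under SLURM, respect the job's allocation rather than the physical cores.
    slurm = os.environ.get("SLURM_CPUS_PER_TASK")
    if slurm is not None:
        return int(slurm)
    # On a laptop, leave `reserve` cores free for the OS and other programs.
    available = os.cpu_count() or 1
    return max(1, available - reserve)

print(safe_worker_count())
```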
&lt;h2 id="what-about-distributed-computing">What about distributed computing?&lt;/h2>
&lt;p>The links and examples for parallel computing above show you how to use the multiple processors in a single system. In the case of the HPC this means up to (usually) 64-128 processors on a single server. But what if you still need more processing power? In that case it&amp;rsquo;s possible to write parallel code which takes advantage of the processors on &lt;em>numerous&lt;/em> individual servers. This is how one uses hundreds or even thousands of processors.&lt;/p>
&lt;p>Going from single-system parallel processing to distributed computing is possible but will likely take even more work on your part. Here you might come across tutorials using MPI. MPI (Message-Passing Interface) is a protocol that lets analysis scripts communicate between servers in an HPC environment, enabling distributed computing. Packages for MPI are available in all common languages, including R, python, Julia, and MATLAB.&lt;/p>
&lt;p>Newer packages are available which either handle MPI in the background for you or implement newer protocols. The R package &lt;a href="https://mllg.github.io/batchtools/index.html" target="_blank" rel="noopener">batchtools&lt;/a> has many high-level functions for distributed computing. The python package &lt;a href="https://docs.dask.org/" target="_blank" rel="noopener">dask&lt;/a> is a state-of-the-art package for distributed computing, and the accompanying &lt;a href="https://jobqueue.dask.org" target="_blank" rel="noopener">jobqueue&lt;/a> package integrates it with SLURM and other HPC schedulers.&lt;/p>
&lt;h2 id="other-considerations-and-important-points">Other considerations and important points&lt;/h2>
&lt;p>The Research Computing group has a &lt;a href="https://help.rc.ufl.edu/doc/UFRC_Help_and_Documentation" target="_blank" rel="noopener">wiki on HiperGator usage&lt;/a>.&lt;/p>
&lt;p>Some common tasks and scripts are outlined in the &lt;a href="https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/hipergator-reference/">HiPerGator Reference Guide&lt;/a>.&lt;/p>
&lt;p>&lt;strong>Login Node&lt;/strong>: When you sign into the HPC there is a single landing server which you’ll start on. It’s important to never run actual scripts on this initial server. It should be used to submit jobs, request development nodes or interactive sessions, or transfer data in or out.&lt;/p>
&lt;p>&lt;strong>Partitions&lt;/strong>: HiperGator resources are divided into partitions, where each partition has a specific set of hardware and its own resource and time limits. Whenever you request resources you’ll specify which partition you want to use. See the &lt;a href="https://help.rc.ufl.edu/doc/Available_Node_Features" target="_blank" rel="noopener">partitions wiki page&lt;/a>.&lt;/p>
&lt;p>&lt;strong>Account limits&lt;/strong>: The resources you request (e.g. number of processors and amount of memory) are limited by how many credits your group has purchased. The number of jobs which can run concurrently is also determined by this. See more on the &lt;a href="https://help.rc.ufl.edu/doc/Account_and_QOS_limits_under_SLURM" target="_blank" rel="noopener">account limits wiki page&lt;/a>. This is also referred to as QOS (quality of service), a term coined in the &lt;a href="https://en.wikipedia.org/wiki/Quality_of_service" target="_blank" rel="noopener">early internet days&lt;/a>.&lt;/p>
&lt;p>&lt;strong>Processors/CPU/Cores/Sockets/Threads&lt;/strong>: Each of these things is technically different and has a distinct definition. For most users of an HPC system they can be thought of interchangeably though. When you request resources via a batch script, or other method, you’ll usually ask for multiple CPUs to implement parallel processing and leave it at that. Advanced users can read about different terminology &lt;a href="https://login.scg.stanford.edu/faqs/cores/" target="_blank" rel="noopener">here&lt;/a> or &lt;a href="https://slurm.schedmd.com/mc_support.html" target="_blank" rel="noopener">here&lt;/a>.&lt;/p></description></item><item><title>HiPerGator Reference</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/hipergator-reference/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/hipergator-reference/</guid><description>&lt;h2 id="what-is-hipergator">What is HiperGator?&lt;/h2>
&lt;p>A University of Florida supercomputing cluster.&lt;/p>
&lt;h2 id="why-should-i-use-it">Why should I use it?&lt;/h2>
&lt;p>HiperGator gives you access to large amounts of processing power, memory, and storage. This is useful for projects which can&amp;rsquo;t be run on your local laptop.&lt;/p>
&lt;h2 id="how-do-i-access-it">How do I access it?&lt;/h2>
&lt;ol start="0">
&lt;li>
&lt;p>&lt;a href="https://gravity.rc.ufl.edu/access/request-account/" target="_blank" rel="noopener">Request an account&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Connect with &lt;code>ssh &amp;lt;YOUR_USERNAME&amp;gt;@hpg2.rc.ufl.edu&lt;/code> from the Unix terminal or a Windows SSH client (&lt;a href="https://help.rc.ufl.edu/doc/Getting_Started" target="_blank" rel="noopener">more info here&lt;/a>). Enter your password when prompted.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>Need help with command line? A good tutorial is available at &lt;a href="http://swcarpentry.github.io/shell-novice/" target="_blank" rel="noopener">Software Carpentry&lt;/a>.&lt;/p>
&lt;img src="https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/hipergator-login.png" height="400">
&lt;h2 id="how-do-i-run-a-job">How do I run a job?&lt;/h2>
&lt;p>For large analyses, you should submit a &lt;em>batch script&lt;/em> that tells HiperGator how to run your code. Let&amp;rsquo;s look at an example and walk through it.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="cp">#!/bin/bash
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="cp">&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Job name and who to send updates to&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --job-name=&amp;lt;JOBNAME&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --mail-user=&amp;lt;EMAIL&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --mail-type=FAIL,END&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --account=ewhite&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --partition=hpg2-compute&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --qos=ewhite-b # Remove the `-b` if the script will take more than 4 days; see &amp;#34;bursting&amp;#34; below&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Where to put the outputs: %j expands into the job number (a unique identifier for this job)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --output my_job%j.out&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --error my_job%j.err&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Number of nodes to use&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --nodes=1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Number of tasks (usually translate to processor cores) to use: important! this means the number of mpi ranks used, useless if you are not using Rmpi)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --ntasks=1 &lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#number of cores to parallelize with:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --cpus-per-task=15&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --mem=16000&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Memory per cpu core. Default is megabytes, but units can be specified with M &lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># or G for megabytes or Gigabytes.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --mem-per-cpu=2G&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Job run time in [DAYS]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># HOURS:MINUTES:SECONDS&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># [DAYS] are optional, use when it is convenient&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --time=72:00:00&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Save some useful information to the &amp;#34;output&amp;#34; file&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">date&lt;span class="p">;&lt;/span>hostname&lt;span class="p">;&lt;/span>&lt;span class="nb">pwd&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Load R and run a script named my_R_script.R&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Rscript my_R_script.R
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If you are successful, you&amp;rsquo;ll get a small message stating your job ID. Once your batch job is running, you can freely log out (or even turn off your local machine) and wait for an email telling you that it finished. You can log back in to see the results later.&lt;/p>
&lt;h2 id="interactive-work">Interactive work&lt;/h2>
&lt;h3 id="cpu">CPU&lt;/h3>
&lt;p>If you are running into errors, need to install a package in your local directory, or want to download some files, you should use a development server. This is good practice and courteous to the other people who are logged into the main head node.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-gdscript3" data-lang="gdscript3">&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#load module made by hipergator admin&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ml&lt;/span> &lt;span class="n">ufrc&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#request a server for 3 hours with 2GB of memory&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">srundev&lt;/span> &lt;span class="o">--&lt;/span>&lt;span class="n">time&lt;/span> &lt;span class="mi">3&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">00&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">00&lt;/span> &lt;span class="o">--&lt;/span>&lt;span class="n">mem&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="n">GB&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="gpu">GPU&lt;/h3>
&lt;p>To test out work involving a GPU you need to explicitly request a development node associated with a GPU. For many GPU tasks you may want a meaningful amount of memory.&lt;/p>
&lt;p>In most cases you&amp;rsquo;ll use the default GPUs:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">srun --nodes&lt;span class="o">=&lt;/span>&lt;span class="m">1&lt;/span> --gpus&lt;span class="o">=&lt;/span>&lt;span class="m">1&lt;/span> --mem 20GB --cpus-per-task&lt;span class="o">=&lt;/span>&lt;span class="m">1&lt;/span> --pty -u bash -i
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>But if you need a lot of VRAM (&amp;gt;24 GB/GPU) you can use the B200 nodes:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">srun -p hpg-b200 --nodes&lt;span class="o">=&lt;/span>&lt;span class="m">1&lt;/span> --gpus&lt;span class="o">=&lt;/span>&lt;span class="m">1&lt;/span> --mem 50GB --cpus-per-task&lt;span class="o">=&lt;/span>&lt;span class="m">1&lt;/span> --pty -u bash -i
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To increase the number of GPUs increase the value of &lt;code>--gpus&lt;/code>, but you typically shouldn&amp;rsquo;t need more than 2 for interactive work and then only if you&amp;rsquo;re setting up multi-GPU testing.&lt;/p>
&lt;p>To increase the number of CPUs increase the value for &lt;code>--cpus-per-task&lt;/code>.&lt;/p>
&lt;h2 id="how-do-i-know-if-its-running">How do I know if its running?&lt;/h2>
&lt;p>Use &lt;code>squeue -u &amp;lt;username&amp;gt;&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">[b.weinstein@login3 ~]$ squeue -u b.weinstein
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> 25666905 gpu DeepFore b.weinst R 22:29:49 1 c36a-s7
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> 25672257 gpu DeepFore b.weinst R 21:07:19 1 c37a-s36
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The column labeled &amp;ldquo;ST&amp;rdquo; is your job status. You want this to be &lt;code>R&lt;/code> for &amp;ldquo;running&amp;rdquo;, but a job can spend a while as &lt;code>PD&lt;/code> (pending in the queue) before starting, especially if you request many cores. Sometimes (but not always) the NODELIST(REASON) column will explain why you&amp;rsquo;re still in the queue (e.g. QOSMEMLIMIT if you&amp;rsquo;re requesting too much memory).&lt;/p>
&lt;h2 id="how-do-i-get-my-data-on-to-hipergator">How do I get my data on to HiperGator?&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Often, the easiest way to transfer files to the server is using &lt;code>git clone&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If your files aren&amp;rsquo;t in a git repository, you can use SFTP or &lt;code>scp&lt;/code>. SFTP clients have graphical user interfaces that allow you to drag and drop files to the server.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If you use &lt;code>scp&lt;/code>, the syntax for copying one file from your user folder on the server to your local folder is &lt;code>scp MY_USER_NAME@gator.hpc.ufl.edu:/home/MY_USER_NAME/PATH_TO_MY_FILE MY_LOCAL_FILENAME&lt;/code>. Note the space between the remote path and your local filename. If you want to send a file in the other direction, switch the order of the local file and the remote location. You can copy whole folders with the &lt;code>-r&lt;/code> flag.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If your files are large you should use Globus. See &lt;a href="https://wiki.weecology.org/docs/computers-and-programming/globus/" target="_blank" rel="noopener">the wiki page on Globus&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://help.rc.ufl.edu/doc/Storage" target="_blank" rel="noopener">More information about storage&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="storage">Storage&lt;/h2>
&lt;p>There are a few locations to store files on HiperGator:&lt;/p>
&lt;p>&lt;code>/blue/ewhite/&lt;/code>&lt;br>
This is the primary space for storing large files; any large amounts of data generated by your programs should be written here.&lt;/p>
&lt;p>&lt;code>/orange/ewhite/&lt;/code>&lt;br>
This is another space to store large files. The total allocation here is much bigger than &lt;code>/blue&lt;/code>, but this storage is slower. If you have hundreds of GB of data that you want to keep but are not currently using, &lt;code>/orange&lt;/code> is where they should go.&lt;/p>
&lt;p>&lt;code>/home/your_username/&lt;/code>&lt;br>
Your home directory has 20GB storage space for your scripts, logs, etc. You should not be storing large amounts of data here.&lt;/p>
&lt;h3 id="local-scratch-storage">Local scratch storage&lt;/h3>
&lt;p>&lt;code>$TMPDIR&lt;/code>&lt;br>
&lt;code>/blue&lt;/code> may be a bad place to store temporary cache files, especially if your program generates hundreds of small (&amp;lt;1 MB) files. An alternative is a temporary directory set up by SLURM every time you run a job. This can be referenced with the environment variable &lt;code>$TMPDIR&lt;/code> or &lt;code>$SLURM_TMPDIR&lt;/code>. Read more about this here: &lt;a href="https://help.rc.ufl.edu/doc/Temporary_Directories" target="_blank" rel="noopener">Temporary Directories
&lt;/a>.&lt;/p>
&lt;p>This storage is available on each worker node (but not on login nodes) and does not persist after the job ends. For example, it can be referenced from python:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">import os
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">os.env[&amp;#34;TMPDIR&amp;#34;]
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To access this storage during a run, you can interactively ssh into the node and check out what&amp;rsquo;s there. The folder is named /scratch/local/{job_pid}.&lt;/p>
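&lt;p>In practice, a script can write its temporary cache files to this directory, with a fallback for runs on your own machine. A minimal python sketch (the cache filename is hypothetical):&lt;/p>

```python
import os
import tempfile

# SLURM sets $TMPDIR for each job; fall back to the system temp dir when
# running locally. The cache filename below is just a hypothetical example.
scratch = os.environ.get("TMPDIR", tempfile.gettempdir())
cache_file = os.path.join(scratch, "intermediate_results.csv")
print(cache_file)
```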
&lt;p>Note that for &lt;code>/blue&lt;/code> and &lt;code>/orange&lt;/code> if you are working on individual projects that are part of a larger effort you should work in a subdirectory &lt;code>/blue/ewhite/&amp;lt;your_username&amp;gt;/&lt;/code> and &lt;code>/orange/ewhite/&amp;lt;your_username&amp;gt;/&lt;/code>.&lt;/p>
&lt;p>Our current allocations as of July 2019&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Storage Type&lt;/th>
&lt;th>Location&lt;/th>
&lt;th>Quota&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>orange&lt;/td>
&lt;td>&lt;code>/orange&lt;/code>&lt;/td>
&lt;td>48 TB&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>blue&lt;/td>
&lt;td>&lt;code>/blue&lt;/code>&lt;/td>
&lt;td>25 TB&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>home&lt;/td>
&lt;td>&lt;code>/home/&amp;lt;your_username&amp;gt;&lt;/code>&lt;/td>
&lt;td>20 GB&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="status">Status&lt;/h2>
&lt;p>You can check the status of HiPerGator using the &lt;a href="https://metrics.rc.ufl.edu/" target="_blank" rel="noopener">Metrics Dashboard&lt;/a>&lt;/p>
&lt;h2 id="best-practices">Best Practices&lt;/h2>
&lt;p>Below is a collection of best practices from past Weecology users. These are not the only way to do things, just some useful approaches that worked for us.&lt;/p>
&lt;h3 id="r">R&lt;/h3>
&lt;h4 id="installing-packages">Installing packages&lt;/h4>
&lt;p>HiPerGator has a lot of packages installed already, but you might need to install your own, or you might want an updated version of an existing package.&lt;/p>
&lt;p>You can tell R to prefer your personal library of R packages over the ones maintained for Hipergator by adding &lt;code>.libPaths(c(&amp;quot;/home/YOUR_USER_NAME/R_libs&amp;quot;, .libPaths()))&lt;/code> to your &lt;code>.Rprofile&lt;/code>. If you don&amp;rsquo;t have one yet, you can create a new file with that name and put it in your home directory (e.g. in &lt;code>/home/harris.d/.Rprofile&lt;/code>).&lt;/p>
&lt;p>The end result will look like this.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-gdscript3" data-lang="gdscript3">&lt;span class="line">&lt;span class="cl">&lt;span class="p">[&lt;/span>&lt;span class="n">b&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">weinstein&lt;/span>&lt;span class="err">@&lt;/span>&lt;span class="n">dev1&lt;/span> &lt;span class="o">~&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">$&lt;/span> &lt;span class="n">cat&lt;/span> &lt;span class="o">~/.&lt;/span>&lt;span class="n">Rprofile&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">.&lt;/span>&lt;span class="n">libPaths&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">c&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;/home/b.weinstein/R_libs&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="o">.&lt;/span>&lt;span class="n">libPaths&lt;/span>&lt;span class="p">()))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;.Rprofile loaded&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>You will need to create the &lt;code>R_libs&lt;/code> directory using &lt;code>mkdir R_libs&lt;/code>.&lt;/p>
&lt;p>When you load R, you should see&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-gdscript3" data-lang="gdscript3">&lt;span class="line">&lt;span class="cl">&lt;span class="p">[&lt;/span>&lt;span class="n">b&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">weinstein&lt;/span>&lt;span class="err">@&lt;/span>&lt;span class="n">dev1&lt;/span> &lt;span class="o">~&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">$&lt;/span> &lt;span class="n">ml&lt;/span> &lt;span class="n">R&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">[&lt;/span>&lt;span class="n">b&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">weinstein&lt;/span>&lt;span class="err">@&lt;/span>&lt;span class="n">dev1&lt;/span> &lt;span class="o">~&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">$&lt;/span> &lt;span class="n">R&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">R&lt;/span> &lt;span class="n">version&lt;/span> &lt;span class="mf">3.5&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="mi">1&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">2018&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">07&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">02&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">--&lt;/span> &lt;span class="s2">&amp;#34;Feather Spray&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">Copyright&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">C&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="mi">2018&lt;/span> &lt;span class="n">The&lt;/span> &lt;span class="n">R&lt;/span> &lt;span class="n">Foundation&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">Statistical&lt;/span> &lt;span class="n">Computing&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">Platform&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">x86_64&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">pc&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">linux&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">gnu&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">64&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">bit&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">R&lt;/span> &lt;span class="n">is&lt;/span> &lt;span class="n">free&lt;/span> &lt;span class="n">software&lt;/span> &lt;span class="ow">and&lt;/span> &lt;span class="n">comes&lt;/span> &lt;span class="n">with&lt;/span> &lt;span class="n">ABSOLUTELY&lt;/span> &lt;span class="n">NO&lt;/span> &lt;span class="n">WARRANTY&lt;/span>&lt;span class="o">.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">You&lt;/span> &lt;span class="n">are&lt;/span> &lt;span class="n">welcome&lt;/span> &lt;span class="n">to&lt;/span> &lt;span class="n">redistribute&lt;/span> &lt;span class="n">it&lt;/span> &lt;span class="n">under&lt;/span> &lt;span class="n">certain&lt;/span> &lt;span class="n">conditions&lt;/span>&lt;span class="o">.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">Type&lt;/span> &lt;span class="s1">&amp;#39;license()&amp;#39;&lt;/span> &lt;span class="ow">or&lt;/span> &lt;span class="s1">&amp;#39;licence()&amp;#39;&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">distribution&lt;/span> &lt;span class="n">details&lt;/span>&lt;span class="o">.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Natural&lt;/span> &lt;span class="n">language&lt;/span> &lt;span class="n">support&lt;/span> &lt;span class="n">but&lt;/span> &lt;span class="n">running&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">an&lt;/span> &lt;span class="n">English&lt;/span> &lt;span class="n">locale&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">R&lt;/span> &lt;span class="n">is&lt;/span> &lt;span class="n">a&lt;/span> &lt;span class="n">collaborative&lt;/span> &lt;span class="n">project&lt;/span> &lt;span class="n">with&lt;/span> &lt;span class="n">many&lt;/span> &lt;span class="n">contributors&lt;/span>&lt;span class="o">.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">Type&lt;/span> &lt;span class="s1">&amp;#39;contributors()&amp;#39;&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">more&lt;/span> &lt;span class="n">information&lt;/span> &lt;span class="ow">and&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s1">&amp;#39;citation()&amp;#39;&lt;/span> &lt;span class="n">on&lt;/span> &lt;span class="n">how&lt;/span> &lt;span class="n">to&lt;/span> &lt;span class="n">cite&lt;/span> &lt;span class="n">R&lt;/span> &lt;span class="ow">or&lt;/span> &lt;span class="n">R&lt;/span> &lt;span class="n">packages&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">publications&lt;/span>&lt;span class="o">.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">Type&lt;/span> &lt;span class="s1">&amp;#39;demo()&amp;#39;&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">some&lt;/span> &lt;span class="n">demos&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;help()&amp;#39;&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">on&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">line&lt;/span> &lt;span class="n">help&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="ow">or&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s1">&amp;#39;help.start()&amp;#39;&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">an&lt;/span> &lt;span class="n">HTML&lt;/span> &lt;span class="n">browser&lt;/span> &lt;span class="n">interface&lt;/span> &lt;span class="n">to&lt;/span> &lt;span class="n">help&lt;/span>&lt;span class="o">.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">Type&lt;/span> &lt;span class="s1">&amp;#39;q()&amp;#39;&lt;/span> &lt;span class="n">to&lt;/span> &lt;span class="n">quit&lt;/span> &lt;span class="n">R&lt;/span>&lt;span class="o">.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="s2">&amp;#34;.Rprofile loaded&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Once this is set up, you can install or update packages the usual way (e.g. with &lt;code>install.packages&lt;/code> or &lt;code>devtools::install_github&lt;/code>).&lt;/p>
&lt;h2 id="re-writing-your-code-to-take-advantage-of-multiple-cores">Re-writing your code to take advantage of multiple cores.&lt;/h2>
&lt;p>By default R runs on a single processor, but most computers today have 4&amp;ndash;8. If you spread the work across multiple processors you can decrease run time significantly: a script that takes 1 hour can potentially finish in 30 minutes with 2 processors, or 15 minutes with 4. To make your scripts run across multiple processors, you&amp;rsquo;ll have to make some slight adjustments to your code.&lt;/p>
&lt;p>If your code uses &lt;code>lapply&lt;/code> to apply your main function to many items (e.g. fitting a model to each species), you can swap it for &lt;code>mclapply&lt;/code> from the &lt;code>parallel&lt;/code> package without making any substantial changes. For more details and advanced uses, here are two short tutorials that cover this:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://beckmw.wordpress.com/2014/01/21/a-brief-foray-into-parallel-processing-with-r/" target="_blank" rel="noopener">A brief foray into parallel processing with R&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="http://resbaz.github.io/r-intermediate-gapminder/19-foreach.html" target="_blank" rel="noopener">Software Carpentry Parallel Processing in R&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul>
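As a minimal sketch of the swap described above (the &lt;code>fit_model&lt;/code> function and species codes here are hypothetical placeholders, not part of any real analysis):

```r
# Sketch: replacing lapply with parallel::mclapply.
library(parallel)

# Hypothetical per-species model-fitting function.
fit_model <- function(species) {
  mean(rnorm(100))  # stand-in for real model-fitting code
}

species_list <- c("DM", "DO", "PP")

# Serial version:
# results <- lapply(species_list, fit_model)

# Parallel version: same call, plus mc.cores.
# Use only a few of the available cores (see notes below).
results <- mclapply(species_list, fit_model, mc.cores = 2)
```

Note that `mclapply` relies on forking, so on Windows it silently falls back to serial execution unless `mc.cores = 1`.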
&lt;p>Some quick notes:&lt;/p>
&lt;ul>
&lt;li>If your code already uses functions and for loops, it should be very easy to make it parallel, unless each pass through the loop depends on the outcome from previous passes.&lt;/li>
&lt;li>On your own computer, never set the number of processors to the maximum available. That takes away the processing power needed to run the operating system, browser, and other programs, and could potentially crash your computer. To test parallel code on my computer I set the number of processors to 2 (out of 8 available).&lt;/li>
&lt;/ul>
&lt;h3 id="batchtools">Batchtools&lt;/h3>
&lt;p>The R package &lt;code>batchtools&lt;/code> makes simple parallel job submission in R much easier: no more bash scripting, just submit a set of jobs by mapping a function over a list of inputs. Here is an example.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-gdscript3" data-lang="gdscript3">&lt;span class="line">&lt;span class="cl">&lt;span class="n">library&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">batchtools&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#Batchtools tmp registry&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">reg&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">makeRegistry&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">file&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">dir&lt;/span> &lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;.&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">reg&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;registry created&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">reg&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Toy function that just sleeps&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fun&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">function&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Sys&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sleep&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">t&lt;/span>&lt;span class="o">&amp;lt;-&lt;/span>&lt;span class="n">Sys&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">time&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">paste&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;worker&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">n&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;time is&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">t&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">t&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#batchtools submission&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">reg&lt;/span>&lt;span class="o">$&lt;/span>&lt;span class="n">cluster&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">functions&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">makeClusterFunctionsSlurm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">template&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;detection_template.tmpl&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">array&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">jobs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">TRUE&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">nodename&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;localhost&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">scheduler&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">latency&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">fs&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">latency&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">65&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ids&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">batchMap&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">fun&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">args&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">list&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">seq&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">)),&lt;/span> &lt;span class="n">reg&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">reg&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">testJob&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">ids&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,],&lt;/span>&lt;span class="n">reg&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">reg&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Set resources: enable memory measurement&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">res&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">list&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">walltime&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;2:00:00&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">memory&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;4GB&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Submit jobs using the currently configured cluster functions&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">submitJobs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">ids&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">resources&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">res&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">reg&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">reg&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">waitForJobs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">ids&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">reg&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">reg&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">getStatus&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">reg&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">reg&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">getJobTable&lt;/span>&lt;span class="p">())&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>with a SLURM template in the same directory.&lt;/p>
&lt;p>detection_template.tmpl&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="cp">#!/bin/bash
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="cp">&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## Modified from https://github.com/mllg/batchtools/blob/master/inst/templates/&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## Job Resource Interface Definition&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">##&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## ntasks [integer(1)]: Number of required tasks,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## Set larger than 1 if you want to further parallelize&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## with MPI within your job.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## ncpus [integer(1)]: Number of required cpus per task,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## Set larger than 1 if you want to further parallelize&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## with multicore/parallel within each task.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## walltime [integer(1)]: Walltime for this job, in seconds.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## Must be at least 60 seconds for Slurm to work properly.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## memory [integer(1)]: Memory in megabytes for each cpu.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## Must be at least 100 (when I tried lower values my&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## jobs did not start at all).&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">##&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## Default resources can be set in your .batchtools.conf.R by defining the variable&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## &amp;#39;default.resources&amp;#39; as a named list.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;lt;%
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># relative paths are not handled well by Slurm&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">log.file &lt;span class="o">=&lt;/span> fs::path_expand&lt;span class="o">(&lt;/span>log.file&lt;span class="o">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">if&lt;/span> &lt;span class="o">(&lt;/span>!&lt;span class="s2">&amp;#34;ncpus&amp;#34;&lt;/span> %in% names&lt;span class="o">(&lt;/span>resources&lt;span class="o">))&lt;/span> &lt;span class="o">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> resources&lt;span class="nv">$ncpus&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="m">1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">if&lt;/span> &lt;span class="o">(&lt;/span>!&lt;span class="s2">&amp;#34;walltime&amp;#34;&lt;/span> %in% names&lt;span class="o">(&lt;/span>resources&lt;span class="o">))&lt;/span> &lt;span class="o">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> resources&lt;span class="nv">$walltime&lt;/span>&amp;lt;-&lt;span class="s2">&amp;#34;1:00:00&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">if&lt;/span> &lt;span class="o">(&lt;/span>!&lt;span class="s2">&amp;#34;memory&amp;#34;&lt;/span> %in% names&lt;span class="o">(&lt;/span>resources&lt;span class="o">))&lt;/span> &lt;span class="o">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> resources&lt;span class="nv">$memory&lt;/span> &amp;lt;- &lt;span class="s2">&amp;#34;5GB&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">-%&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Job name and who to send updates to&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">#SBATCH --mail-user=benweinstein2010@gmail.com&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">#SBATCH --mail-type=FAIL,END&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">#SBATCH --account=ewhite&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">#SBATCH --partition=hpg2-compute&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">#SBATCH --qos=ewhite-b # Remove the `-b` if the script will take more than 4 days; see &amp;#34;bursting&amp;#34; below&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">#SBATCH --job-name=&amp;lt;%= job.name %&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">#SBATCH --output=&amp;lt;%= log.file %&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">#SBATCH --error=&amp;lt;%= log.file %&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">#SBATCH --time=&amp;lt;%= resources$walltime %&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">#SBATCH --ntasks=1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">#SBATCH --cpus-per-task=&amp;lt;%= resources$ncpus %&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">#SBATCH --mem-per-cpu=&amp;lt;%= resources$memory %&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;%&lt;span class="o">=&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="o">(&lt;/span>!is.null&lt;span class="o">(&lt;/span>resources&lt;span class="nv">$partition&lt;/span>&lt;span class="o">))&lt;/span> sprintf&lt;span class="o">(&lt;/span>paste0&lt;span class="o">(&lt;/span>&lt;span class="s2">&amp;#34;#SBATCH --partition=&amp;#39;&amp;#34;&lt;/span>, resources&lt;span class="nv">$partition&lt;/span>, &lt;span class="s2">&amp;#34;&amp;#39;&amp;#34;&lt;/span>&lt;span class="o">))&lt;/span> %&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;%&lt;span class="o">=&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="o">(&lt;/span>array.jobs&lt;span class="o">)&lt;/span> sprintf&lt;span class="o">(&lt;/span>&lt;span class="s2">&amp;#34;#SBATCH --array=1-%i&amp;#34;&lt;/span>, nrow&lt;span class="o">(&lt;/span>&lt;span class="nb">jobs&lt;/span>&lt;span class="o">))&lt;/span> &lt;span class="k">else&lt;/span> &lt;span class="s2">&amp;#34;&amp;#34;&lt;/span> %&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## Initialize work environment like&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## source /etc/profile&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## module add ...&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">source&lt;/span> /etc/profile
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## Export value of DEBUGME environment variable to the worker&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">export&lt;/span> &lt;span class="nv">DEBUGME&lt;/span>&lt;span class="o">=&lt;/span>&amp;lt;%&lt;span class="o">=&lt;/span> Sys.getenv&lt;span class="o">(&lt;/span>&lt;span class="s2">&amp;#34;DEBUGME&amp;#34;&lt;/span>&lt;span class="o">)&lt;/span> %&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;lt;%&lt;span class="o">=&lt;/span> sprintf&lt;span class="o">(&lt;/span>&lt;span class="s2">&amp;#34;export OMP_NUM_THREADS=%i&amp;#34;&lt;/span>, resources&lt;span class="nv">$omp&lt;/span>.threads&lt;span class="o">)&lt;/span> -%&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;lt;%&lt;span class="o">=&lt;/span> sprintf&lt;span class="o">(&lt;/span>&lt;span class="s2">&amp;#34;export OPENBLAS_NUM_THREADS=%i&amp;#34;&lt;/span>, resources&lt;span class="nv">$blas&lt;/span>.threads&lt;span class="o">)&lt;/span> -%&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;lt;%&lt;span class="o">=&lt;/span> sprintf&lt;span class="o">(&lt;/span>&lt;span class="s2">&amp;#34;export MKL_NUM_THREADS=%i&amp;#34;&lt;/span>, resources&lt;span class="nv">$blas&lt;/span>.threads&lt;span class="o">)&lt;/span> -%&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## Run R:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## we merge R output with stdout from SLURM, which gets then logged via --output option&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">echo&lt;/span> &lt;span class="s2">&amp;#34;submitting job&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">module load gcc/6.3.0 R gdal/2.2.1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#add to path&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Rscript -e &lt;span class="s1">&amp;#39;batchtools::doJobCollection(&amp;#34;&amp;lt;%= uri %&amp;gt;&amp;#34;)&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>yields&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">Submitting 10 jobs in 10 chunks using cluster functions &amp;#39;Slurm&amp;#39; ...
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">[1] TRUE
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Status for 10 jobs at 2019-05-21 14:43:03:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Submitted : 10 (100.0%)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> -- Queued : 0 ( 0.0%)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> -- Started : 10 (100.0%)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ---- Running : 0 ( 0.0%)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ---- Done : 10 (100.0%)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ---- Error : 0 ( 0.0%)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ---- Expired : 0 ( 0.0%)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="python">Python&lt;/h2>
&lt;h3 id="installing-python-packages">Installing Python Packages&lt;/h3>
&lt;ul>
&lt;li>ssh onto HiPerGator&lt;/li>
&lt;li>Download the conda installer: &lt;code>wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh&lt;/code>&lt;/li>
&lt;li>Run the installer: &lt;code>bash Miniconda3-latest-Linux-x86_64.sh&lt;/code>&lt;/li>
&lt;li>Answer &amp;lsquo;Yes&amp;rsquo; at the end of the install to have conda added to your &lt;code>.bashrc&lt;/code>&lt;/li>
&lt;li>Install packages using &lt;code>conda install package_name&lt;/code>&lt;/li>
&lt;li>Run &lt;code>conda activate&lt;/code> as the first step in your SLURM script&lt;/li>
&lt;/ul>
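The last step above can be sketched as a SLURM job script (a non-runnable config fragment; the job name, resource values, and &lt;code>my_script.py&lt;/code> are hypothetical placeholders):

```shell
#!/bin/bash
#SBATCH --job-name=conda_example
#SBATCH --time=01:00:00
#SBATCH --mem=2gb

# Activate the conda environment installed above
# before running any Python code.
conda activate

python my_script.py
```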
&lt;h3 id="dask-parallelization">Dask Parallelization&lt;/h3>
&lt;p>Dask jobs can be submitted to SLURM through the &lt;code>dask-jobqueue&lt;/code> package.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">#################
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> # Setup dask cluster
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> #################
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> from dask_jobqueue import SLURMCluster
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> from dask.distributed import Client, wait
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> num_workers = 10
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> #job args
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> extra_args=[
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;#34;--error=/home/b.weinstein/logs/dask-worker-%j.err&amp;#34;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;#34;--account=ewhite&amp;#34;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;#34;--output=/home/b.weinstein/logs/dask-worker-%j.out&amp;#34;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> cluster = SLURMCluster(
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> processes=1,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> queue=&amp;#39;hpg2-compute&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> cores=1,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> memory=&amp;#39;13GB&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> walltime=&amp;#39;24:00:00&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> job_extra=extra_args,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> local_directory=&amp;#34;/home/b.weinstein/logs/&amp;#34;, death_timeout=300)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> print(cluster.job_script())
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> cluster.adapt(minimum=num_workers, maximum=num_workers)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> dask_client = Client(cluster)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> #Start dask
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> dask_client.run_on_scheduler(start_tunnel)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> futures = dask_client.map(&amp;lt;function you want to parallelize&amp;gt;, &amp;lt;list of objects to run&amp;gt;, &amp;lt;additional args here&amp;gt;)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> wait(futures)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="connecting-through-jupyter-notebooks">Connecting through Jupyter notebooks.&lt;/h3>
&lt;p>It&amp;rsquo;s useful to be able to interact with HiPerGator without relying solely on the terminal. Especially when dealing with large datasets, instead of prototyping locally and then pushing to the cluster, we can connect directly using a Jupyter notebook.&lt;/p>
&lt;ul>
&lt;li>Log on to HiPerGator and request an interactive session.&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">srun --ntasks=1 --cpus-per-task=2 --mem=2gb -t 90 --pty bash -i
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Now we have 90 minutes to work directly on this development node.&lt;/p>
&lt;ul>
&lt;li>Create a Jupyter notebook&lt;/li>
&lt;/ul>
&lt;p>Load the python module&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-gdscript3" data-lang="gdscript3">&lt;span class="line">&lt;span class="cl">&lt;span class="n">module&lt;/span> &lt;span class="nb">load&lt;/span> &lt;span class="n">python&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Start the notebook and get your ssh tunnel&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">import socket
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">import subprocess
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">host = socket.gethostname()
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">proc = subprocess.Popen([&amp;#39;jupyter&amp;#39;, &amp;#39;lab&amp;#39;, &amp;#39;--ip&amp;#39;, host, &amp;#39;--no-browser&amp;#39;])
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">print(&amp;#34;ssh -N -L 8888:%s:8888 -l b.weinstein hpg2.rc.ufl.edu&amp;#34; % (host))
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If all went well it should look something like:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-gdscript3" data-lang="gdscript3">&lt;span class="line">&lt;span class="cl">&lt;span class="p">[&lt;/span>&lt;span class="n">b&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">weinstein&lt;/span>&lt;span class="err">@&lt;/span>&lt;span class="n">c27b&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">s2&lt;/span> &lt;span class="n">dask&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">jobqueue&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">$&lt;/span> &lt;span class="n">python&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">Python&lt;/span> &lt;span class="mf">3.6&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="mi">4&lt;/span> &lt;span class="o">|&lt;/span>&lt;span class="n">Anaconda&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Inc&lt;/span>&lt;span class="o">.|&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">default&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Jan&lt;/span> &lt;span class="mi">16&lt;/span> &lt;span class="mi">2018&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">18&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">19&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">[&lt;/span>&lt;span class="n">GCC&lt;/span> &lt;span class="mf">7.2&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="n">on&lt;/span> &lt;span class="n">linux&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">Type&lt;/span> &lt;span class="s2">&amp;#34;help&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;copyright&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;credits&amp;#34;&lt;/span> &lt;span class="ow">or&lt;/span> &lt;span class="s2">&amp;#34;license&amp;#34;&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">more&lt;/span> &lt;span class="n">information&lt;/span>&lt;span class="o">.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">&amp;gt;&amp;gt;&amp;gt;&lt;/span> &lt;span class="n">import&lt;/span> &lt;span class="n">socket&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">&amp;gt;&amp;gt;&amp;gt;&lt;/span> &lt;span class="n">import&lt;/span> &lt;span class="n">subprocess&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">&amp;gt;&amp;gt;&amp;gt;&lt;/span> &lt;span class="n">host&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">socket&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">gethostname&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">&amp;gt;&amp;gt;&amp;gt;&lt;/span> &lt;span class="n">proc&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">subprocess&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Popen&lt;/span>&lt;span class="p">([&lt;/span>&lt;span class="s1">&amp;#39;jupyter&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;lab&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;--ip&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">host&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;--no-browser&amp;#39;&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">&amp;gt;&amp;gt;&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">&amp;gt;&amp;gt;&amp;gt;&lt;/span> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;ssh -N -L 8888:&lt;/span>&lt;span class="si">%s&lt;/span>&lt;span class="s2">:8888 -l b.weinstein hpg2.rc.ufl.edu&amp;#34;&lt;/span> &lt;span class="o">%&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">host&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ssh&lt;/span> &lt;span class="o">-&lt;/span>&lt;span class="n">N&lt;/span> &lt;span class="o">-&lt;/span>&lt;span class="n">L&lt;/span> &lt;span class="mi">8888&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="n">c27b&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">s2&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ufhpc&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">8888&lt;/span> &lt;span class="o">-&lt;/span>&lt;span class="n">l&lt;/span> &lt;span class="n">b&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">weinstein&lt;/span> &lt;span class="n">hpg2&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rc&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ufl&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">edu&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">&amp;gt;&amp;gt;&amp;gt;&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="n">I&lt;/span> &lt;span class="mi">17&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">11&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mf">29.776&lt;/span> &lt;span class="n">LabApp&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="n">The&lt;/span> &lt;span class="n">port&lt;/span> &lt;span class="mi">8888&lt;/span> &lt;span class="n">is&lt;/span> &lt;span class="n">already&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">use&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">trying&lt;/span> &lt;span class="n">another&lt;/span> &lt;span class="n">port&lt;/span>&lt;span class="o">.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">[&lt;/span>&lt;span class="n">I&lt;/span> &lt;span class="mi">17&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">11&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mf">29.799&lt;/span> &lt;span class="n">LabApp&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="n">JupyterLab&lt;/span> &lt;span class="n">beta&lt;/span> &lt;span class="n">preview&lt;/span> &lt;span class="n">extension&lt;/span> &lt;span class="n">loaded&lt;/span> &lt;span class="n">from&lt;/span> &lt;span class="o">/&lt;/span>&lt;span class="n">home&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">b&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">weinstein&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">miniconda3&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">envs&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">pangeo&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">lib&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">python3&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="mi">6&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">site&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">packages&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">jupyterlab&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">[&lt;/span>&lt;span class="n">I&lt;/span> &lt;span class="mi">17&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">11&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mf">29.799&lt;/span> &lt;span class="n">LabApp&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="n">JupyterLab&lt;/span> &lt;span class="n">application&lt;/span> &lt;span class="n">directory&lt;/span> &lt;span class="n">is&lt;/span> &lt;span class="o">/&lt;/span>&lt;span class="n">home&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">b&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">weinstein&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">miniconda3&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">envs&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">pangeo&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">share&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">jupyter&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">lab&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">[&lt;/span>&lt;span class="n">I&lt;/span> &lt;span class="mi">17&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">11&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mf">29.809&lt;/span> &lt;span class="n">LabApp&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="n">Serving&lt;/span> &lt;span class="n">notebooks&lt;/span> &lt;span class="n">from&lt;/span> &lt;span class="n">local&lt;/span> &lt;span class="n">directory&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="o">/&lt;/span>&lt;span class="n">home&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">b&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">weinstein&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">dask&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">jobqueue&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">[&lt;/span>&lt;span class="n">I&lt;/span> &lt;span class="mi">17&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">11&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mf">29.809&lt;/span> &lt;span class="n">LabApp&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="mi">0&lt;/span> &lt;span class="n">active&lt;/span> &lt;span class="n">kernels&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">[&lt;/span>&lt;span class="n">I&lt;/span> &lt;span class="mi">17&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">11&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mf">29.809&lt;/span> &lt;span class="n">LabApp&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="n">The&lt;/span> &lt;span class="n">Jupyter&lt;/span> &lt;span class="n">Notebook&lt;/span> &lt;span class="n">is&lt;/span> &lt;span class="n">running&lt;/span> &lt;span class="n">at&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">[&lt;/span>&lt;span class="n">I&lt;/span> &lt;span class="mi">17&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">11&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mf">29.809&lt;/span> &lt;span class="n">LabApp&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="n">http&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="o">//&lt;/span>&lt;span class="n">c27b&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">s2&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ufhpc&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">8889&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="err">?&lt;/span>&lt;span class="n">token&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="n">c9c992a219e1e35ddd4cbe782d7f1f56c6680118b13053c&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">[&lt;/span>&lt;span class="n">I&lt;/span> &lt;span class="mi">17&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">11&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mf">29.809&lt;/span> &lt;span class="n">LabApp&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="n">Use&lt;/span> &lt;span class="ne">Control&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">C&lt;/span> &lt;span class="n">to&lt;/span> &lt;span class="n">stop&lt;/span> &lt;span class="n">this&lt;/span> &lt;span class="n">server&lt;/span> &lt;span class="ow">and&lt;/span> &lt;span class="n">shut&lt;/span> &lt;span class="n">down&lt;/span> &lt;span class="n">all&lt;/span> &lt;span class="n">kernels&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">twice&lt;/span> &lt;span class="n">to&lt;/span> &lt;span class="n">skip&lt;/span> &lt;span class="n">confirmation&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">[&lt;/span>&lt;span class="n">C&lt;/span> &lt;span class="mi">17&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">11&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mf">29.811&lt;/span> &lt;span class="n">LabApp&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Copy&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">paste&lt;/span> &lt;span class="n">this&lt;/span> &lt;span class="n">URL&lt;/span> &lt;span class="n">into&lt;/span> &lt;span class="n">your&lt;/span> &lt;span class="n">browser&lt;/span> &lt;span class="n">when&lt;/span> &lt;span class="n">you&lt;/span> &lt;span class="n">connect&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">the&lt;/span> &lt;span class="n">first&lt;/span> &lt;span class="n">time&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">to&lt;/span> &lt;span class="n">login&lt;/span> &lt;span class="n">with&lt;/span> &lt;span class="n">a&lt;/span> &lt;span class="n">token&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">http&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="o">//&lt;/span>&lt;span class="n">c27b&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">s2&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ufhpc&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">8889&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="err">?&lt;/span>&lt;span class="n">token&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="n">c9c992a219e1e35ddd4cbe782d7f1f56c6680118b13053c&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>See the line beginning with &lt;code>ssh&lt;/code>&amp;hellip; that is what we need to enter on our local laptop. It will ask for your login password:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">MacBook-Pro:~ ben$ ssh -N -L 8888:c27b-s2.ufhpc:8888 -l b.weinstein hpg2.rc.ufl.edu
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">b.weinstein@hpg2.rc.ufl.edu&amp;#39;s password:
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Don&amp;rsquo;t worry if it looks like it hangs, the tunnel is open! Go check it out.&lt;/p>
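&lt;p>For reference, here is what each piece of that tunnel command does (using the example node name from above):&lt;/p>

```shell
# -N : do not run a remote command; just forward ports
# -L 8888:c27b-s2.ufhpc:8888 : forward local port 8888 to port 8888 on the compute node
# -l b.weinstein : the HiPerGator username to log in as
ssh -N -L 8888:c27b-s2.ufhpc:8888 -l b.weinstein hpg2.rc.ufl.edu
```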
&lt;p>Open your browser and go to &lt;code>localhost:8888&lt;/code>.&lt;/p>
&lt;p>And voil&amp;agrave;, we are navigating HiPerGator from the comfort of our own laptop.&lt;/p>
&lt;h2 id="support">Support&lt;/h2>
&lt;p>&lt;a href="https://support.rc.ufl.edu/enter_bug.cgi" target="_blank" rel="noopener">Request Support&lt;/a>.&lt;/p>
&lt;p>HiPerGator staff are here to support you. Our grant money pays their salary. They are friendly and eager to help. When in doubt, just ask.&lt;/p>
&lt;p>For more information, see the &lt;a href="https://wiki.rc.ufl.edu/doc/Annotated_SLURM_Script" target="_blank" rel="noopener">annotated SLURM script&lt;/a>.&lt;/p>
&lt;h2 id="priority">Priority&lt;/h2>
&lt;p>The supercomputer is a shared resource, and the SLURM scheduler has to decide how to divvy it up. The method used to decide when it&amp;rsquo;s your turn to use a machine is based on a metric called &amp;ldquo;FairShare.&amp;rdquo; You can see your FairShare number by typing &lt;code>sshare -U&lt;/code> in your HiPerGator terminal. A FairShare of 0.5 means you&amp;rsquo;ve been using exactly your share. Larger numbers mean you can use more, while smaller numbers mean you&amp;rsquo;re using more than your share and will be given lower priority.&lt;/p>
&lt;p>Your &amp;ldquo;usage&amp;rdquo; number is an exponentially-weighted moving average of the resources you&amp;rsquo;ve consumed, with a half-life of two weeks. So if you&amp;rsquo;ve &amp;ldquo;bursted&amp;rdquo; at 10x for a while, it might take a few weeks before you&amp;rsquo;re given decent priority again.&lt;/p>
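&lt;p>As a toy illustration of that half-life (our own sketch, not RC&amp;rsquo;s exact accounting), the weight of usage from &lt;code>d&lt;/code> days ago is &lt;code>0.5^(d/14)&lt;/code>:&lt;/p>

```shell
# Toy illustration: weight of past usage after d days,
# given a two-week (14-day) half-life.
usage_weight() {
  awk -v d="$1" 'BEGIN { printf "%.3f\n", 0.5 ^ (d / 14) }'
}

usage_weight 0    # 1.000 -> usage from today counts fully
usage_weight 14   # 0.500 -> usage from one half-life ago counts half
usage_weight 28   # 0.250 -> usage from two half-lives ago counts a quarter
```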
&lt;p>A more comprehensive description of FairShare is available &lt;a href="https://slurm.schedmd.com/priority_multifactor.html#fairshare" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;h2 id="bursting">Bursting&lt;/h2>
&lt;p>If your jobs will take less than 4 days, you can use &amp;ldquo;burst&amp;rdquo; mode, which provides &lt;em>ten times&lt;/em> as many cores and &lt;em>ten times&lt;/em> as much memory as the default mode. If you cannot burst, just remove the &lt;code>-b&lt;/code> from the line above about &lt;code>qos&lt;/code>. Note that if you are using burst your jobs will automatically be killed after 96 hours if they haven&amp;rsquo;t already finished.&lt;/p>
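&lt;p>A minimal sketch of the relevant &lt;code>#SBATCH&lt;/code> lines, assuming our group&amp;rsquo;s qos names (&lt;code>ewhite&lt;/code> and &lt;code>ewhite-b&lt;/code>):&lt;/p>

```shell
#SBATCH --account=ewhite
#SBATCH --qos=ewhite-b       # burst qos: 10x cores and memory, 96-hour hard limit
#SBATCH --time=96:00:00
# To use the normal queue instead, drop the "-b":
# #SBATCH --qos=ewhite
```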
&lt;h2 id="current-usage">Current usage&lt;/h2>
&lt;p>To see the current usage by our group, as well as overall HiPerGator usage, use the command&lt;/p>
&lt;p>&lt;code>slurmInfo -pu&lt;/code>&lt;/p>
&lt;p>To see the total available resources use:&lt;/p>
&lt;p>&lt;code>sacctmgr show qos ewhite format=&amp;quot;Name%-16,GrpSubmit,MaxWall,GrpTres%-45&amp;quot;&lt;/code>&lt;br>
for the normal queue, and &lt;br>
&lt;code>sacctmgr show qos ewhite-b format=&amp;quot;Name%-16,GrpSubmit,MaxWall,GrpTres%-45&amp;quot;&lt;/code>&lt;br>
for the &amp;ldquo;burst&amp;rdquo; queue.&lt;/p>
&lt;p>If you want to look at the active resource use by a current job (i.e., how much of the requested resources are actually being used by the code):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">ml ufrc
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">jobhtop JOBID &lt;span class="c1"># Displays CPU and RAM resource usage for running jobs&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">jobnvtop JOBID &lt;span class="c1">#Displays GPU resource usage for running GPU jobs&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Both of these take a while to start (up to ~2 min).&lt;/p>
&lt;h2 id="partitions">Partitions&lt;/h2>
&lt;p>The HiPerGator consists of hundreds of servers. These are split into several &amp;ldquo;partitions&amp;rdquo; for various reasons.&lt;/p>
&lt;p>Most of the time you can just use the defaults, but there are a few special partitions available that you might need to request:&lt;/p>
&lt;ul>
&lt;li>&lt;code>hpg-b200&lt;/code> - This is the partition to use for GPU jobs with large VRAM needs (&amp;gt;24 GB/GPU). You need to have paid for GPU access, which our lab has.&lt;/li>
&lt;li>&lt;code>bigmem&lt;/code> - This partition consists of several servers with up to 1TB of memory. This is useful if you need a &lt;em>lot&lt;/em> of memory but still want to keep a script on a single server.&lt;/li>
&lt;li>&lt;code>hpg2-dev&lt;/code> - These are several servers for development purposes. When you use &lt;code>srundev&lt;/code> the jobs get sent here.&lt;/li>
&lt;/ul>
&lt;h3 id="selecting-a-partition">Selecting a partition&lt;/h3>
&lt;p>By default you&amp;rsquo;ll run jobs on the &lt;code>hpg2-compute&lt;/code> partition. If you want to change it, edit the &lt;code>--partition&lt;/code> line in your job script, or use the &lt;code>-p&lt;/code> flag with &lt;code>srun&lt;/code>.&lt;/p>
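&lt;p>For example (a sketch, using the partition names listed above):&lt;/p>

```shell
# In a job script, request the big-memory partition:
#SBATCH --partition=bigmem

# Or interactively, send a session to the development partition:
srun -p hpg2-dev --ntasks=1 --mem=4gb -t 60 --pty bash -i
```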
&lt;h2 id="cron-jobs---how-to-run-regularly-scheduled-jobs">Cron jobs - how to run regularly scheduled jobs&lt;/h2>
&lt;p>If you&amp;rsquo;re unfamiliar with cron jobs read &lt;a href="https://ostechnix.com/a-beginners-guide-to-cron-jobs/" target="_blank" rel="noopener">A Beginners Guide To Cron Jobs&lt;/a>.&lt;/p>
&lt;h3 id="ssh-to-daemon">SSH to daemon&lt;/h3>
&lt;p>Cron jobs on the HPC need to be set up on a special machine called &lt;code>daemon&lt;/code>.
You can ssh there from the HPC using &lt;code>ssh daemon&lt;/code>.
After that you can use the usual &lt;code>crontab -e&lt;/code> to set up your cron job.&lt;/p>
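&lt;p>For example, a crontab line (the script and log paths here are hypothetical) that runs a job every Monday at 06:00 and logs its output:&lt;/p>

```shell
# min  hour  day-of-month  month  day-of-week  command
0 6 * * 1  /bin/bash /home/USERNAME/scripts/weekly_job.sh >> /home/USERNAME/logs/weekly_job.log 2>&1
```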
&lt;h3 id="setting-path-for-cron-jobs">Setting PATH for cron jobs&lt;/h3>
&lt;p>For some reason the PATH isn&amp;rsquo;t properly set when running cron jobs, so you need to set it at the top of the crontab.
Add a line like the following, including any additional paths you need (e.g., the location of your conda environments).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">&lt;span class="nv">PATH&lt;/span>&lt;span class="o">=&lt;/span>/opt/slurm/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/bin:/home/USERNAME/bin:/blue/ewhite/USERNAME/miniconda3/bin/
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="check-if-running-on-hipergator">Check if running on HiPerGator&lt;/h2>
&lt;p>Sometimes it&amp;rsquo;s useful to have code execute one way on your local computer and another way on the HPC. For HiPerGator you can do this by checking the environment variable &lt;code>HOSTNAME&lt;/code> and looking to see if it contains &lt;code>ufhpc&lt;/code>. For example, in R:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">if (grepl(&amp;#34;ufhpc&amp;#34;, Sys.getenv(&amp;#34;HOSTNAME&amp;#34;))){
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> hipergator_run()
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">} else {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> local_run()
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If you are submitting via SLURM, the hostname will not contain &amp;ldquo;ufhpc&amp;rdquo; but the nodename will. So use this logic:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">nodename &amp;lt;- Sys.info()[&amp;#34;nodename&amp;#34;]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">if(grepl(&amp;#34;ufhpc&amp;#34;, nodename)) {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> print(&amp;#34;I know I am on SLURM!&amp;#34;)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="using-rstudio-on-the-hipergator">Using RStudio on HiPerGator&lt;/h2>
&lt;p>See the main RC wiki page on running GUI programs: &lt;a href="https://help.rc.ufl.edu/doc/GUI_Programs#Start_a_GUI_Session_on_HiPerGator" target="_blank" rel="noopener">https://help.rc.ufl.edu/doc/GUI_Programs#Start_a_GUI_Session_on_HiPerGator&lt;/a>&lt;/p>
&lt;h2 id="using-vscode-on-hipergator">Using VSCODE on hipergator&lt;/h2>
&lt;p>VS Code is a great development environment for many languages (Python, Java, Bash), and allows powerful integration with GitHub Copilot and other debugging tools. The HiPerGator docs &lt;a href="https://help.rc.ufl.edu/doc/SSH_Using_VS_Code" target="_blank" rel="noopener">hint&lt;/a> at how to do this, but don&amp;rsquo;t make it clear how to check out a node and develop with those resources. We can use &lt;a href="https://code.visualstudio.com/docs/remote/tunnels" target="_blank" rel="noopener">VS Code tunnels&lt;/a> to do this easily.&lt;/p>
&lt;p>Start by creating a SLURM script to get a development node. In this case, I want a GPU node.&lt;/p>
&lt;ol>
&lt;li>Write the job script:&lt;/li>
&lt;/ol>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="o">(&lt;/span>base&lt;span class="o">)&lt;/span> &lt;span class="o">[&lt;/span>b.weinstein@login11 ~&lt;span class="o">]&lt;/span>$ cat tunnel.sh
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#!/bin/bash&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --job-name=tunnel # Job name&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --mail-type=END # Mail events&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --mail-user=benweinstein2010@gmail.com # Where to send mail&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --account=ewhite&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --nodes=1 # Number of MPI ran&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --cpus-per-task=10&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --mem=70GB&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --time=12:00:00 #Time limit hrs:min:sec&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --output=/home/b.weinstein/logs/tunnel.out # Standard output and error log&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --error=/home/b.weinstein/logs/tunnel.err&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#SBATCH --gpus=1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">module load vscode
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">export&lt;/span> &lt;span class="nv">XDG_RUNTIME_DIR&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">SLURM_TMPDIR&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="p">;&lt;/span> code tunnel
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ol start="2">
&lt;li>Submit the job and view the logs&lt;/li>
&lt;/ol>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">(base) [b.weinstein@login11 ~]$ sbatch tunnel.sh
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">(base) [b.weinstein@login11 ~]$ cat /home/b.weinstein/logs/tunnel.out
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">*
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">* Visual Studio Code Server
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">*
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">* By using the software, you agree to
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">* the Visual Studio Code Server License Terms (https://aka.ms/vscode-server-license) and
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">* the Microsoft Privacy Statement (https://privacy.microsoft.com/en-US/privacystatement).
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">*
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">[2024-05-23 11:56:57] info Using Github for authentication, run `code tunnel user login --provider &amp;lt;provider&amp;gt;` option to change this.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">To grant access to the server, please log into https://github.com/login/device and use code 3390-CCD9
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ol start="3">
&lt;li>
&lt;p>Go to &lt;a href="https://github.com/login/device" target="_blank" rel="noopener">https://github.com/login/device&lt;/a> and authenticate with the code. You will see the hipergator logs change and successfully connect.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Go to your local VS Code instance, activate the &amp;lsquo;Remote Explorer&amp;rsquo; extension, and click on &amp;lsquo;Tunnels&amp;rsquo;. You will see the HiPerGator tunnel listed.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;img width="972" alt="image" src="https://github.com/weecology/wiki/assets/1208492/6fc30817-5e84-4350-bebf-be6318ebbc69">
&lt;p>Success! Now you are on the GPU node and can debug and run with those resources!&lt;/p></description></item><item><title>Lab Coding Guidelines</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/lab-coding-guidelines/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/lab-coding-guidelines/</guid><description>&lt;p>This is a summary of the 2018-10-10 lab meeting where we discussed coding practices. The &lt;a href="https://hackmd.io/K4ARCohDQN2YTj6Reb3Vtw" target="_blank" rel="noopener">full notes&lt;/a> are available online.&lt;/p>
&lt;h2 id="desired-qualities">Desired Qualities:&lt;/h2>
&lt;ul>
&lt;li>completeness
&lt;ul>
&lt;li>code for all the figures and analyses&lt;/li>
&lt;li>external dependencies documented&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>readable
&lt;ul>
&lt;li>uses good standards for code style&lt;/li>
&lt;li>comments help guide navigation to different parts&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>(re)usable
&lt;ul>
&lt;li>description of how to run everything, few changes needed to run it all&lt;/li>
&lt;li>examples for functions&lt;/li>
&lt;li>functions written to be flexible (e.g. less dependence on &amp;ldquo;magic numbers&amp;rdquo; and hard-coded parameter values)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="practices">Practices:&lt;/h2>
&lt;ul>
&lt;li>linter (for style)&lt;/li>
&lt;li>tests (to check functions, could also provide simple examples)&lt;/li>
&lt;li>pair programming checks for readability&lt;/li>
&lt;li>documentation&lt;/li>
&lt;li>refactoring core code into reusable packages&lt;/li>
&lt;li>containers&lt;/li>
&lt;/ul>
&lt;h2 id="action-items">Action Items:&lt;/h2>
&lt;ul>
&lt;li>training (and organizing it)&lt;/li>
&lt;li>toolchain (linter, tests, development tools)&lt;/li>
&lt;li>workflow and organization&lt;/li>
&lt;li>regular practices&lt;/li>
&lt;/ul></description></item><item><title>Lab Server - Serenity</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/serenity/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/serenity/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Some useful commands to help navigate and use Serenity&lt;/p>
&lt;p>To log into Serenity, you will need to be under the university network or use a &lt;a href="https://it.ufl.edu/ict/documentation/network-infrastructure/vpn/" target="_blank" rel="noopener">VPN&lt;/a> provided by the university.&lt;/p>
&lt;p>Log in:&lt;/p>
&lt;p>&lt;code>ssh username@serenity.ifas.ufl.edu&lt;/code>&lt;/p>
&lt;p>If you see the warning below&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>edit &lt;code>~/.ssh/known_hosts&lt;/code> and remove the line for serenity&lt;/p>
&lt;p>To change your password, use&lt;/p>
&lt;p>&lt;code>passwd username&lt;/code>&lt;/p>
&lt;p>OR&lt;/p>
&lt;p>&lt;code>sudo passwd username&lt;/code>&lt;/p>
&lt;p>&lt;a href="https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/ssh/#setup-instructions-serenity--hipergator--generic">To set up ssh keys&lt;/a>&lt;/p>
&lt;p>&lt;a href="https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/rstudio-on-serenity/">RStudio on serenity&lt;/a>&lt;/p></description></item><item><title>Editing the Lab Website</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/lab-website/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/lab-website/</guid><description>&lt;h2 id="cloning-the-website-repository">Cloning the Website Repository&lt;/h2>
&lt;p>First, fork the &lt;a href="https://github.com/weecology/website" target="_blank" rel="noopener">website repository&lt;/a> to your own GitHub account. Then, clone your fork to your local machine and create a new branch. It&amp;rsquo;s recommended to work on a branch rather than the default &lt;code>main&lt;/code> branch to keep your changes organized.&lt;/p>
&lt;h2 id="one-time-setup">One-Time Setup&lt;/h2>
&lt;p>To get started, follow these steps for the initial setup:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Install the &lt;code>blogdown&lt;/code> R package:&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="nf">install.packages&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;blogdown&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>&lt;strong>Install Hugo:&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="n">blogdown&lt;/span>&lt;span class="o">::&lt;/span>&lt;span class="nf">install_hugo&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>&lt;strong>Create a New R Project:&lt;/strong>
Open RStudio, navigate to your local copy of the website repository, and create a new R project in that directory.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="adding-new-members">Adding New Members&lt;/h2>
&lt;p>To add a new member to the website, follow these steps:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Create a Directory:&lt;/strong>
Create a directory named &lt;code>content/authors/first-lastname/&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Add an Avatar Image:&lt;/strong>
Place an avatar image file named &lt;code>avatar.jpg&lt;/code> in the directory: &lt;code>content/authors/first-lastname/avatar.jpg&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Create and Customize an &lt;code>_index.md&lt;/code> File:&lt;/strong>
Copy an existing &lt;code>_index.md&lt;/code> file from &lt;code>content/authors/&lt;/code> into the new directory and customize it with the new member&amp;rsquo;s details:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">cp content/authors/existing-author/_index.md content/authors/first-lastname/_index.md
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;/ol>
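&lt;p>For orientation, an author&amp;rsquo;s &lt;code>_index.md&lt;/code> front matter typically looks something like the sketch below. The exact field names are set by the theme, so treat these as illustrative and mirror the existing file you copied rather than this sketch:&lt;/p>

```yaml
---
# Illustrative front matter for content/authors/first-lastname/_index.md.
# Field names here are examples only -- copy them from an existing author file.
title: First Lastname
role: PhD Student
user_groups:
  - Grad Students
---
```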
&lt;h2 id="viewing-your-changes-locally">Viewing Your Changes Locally&lt;/h2>
&lt;p>To see your changes in real-time on your local computer, follow these steps:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Open the R Project:&lt;/strong>
Open your R project in RStudio.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Serve the Site:&lt;/strong>
Run the following command in the R console:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="n">blogdown&lt;/span>&lt;span class="o">::&lt;/span>&lt;span class="nf">serve_site&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>&lt;strong>View in Browser:&lt;/strong>
The site will appear in a pane within RStudio. Click the &lt;code>Show in new window&lt;/code> button to open it in your web browser. The site should update automatically as you save files, though it may take a few seconds to reflect changes.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="submitting-your-changes">Submitting Your Changes&lt;/h2>
&lt;p>When you&amp;rsquo;re ready to submit your changes, follow these steps:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Create a Pull Request (PR):&lt;/strong>
Push your branch to your GitHub fork and create a pull request to the original repository. This will allow others to review your changes.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Netlify Deployment Preview:&lt;/strong>
After creating the PR, Netlify will automatically provide a deploy preview for the website. Check this preview to ensure everything looks correct.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Merge the PR:&lt;/strong>
If everything works fine, the PR will be merged.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="after-merging">After Merging&lt;/h2>
&lt;p>Once the PR is merged, the website will rebuild automatically. This process may take a few minutes, so please be patient.
If you run into any issues, don&amp;rsquo;t hesitate to ask for help.&lt;/p></description></item><item><title>Parallelizing Code in R</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/parallelization-r/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/parallelization-r/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>The basic idea of parallelization is the running of computational tasks simultaneously, as opposed to sequentially (or &amp;ldquo;in parallel&amp;rdquo; as opposed to &amp;ldquo;in sequence&amp;rdquo;). To do this, the computer needs to be able to split code into pieces that can run independently and then be joined back together as if they had been run sequentially. The parts of the computer that run the pieces of code are processing units and are typically called &amp;ldquo;cores&amp;rdquo;.&lt;/p>
&lt;p>The &lt;code>doParallel&lt;/code> package is (as far as I&amp;rsquo;m currently aware) the most platform-general and robust parallel package available in R. There is more functionality for Unix-alikes in the &lt;code>parallel&lt;/code> package (see link below), but that doesn&amp;rsquo;t transfer to Windows machines.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">library(doParallel)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="cores-and-clusters">Cores and Clusters&lt;/h2>
&lt;p>At the outset, it&amp;rsquo;s important to know how many cores you have available, which the &lt;code>detectCores()&lt;/code> function returns. However, the value returned includes &amp;ldquo;hyperthreaded cores&amp;rdquo; (aka &amp;ldquo;logical cores&amp;rdquo;), which are additional logical subdivisions of the physical cores.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">detectCores()
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">detectCores(logical = FALSE)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Although hyperthreaded cores are available, they often do not show speed gains in R. However, as with everything parallel in R, it&amp;rsquo;s best to actually test that on the problem you&amp;rsquo;re working on, as there will be gains in certain situations.&lt;/p>
&lt;p>At the user-interface level, we only interact at a single point, yet the code needs to access multiple cores at once. To achieve this, we have the concept of a &amp;ldquo;cluster&amp;rdquo;, which represents a parallel set of copies of R running across multiple cores (including across physical sockets). We create a cluster using the &lt;code>makeCluster&lt;/code> function and register the backend using the &lt;code>registerDoParallel&lt;/code> function:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">ncores &amp;lt;- detectCores(logical = FALSE)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">cl &amp;lt;- makeCluster(floor(0.75 * ncores))
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">registerDoParallel(cl)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">stopCluster(cl)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Here, I&amp;rsquo;ve stopped the cluster explicitly using &lt;code>stopCluster&lt;/code>, which frees up the computational resources. The cluster will be automatically stopped when you end your R instance, but it&amp;rsquo;s a good habit to stop any clusters you make (even if you don&amp;rsquo;t register the backend).&lt;/p>
&lt;h2 id="foreach-and-dopar">foreach and %dopar%&lt;/h2>
&lt;p>Parallelization in &lt;code>doParallel&lt;/code> happens via the combination of the &lt;code>foreach&lt;/code> and &lt;code>%dopar%&lt;/code> operators in a fashion similar to &lt;code>for&lt;/code> loops. Rather than &lt;code>for(variable in values) {expression}&lt;/code>, we have &lt;code>foreach(variable = values, options) %dopar% {expression}&lt;/code>. Thus, the basic code block of a &lt;code>foreach&lt;/code> parallel loop is&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-gdscript3" data-lang="gdscript3">&lt;span class="line">&lt;span class="cl">&lt;span class="n">cl&lt;/span> &lt;span class="o">&amp;lt;-&lt;/span> &lt;span class="n">makeCluster&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">floor&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mf">0.75&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">ncores&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">registerDoParallel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">cl&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">foreach&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">variable&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">values&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">options&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">%&lt;/span>&lt;span class="n">dopar&lt;/span>&lt;span class="o">%&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">expression&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">stopCluster&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">cl&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Note that there can be multiple variable arguments (&lt;em>e.g.&lt;/em>, &lt;code>foreach(i = 1:10, j = 11:20) %dopar% {}&lt;/code>), but the variables are not recycled, so iteration runs only over the length of the shortest one.&lt;/p>
&lt;p>Each of the instances of a &lt;code>foreach&lt;/code> is referred to as a &amp;ldquo;task&amp;rdquo; (a key word, especially in terms of error checking, see link below).&lt;/p>
&lt;p>An important distinction between &lt;code>foreach&lt;/code> and &lt;code>for&lt;/code> is that &lt;code>foreach&lt;/code> returns a value (by default, a list), whereas &lt;code>for&lt;/code> causes side effects. Compare the following:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">cl &amp;lt;- makeCluster(floor(0.75 * ncores))
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">registerDoParallel(cl)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">out &amp;lt;- foreach(i = 1:10) %dopar% {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> i + rnorm(1, i, i)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">}
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">stopCluster(cl)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>vs.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">out &amp;lt;- vector(&amp;#34;list&amp;#34;, 10)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">for(i in 1:10){
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> out[[i]] &amp;lt;- i + rnorm(1, i, i)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>foreach&lt;/code> creates &lt;code>out&lt;/code>; &lt;code>for&lt;/code> modifies it.&lt;/p>
&lt;p>There are a handful of option arguments in &lt;code>foreach&lt;/code> that are really critical for anything beyond trivial computation:&lt;/p>
&lt;ul>
&lt;li>&lt;code>.packages&lt;/code> passes the libraries down to the cluster of Rs. If you don&amp;rsquo;t include package names and you use a package-derived function, it won&amp;rsquo;t work.&lt;/li>
&lt;li>&lt;code>.combine&lt;/code> dictates how the output is combined, defaulting to a list, but allowing lots of flexibility including mathematical operations. If needed, the &lt;code>.init&lt;/code> option allows you to initialize the value (for example with 0 or 1).&lt;/li>
&lt;li>&lt;code>.inorder&lt;/code> sets whether the combination happens based on the order of inputs. If the order isn&amp;rsquo;t important, setting this to &lt;code>FALSE&lt;/code> can give performance gains.&lt;/li>
&lt;li>&lt;code>.errorhandling&lt;/code> determines what happens when a task fails (see link below for using &lt;code>tryCatch&lt;/code> within tasks).&lt;/li>
&lt;/ul>
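&lt;p>As a minimal sketch of &lt;code>.combine&lt;/code> and &lt;code>.init&lt;/code>, the following sums the task results into a single number instead of collecting them in a list:&lt;/p>

```r
library(doParallel)

cl = makeCluster(2)
registerDoParallel(cl)

# .combine = "+" folds the ten task results together as they return;
# .init = 0 seeds the running total
total = foreach(i = 1:10, .combine = "+", .init = 0) %dopar% {
  i^2
}

stopCluster(cl)
total  # 385
```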
&lt;p>In addition to &lt;code>%dopar%&lt;/code>, there is a sequential operator for use with &lt;code>foreach&lt;/code>, and it is simply &lt;code>%do%&lt;/code>. Replacing &lt;code>%dopar%&lt;/code> with &lt;code>%do%&lt;/code> will cause the code to run in-order, as if you did not initiate a cluster. Similarly, you can run &lt;code>foreach&lt;/code> and &lt;code>%dopar%&lt;/code> without having made a cluster, but the code will be run sequentially.&lt;/p>
&lt;h2 id="nesting-foreach-loops">Nesting foreach loops&lt;/h2>
&lt;p>Nested loops can often be really powerful for computation. &lt;code>doParallel&lt;/code> has a special operator that combines two &lt;code>foreach&lt;/code> objects in a nested manner: &lt;code>%:%&lt;/code>. This operator causes the outermost &lt;code>foreach&lt;/code> to be evaluated over its variables&amp;rsquo; values, which are then passed down into the next innermost &lt;code>foreach&lt;/code>. That &lt;code>foreach&lt;/code> iterates over its variables&amp;rsquo; values for each value of the outer &lt;code>foreach&lt;/code> variables.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">cl &amp;lt;- makeCluster(floor(0.75 * ncores))
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">registerDoParallel(cl)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">out &amp;lt;- foreach(i = 1:10) %:%
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> foreach(j = 1:100) %dopar% {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> i * j^3
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> }
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">out
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">stopCluster(cl)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The &lt;code>.combine&lt;/code> option is really important to pay attention to with nested &lt;code>foreach&lt;/code> loops, as it will allow you to flexibly structure the data. The above version produces a list (length 10) of lists (length 100).&lt;/p>
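&lt;p>For instance (a sketch), giving each loop its own &lt;code>.combine&lt;/code> collapses the nested result into a matrix rather than nested lists:&lt;/p>

```r
library(doParallel)

cl = makeCluster(2)
registerDoParallel(cl)

# Inner .combine = c flattens each inner loop into a length-100 vector;
# outer .combine = rbind stacks those vectors into a 10 x 100 matrix
out = foreach(i = 1:10, .combine = rbind) %:%
  foreach(j = 1:100, .combine = c) %dopar% {
    i * j^3
  }

stopCluster(cl)
dim(out)  # 10 100
```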
&lt;h2 id="seeds-and-rngs">Seeds and RNGs&lt;/h2>
&lt;p>One of the downfalls of &lt;code>foreach&lt;/code> and &lt;code>%dopar%&lt;/code> is that the parallel runs aren&amp;rsquo;t reproducible in a simple way. There are ways to code up seed setting yourself, but it&amp;rsquo;s a little obtuse. Thankfully, the &lt;code>doRNG&lt;/code> package has you covered. There are a few ways to code up a reproducible parallel loop.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">library(doRNG)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">cl &amp;lt;- makeCluster(floor(0.75 * ncores))
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">registerDoParallel(cl)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># 1. .options.RNG
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">out1 &amp;lt;- foreach(i = 1:10, .options.RNG = 1234) %dorng% {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> rnorm(1, i, i^2)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">}
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># 2. set.seed
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">set.seed(1234)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">out2 &amp;lt;- foreach(i = 1:10) %dorng% {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> rnorm(1, i, i^2)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">}
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># 3. registerDoRNG (note that this doesn&amp;#39;t replace the registerDoParallel!)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">registerDoRNG(1234)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">out3 &amp;lt;- foreach(i = 1:10) %dorng% {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> rnorm(1, i, i^2)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">}
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">stopCluster(cl)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">identical(out1, out2)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">identical(out1, out3)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="speed-gains">Speed Gains&lt;/h2>
&lt;p>The degree to which your code speeds up when you parallelize it depends on a few factors: how many cores you have, whether the code can take advantage of hyperthreading, and the run time of the code itself. There is some computational and time overhead involved in setting up a cluster, distributing the work, and combining the results, so short tasks (seconds to half a minute) are usually faster to run in sequence. Once tasks creep up to the half-minute mark on a dual-core machine (or less time on a machine with more cores), the parallel runs will start to be faster. As the runtime of the computation increases, the relative gain from parallelizing increases, although it never reaches the full factor of the core count (&lt;em>i.e.&lt;/em>, 6 cores won&amp;rsquo;t ever get you to 1/6 the runtime&amp;hellip;probably more like 1/4 to 1/5) due to overhead. The other thing to keep in mind is that parallelizing often takes additional coding time, and the extra troubleshooting it can require will quickly eliminate the time gains.&lt;/p>
&lt;p>See references below for speed tests.&lt;/p>
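&lt;p>A quick way to feel this tradeoff yourself (a sketch; exact timings vary by machine) is to compare &lt;code>%do%&lt;/code> and &lt;code>%dopar%&lt;/code> on artificially slow tasks:&lt;/p>

```r
library(doParallel)

cl = makeCluster(2)
registerDoParallel(cl)

# Four half-second tasks: roughly 2 s in sequence, roughly 1 s
# (plus cluster overhead) spread across 2 workers
seq_time = system.time(foreach(i = 1:4) %do% Sys.sleep(0.5))["elapsed"]
par_time = system.time(foreach(i = 1:4) %dopar% Sys.sleep(0.5))["elapsed"]

stopCluster(cl)
c(sequential = seq_time, parallel = par_time)
```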
&lt;h2 id="references">References&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://beckmw.wordpress.com/2014/01/21/a-brief-foray-into-parallel-processing-with-r/" target="_blank" rel="noopener">example of foreach on Windows, including performance metrics&lt;/a>&lt;/li>
&lt;li>&lt;a href="http://resbaz.github.io/r-intermediate-gapminder/19-foreach.html" target="_blank" rel="noopener">software carpentry tutorial&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/tobigithub/R-parallel/wiki/R-parallel-Errors" target="_blank" rel="noopener">parallel error compendium&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://stackoverflow.com/questions/39262612/r-show-error-and-warning-messages-in-foreach-dopar" target="_blank" rel="noopener">setting up tryCatch inside foreach&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www3.nd.edu/~steve/computing_with_data/22_parallel/parallel_foreach.html" target="_blank" rel="noopener">real world example of parallelization utility: likelihood profiling&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://stochasticcoder.com/2016/01/12/r-using-doparallel-to-significantly-speedup-database-retrieval/" target="_blank" rel="noopener">real world example of parallelization utility: database querying&lt;/a>&lt;/li>
&lt;li>&lt;a href="http://www2.stat.duke.edu/~cr173/Sta523_Fa14/parallelization.html" target="_blank" rel="noopener">details of mc&amp;mdash; functions, useful for Unix-alikes&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/benporter/parallel-speed-test-R" target="_blank" rel="noopener">example parallel speed test&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Python - Package Documentation</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/python-package-documentation/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/python-package-documentation/</guid><description>&lt;h2 id="add-docs">Add docs&lt;/h2>
&lt;h3 id="install-sphinx-and-the-markdown-parser">Install sphinx and the markdown parser&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">pip install sphinx myst-parser
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="make-a-docs-directory-and-run-the-quick-start">Make a docs directory and run the quick-start&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">mkdir docs
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">cd&lt;/span> docs
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">sphinx-quickstart
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="add-markdown-and-autodoc-extensions-to-confpy">Add markdown and autodoc extensions to &lt;code>conf.py&lt;/code>&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">extensions&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;myst_parser&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s1">&amp;#39;sphinx.ext.autodoc&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="make-docs-files-and-edit-indexrst">Make docs files and edit index.rst&lt;/h3>
&lt;ul>
&lt;li>Edit &lt;code>index.rst&lt;/code> to include the information you want on the docs landing page&lt;/li>
&lt;li>Add markdown files for each separate page of docs&lt;/li>
&lt;li>In &lt;code>index.rst&lt;/code> add the names of the markdown files (without extensions) to the &lt;code>toctree&lt;/code> block. E.g., if we want to include the docs in &lt;code>installation.md&lt;/code> and &lt;code>getting-started.md&lt;/code>:&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">.. toctree::
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> :maxdepth: 2
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> :caption: Contents:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> installation
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> getting-started
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="setup-automatic-function-documentation">Setup automatic function documentation&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">sphinx-apidoc -f -o &lt;span class="nb">source&lt;/span> ../&amp;lt;package-name&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="change-the-theme">Change the theme&lt;/h3>
&lt;ul>
&lt;li>Pick a theme (we currently use either sphinx-rtd-theme or furo)&lt;/li>
&lt;li>Install it&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">pip install sphinx_rtd_theme
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>Change the theme value in &lt;code>conf.py&lt;/code>&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">html_theme&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;sphinx_rtd_theme&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>If using the &lt;code>sphinx_rtd_theme&lt;/code> also add it to &lt;code>extensions&lt;/code>&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">extensions&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;myst_parser&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s1">&amp;#39;sphinx.ext.autodoc&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s1">&amp;#39;sphinx_rtd_theme&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="build-the-docs">Build the docs&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">make html
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="build-docs-automatically-on-readthedocs">Build docs automatically on readthedocs&lt;/h2>
&lt;p>Your project should be in an online repository (e.g., GitHub or GitLab).&lt;/p>
&lt;h3 id="add-a-docs-requirementstxt-file">Add a docs requirements.txt file&lt;/h3>
&lt;p>In the docs directory add a &lt;code>requirements.txt&lt;/code> file that includes the extra packages required for building the docs.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">myst_parser
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">sphinx_rtd_theme
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="add-readthedocsyaml">Add .readthedocs.yaml&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">version&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;2&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">build&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">os&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;ubuntu-22.04&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">tools&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">python&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;3.10&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">python&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">install&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">method&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">pip&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">path&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">.&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="nt">requirements&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">docs/requirements.txt&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">sphinx&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">configuration&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">docs/conf.py&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="connect-your-githubgitlab-account-to-readthedocs">Connect your GitHub/GitLab account to readthedocs&lt;/h3>
&lt;ul>
&lt;li>Go to &lt;a href="https://readthedocs.com/" target="_blank" rel="noopener">https://readthedocs.com/&lt;/a>&lt;/li>
&lt;li>Click &amp;lsquo;Sign up&amp;rsquo;&lt;/li>
&lt;li>Choose &amp;lsquo;Read the Docs Community&amp;rsquo;&lt;/li>
&lt;li>Click &amp;lsquo;Sign up with GitHub&amp;rsquo; (or GitLab)&lt;/li>
&lt;li>Follow the instructions&lt;/li>
&lt;/ul>
&lt;h3 id="connect-your-project">Connect your project&lt;/h3>
&lt;ul>
&lt;li>Go to &lt;a href="https://readthedocs.org/dashboard/" target="_blank" rel="noopener">https://readthedocs.org/dashboard/&lt;/a>&lt;/li>
&lt;li>Click &amp;lsquo;Import a Project&amp;rsquo;&lt;/li>
&lt;li>If the project is listed, select it&lt;/li>
&lt;li>If it is not listed, click &amp;lsquo;Import Manually&amp;rsquo; and provide the requested information&lt;/li>
&lt;/ul>
&lt;h3 id="enable-builds-for-prs">Enable builds for PRs&lt;/h3>
&lt;p>If you want to check the doc builds from your PRs, enable this by:&lt;/p>
&lt;ol>
&lt;li>Go to the project dashboard on readthedocs&lt;/li>
&lt;li>Select &amp;lsquo;Admin&amp;rsquo;&lt;/li>
&lt;li>Select &amp;lsquo;Advanced Settings&amp;rsquo;&lt;/li>
&lt;li>Click &amp;lsquo;Build pull requests for this project&amp;rsquo;&lt;/li>
&lt;/ol></description></item><item><title>R Resources</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/r-resources/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/r-resources/</guid><description>&lt;p>R is a powerful tool that allows you to do a lot of things such as doing simple arithmetic calculations, organizing and analyzing your data, and even developing your own software. There are several, &lt;strong>FREE&lt;/strong> resources that you can use to learn R. Below is a &lt;em>non-exhaustive&lt;/em> list of tutorials, books, and videos that you can use (most of which are useful for data analysis), whether you&amp;rsquo;re just starting out or are more advanced in programming.&lt;/p>
&lt;h2 id="r-at-an-introductory-level">R at an Introductory Level&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://swirlstats.com/students.html" target="_blank" rel="noopener">swirlR&lt;/a> is an R package that allows you to learn R in the console at your own pace&lt;/li>
&lt;li>&lt;a href="https://tinystats.github.io/teacups-giraffes-and-statistics/index.html" target="_blank" rel="noopener">Teacups, Giraffes, and Statistics&lt;/a> is an interactive website that contains modules useful for learning statistics and R&lt;/li>
&lt;li>&lt;a href="https://rforcats.net/" target="_blank" rel="noopener">R for cats&lt;/a> is an introductory guide to the use of R with cat photos (this is a bonus!)&lt;/li>
&lt;li>&lt;a href="https://paulvanderlaken.files.wordpress.com/2017/08/r_in_a_nutshell.pdf" target="_blank" rel="noopener">R in a Nutshell&lt;/a> is a book that gives you a concise overview of the different things you can do in R&lt;/li>
&lt;li>&lt;a href="https://www.infoworld.com/article/3411819/do-more-with-r-video-tutorials.html" target="_blank" rel="noopener">Do More with R&lt;/a> is a website listing video tutorials on specific topics in R (most videos are said to be &amp;lt; 10 minutes in length)&lt;/li>
&lt;li>&lt;a href="https://datacarpentry.org/semester-biology/readings/R-intro/" target="_blank" rel="noopener">Data Carpentry for Biologists&lt;/a> is derived from the semester-long course taught By Dr. Ethan White that covers basic functions/use of R&lt;/li>
&lt;/ul>
&lt;h2 id="intermediate-or-advanced-use-of-r">Intermediate or Advanced use of R&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://rstats.wtf/" target="_blank" rel="noopener">What They Forgot to Teach You about R&lt;/a> is a set of tips on doing effective, reproducible data analysis in R (and less about actual programming)&lt;/li>
&lt;li>&lt;a href="https://happygitwithr.com/" target="_blank" rel="noopener">Happy Git and GitHub for the UseR&lt;/a> is an instructional guide to using Git, GitHub, and R, which are all tools we use in the lab&lt;/li>
&lt;li>&lt;a href="https://mathstat.slu.edu/~speegle/_book/RData.html" target="_blank" rel="noopener">Foundations of Statistics with R&lt;/a> is a course book for learning probability and statistics using R&lt;/li>
&lt;li>&lt;a href="https://csgillespie.github.io/efficientR/introduction.html" target="_blank" rel="noopener">Efficient R Programming&lt;/a> is another book that would help you increase your algorithmic and programming efficiency when using R&lt;/li>
&lt;li>&lt;a href="https://adv-r.hadley.nz/preface.html" target="_blank" rel="noopener">Advanced R&lt;/a> is the second edition of a book that teaches you more advanced programming skills in R (it makes use of a new package called &lt;a href="https://rlang.r-lib.org/" target="_blank" rel="noopener">rlang&lt;/a>, which is an interface to low-level data structures and operations&lt;/li>
&lt;li>&lt;a href="https://ms.mcmaster.ca/~bolker/emdbook/book.pdf" target="_blank" rel="noopener">Ecological Models and Data in R&lt;/a> is a book about building models implemented in a frequentist or Bayesian framework to answer ecological questions&lt;/li>
&lt;/ul>
&lt;h2 id="getting-help">Getting Help&lt;/h2>
&lt;p>There are times when your code might not seem to work. Apart from the brilliant lab members who you can easily reach out to through Slack for help, you can post questions/concerns in different online communities such as:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://community.rstudio.com/" target="_blank" rel="noopener">RStudio Community&lt;/a> for R-specific questions&lt;/li>
&lt;li>&lt;a href="https://stackoverflow.com/" target="_blank" rel="noopener">Stack Overflow&lt;/a> for programming questions&lt;/li>
&lt;/ul></description></item><item><title>Using Git From RStudio</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/rstudio-git-integration/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/rstudio-git-integration/</guid><description>&lt;p>Just a brief introduction on how to easily set up a Git/GitHub repo in RStudio:&lt;/p>
&lt;ul>
&lt;li>Make sure Git is set up on your computer and in R (ask Ethan for help on this one)&lt;/li>
&lt;li>Sign into your GitHub account&lt;/li>
&lt;li>Click on the plus sign in the top right-hand corner (near your profile picture) and select &amp;ldquo;New repository&amp;rdquo;. You can choose to make your new repo through your account or the Weecology one on the next page.&lt;/li>
&lt;li>Name your repository, add a quick description if desired, select Public or Private (most often Public) and &lt;em>be sure to check the box to initialize a README document&lt;/em>. Click &lt;em>Create Repository.&lt;/em>&lt;/li>
&lt;li>You&amp;rsquo;ll now be on the page for your new repo. Find the green &lt;strong>Code&lt;/strong> button, which reveals either an HTTPS address or something that looks like an email address (SSH). Make sure the selector next to the text box says HTTPS rather than SSH. Then click the clipboard icon to the right of the text box to copy the address to your clipboard.&lt;/li>
&lt;li>Now you&amp;rsquo;re done with GitHub. Open up RStudio. Go to &lt;em>File&lt;/em> and select &lt;em>New Project&lt;/em>. If Git is properly set up in RStudio, you should have an option called &lt;em>Version Control&lt;/em>. Click on that option.&lt;/li>
&lt;li>Select the &lt;em>Git&lt;/em> option from the next menu.&lt;/li>
&lt;li>Paste the HTTPS address into the &lt;em>Repository URL&lt;/em> box. It will auto-fill the &lt;em>Project directory name&lt;/em>. Make sure the project is being created in the appropriate folder.&lt;/li>
&lt;li>Click &lt;em>Create Project&lt;/em> and voila!&lt;/li>
&lt;li>Now you should see a &lt;em>Git&lt;/em> tab next to the &lt;em>Environment&lt;/em> and &lt;em>History&lt;/em> tabs. From there, you can make commits, push (green up arrow), and pull (blue down arrow).&lt;/li>
&lt;/ul></description></item><item><title>RStudio on Serenity</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/rstudio-on-serenity/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/rstudio-on-serenity/</guid><description>&lt;p>We have a lab server which is fairly powerful. It has 32GB of RAM and a 2 core Intel Xeon processor. This is faster and has more ram than any of our laptops. It&amp;rsquo;s good for doing things that may take too long on a desktop/laptop or as a test bed for running on the hipergator.&lt;/p>
&lt;p>Note you must be on campus to use serenity, or logged in via the VPN.&lt;/p>
&lt;h2 id="using-rstudio">Using RStudio&lt;/h2>
&lt;p>Open a browser, go to &lt;a href="http://serenity.ifas.ufl.edu:8787" target="_blank" rel="noopener">http://serenity.ifas.ufl.edu:8787&lt;/a>, and log in with your serenity username and password. Note that this only works on campus; for off-campus access, see below.&lt;br>
This runs exactly like RStudio on your desktop. You&amp;rsquo;ll have to re-install any packages that you need.
This works great for code that takes a long time to run. You can start something, then close the browser, and RStudio will keep running on serenity.&lt;/p>
&lt;h2 id="logging-in-from-off-campus">Logging in from off campus&lt;/h2>
&lt;p>There are two options to login from off campus. The first is to use the UFL VPN. Once connected you can go to the address above. The second is to access it via the hipergator login node using the steps below.&lt;/p>
&lt;h3 id="on-windows">On Windows&lt;/h3>
&lt;p>Download the putty ssh client &lt;a href="https://the.earth.li/~sgtatham/putty/latest/w64/putty.exe" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>Open putty and make a connection to the hipergator login node. Put &lt;code>hpg.rc.ufl.edu&lt;/code> into the Host Name box. Put &lt;code>hipergator&lt;/code> into the Saved Sessions box and click Save to save this setup. Then in the menu go to Connection -&amp;gt; SSH -&amp;gt; Auth -&amp;gt; Tunnels. Put &lt;code>8787&lt;/code> in the source port and &lt;code>serenity.ifas.ufl.edu:8787&lt;/code> in the Destination and click add.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://i.imgur.com/gEtmuCn.png" alt="" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Then, on the left side, go back to the Session tab and click Save again. Click Open to connect and enter your username and password. Once connected, open a browser and go to http://localhost:8787.&lt;/p>
&lt;h3 id="on-mac-or-linux">On mac or linux&lt;/h3>
&lt;p>SSH to the hipergator using the following command:&lt;/p>
&lt;p>&lt;code>ssh &amp;lt;username&amp;gt;@hpg.rc.ufl.edu -L 8787:serenity.ifas.ufl.edu:8787&lt;/code>&lt;/p>
&lt;p>Once logged in, open a browser and go to http://localhost:8787.&lt;/p>
&lt;h2 id="getting-a-login">Getting a login&lt;/h2>
&lt;p>Ask Shawn or Henry for a username and initial password&lt;/p></description></item><item><title>Computer Security &amp; Privacy</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/computer-security-privacy/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/computer-security-privacy/</guid><description>&lt;h2 id="password-managers">Password Managers&lt;/h2>
&lt;p>A password manager keeps track of the various passwords you use for different websites. (And the reason to use a different password for each account is that if your password at one place is compromised, it does not affect your other accounts.)&lt;/p>
&lt;ul>
&lt;li>Commonly recommended password managers are &lt;a href="https://www.lastpass.com" target="_blank" rel="noopener">Lastpass&lt;/a> and &lt;a href="https://1password.com" target="_blank" rel="noopener">1password&lt;/a>.&lt;/li>
&lt;li>Lastpass has a free option, but it only syncs passwords across similar device types (e.g. between computers, or between smartphones), not between a computer and a smartphone. Paid plans for Lastpass start at $2/month.&lt;/li>
&lt;li>1password seems to have a nicer interface, but plans start at $2.99/month.&lt;/li>
&lt;li>Password managers are most commonly used as browser plug-ins that can remember passwords, generate new random passwords, and autofill them.&lt;/li>
&lt;li>Both Lastpass and 1password have secure notes, which let you keep track of answers to security questions, useful if you want to give random responses rather than real answers (like the middle school you went to) that anyone can look up.&lt;/li>
&lt;li>Apple&amp;rsquo;s iCloud Keychain is another option if you have all Apple devices. Here&amp;rsquo;s a &lt;a href="https://www.macworld.com/article/3060630/ios/why-not-pick-keychain-instead-of-1password-or-lastpass.html" target="_blank" rel="noopener">discussion&lt;/a> on the differences with third-party password managers.&lt;/li>
&lt;li>Since your email usually allows you to reset passwords, it can be preferable to NOT put your email password into your password manager, and instead to secure your email with another strong passphrase.&lt;/li>
&lt;/ul>
&lt;h2 id="passphrases">Passphrases&lt;/h2>
&lt;p>Because password managers are secured by a single master passphrase (that unlocks all your passwords), it is recommended that you use a strong passphrase.&lt;/p>
&lt;ul>
&lt;li>One method is to string together random words, possibly with capitalization and special characters added.&lt;/li>
&lt;li>Rather than use a website for this (who knows if it is truly random or recording the output it gives you), you can use &lt;a href="https://www.eff.org/dice" target="_blank" rel="noopener">dice and a word list&lt;/a>.&lt;/li>
&lt;li>While trying to memorize the passphrase, you might consider having a written copy that you keep somewhere safe.&lt;/li>
&lt;/ul>
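The dice-and-word-list idea can be sketched in a few lines of Python using the cryptographically secure `secrets` module; the short word list here is only a placeholder, a real passphrase should draw from a large standard list such as EFF's:

```python
import secrets

# Placeholder word list; a real one (e.g. the EFF list) has thousands of words.
WORDS = ["correct", "horse", "battery", "staple", "orchid", "granite"]

def passphrase(n_words=5, sep="-"):
    """Join randomly chosen words with a separator, using a secure RNG."""
    return sep.join(secrets.choice(WORDS) for _ in range(n_words))

print(passphrase())
```

The strength of such a passphrase comes from the number of words and the size of the list, not from the words themselves.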
&lt;h2 id="two-factor-authentication">Two-Factor Authentication&lt;/h2>
&lt;p>Two-factor authentication (commonly 2FA or MFA for multi-factor) means identifying yourself using components from at least two different categories:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>something you know (e.g. a password)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>something you have (e.g. a key or phone)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>something you are (e.g. fingerprints, biometrics).
Thus, if someone else has your password, they still can&amp;rsquo;t access your account without also having, e.g., your phone.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Many large sites (e.g. banks, email) with security concerns have 2FA as an option, but it might need to be turned on. For example, here are the instructions for &lt;a href="https://www.google.com/landing/2step/" target="_blank" rel="noopener">gmail&lt;/a>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Common implementations are to send a code by email or text when you log in.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Some places also support apps or devices that generate rolling codes (e.g. &lt;a href="https://support.google.com/accounts/answer/1066447?co=GENIE.Platform%3DAndroid&amp;amp;hl=en" target="_blank" rel="noopener">Google Authenticator&lt;/a>). These codes rotate regularly, for example, every minute. Once synced with a website during 2FA setup, you can use the rolling code to authenticate.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Hardware tokens are also possible, for example &lt;a href="https://www.yubico.com/products/yubikey-hardware/" target="_blank" rel="noopener">Yubikeys&lt;/a>. You will want to check compatibility with services. For instance, the basic U2F key will not work as 2FA for Lastpass, and you also need a paid Lastpass plan.&lt;/p>
&lt;/li>
&lt;/ul>
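For the curious, the rolling codes described above are typically TOTP (RFC 6238): an HMAC of the current 30-second time step, truncated to six digits. A minimal sketch in Python's standard library:

```python
import hashlib
import hmac
import struct
import time

def totp(secret: bytes, t=None, step=30, digits=6):
    """Rolling code: HMAC-SHA1 of the current time step, truncated per RFC 6238."""
    counter = int((time.time() if t is None else t) // step)
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] % 16  # dynamic truncation: low nibble of last byte
    code = struct.unpack(">I", digest[offset:offset + 4])[0] % (2**31)
    return str(code % 10**digits).zfill(digits)

# RFC 6238 test vector: this secret at t=59 yields the code 287082
print(totp(b"12345678901234567890", t=59))
```

Both the app and the server compute the same code from a shared secret and the clock, which is why the codes agree without any network connection.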
&lt;h2 id="misc">Misc.&lt;/h2>
&lt;ul>
&lt;li>EFF has a &lt;a href="https://www.eff.org/https-everywhere" target="_blank" rel="noopener">browser extension&lt;/a> to use encrypted HTTPS when possible.&lt;/li>
&lt;li>EFF also has a &lt;a href="https://www.eff.org/privacybadger" target="_blank" rel="noopener">browser extension&lt;/a> to block ads from tracking you across websites.&lt;/li>
&lt;li>Don&amp;rsquo;t plug in unknown USB devices to your computer or your USB devices into unknown ports. (&lt;a href="https://www.reuters.com/article/us-nuclearpower-cyber-germany/german-nuclear-plant-infected-with-computer-viruses-operator-says-idUSKCN0XN2OS" target="_blank" rel="noopener">e.g. &amp;#x1f631;&lt;/a>):&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>As an example, Hypponen said he had recently spoken to a European aircraft maker that said it cleans the cockpits of its planes every week of malware designed for Android phones. The malware spread to the planes only because factory employees were charging their phones with the USB port in the cockpit.&lt;/p>
&lt;p>Because the plane runs a different operating system, nothing would befall it. But it would pass the virus on to other devices that plugged into the charger.&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>You can use incognito / private mode when browsing to bypass some paywalls (e.g. NYT&amp;rsquo;s 10-article/month limit)&lt;/li>
&lt;li>&lt;a href="https://objective-see.com/products/oversight.html" target="_blank" rel="noopener">OverSight&lt;/a> comes recommended by some folks as a way to notify about microphone and camera usage on Macs.&lt;/li>
&lt;/ul></description></item><item><title>Software Testing in R</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/software-testing-r/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/software-testing-r/</guid><description>&lt;p>&lt;a href="https://github.com/weecology/rtestpackage" target="_blank" rel="noopener">https://github.com/weecology/rtestpackage&lt;/a>&lt;/p>
&lt;p>This repo contains a description of how to set up the test environment for R.&lt;/p>
&lt;p>There are additional references from which the information was derived.&lt;/p></description></item><item><title>Statistics for Software Use</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/statistics-for-software-downloads/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/statistics-for-software-downloads/</guid><description>&lt;p>It can be helpful to know how frequently your software is being downloaded to assess its use and report on its impact. Below are instructions for how to do this for different languages. Keep in mind that downloads can be influenced by automated testing systems if they install the software from the central repository.&lt;/p>
&lt;h2 id="r">R&lt;/h2>
&lt;p>These instructions get downloads from the cloud CRAN mirror. This is a minimum estimate of total downloads.&lt;/p>
&lt;ul>
&lt;li>Install the &lt;a href="https://github.com/r-hub/cranlogs" target="_blank" rel="noopener">&lt;code>cranlogs&lt;/code> package&lt;/a> &lt;code>install.packages(&amp;quot;cranlogs&amp;quot;)&lt;/code>&lt;/li>
&lt;li>Run the &lt;code>cran_downloads&lt;/code> function with your package name and date ranges if desired&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-r" data-lang="r">&lt;span class="line">&lt;span class="cl">&lt;span class="n">downloads&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">cranlogs&lt;/span>&lt;span class="o">::&lt;/span>&lt;span class="nf">cran_downloads&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">packages&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nf">c&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s">&amp;#34;portalr&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">from&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;2020-02-26&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">to&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;2020-08-16&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">total_downloads&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nf">sum&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">downloads&lt;/span>&lt;span class="o">$&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="python">Python&lt;/h2>
&lt;p>Python packages are often distributed using both PyPI and conda so you might need to get download statistics for both.&lt;/p>
&lt;h3 id="pypi">PyPI&lt;/h3>
&lt;h4 id="using-the-pypi-stats-websitehttpspypistatsorg">Using the &lt;a href="https://pypistats.org/" target="_blank" rel="noopener">PyPI Stats website&lt;/a>&lt;/h4>
&lt;ul>
&lt;li>PyPI Stats is easy to use and provides the last 6 months of data.&lt;/li>
&lt;li>E.g., &lt;a href="https://pypistats.org/packages/deepforest" target="_blank" rel="noopener">https://pypistats.org/packages/deepforest&lt;/a>&lt;/li>
&lt;/ul>
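PyPI Stats also exposes a JSON API that can be scripted; the endpoint path in this sketch is our assumption based on the site's public API documentation, so verify it before relying on it:

```python
import json
from urllib.request import urlopen

def stats_url(package):
    # Assumed endpoint: recent download counts for one package
    return f"https://pypistats.org/api/packages/{package}/recent"

def recent_downloads(package):
    # Expected to return a dict along the lines of
    # {"last_day": ..., "last_week": ..., "last_month": ...}
    with urlopen(stats_url(package)) as resp:
        return json.load(resp)["data"]
```

E.g., `recent_downloads("deepforest")` would fetch the last day/week/month counts for deepforest.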
&lt;h4 id="using-google-bigquery">Using Google BigQuery&lt;/h4>
&lt;ul>
&lt;li>Set up a Google BigQuery account&lt;/li>
&lt;li>Run a version of this query, with additional modifications as necessary&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-Python" data-lang="Python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">SELECT&lt;/span> &lt;span class="n">COUNT&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="n">downloads&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">FROM&lt;/span> &lt;span class="err">`&lt;/span>&lt;span class="n">the&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">psf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">pypi&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">downloads2020&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="err">`&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">WHERE&lt;/span> &lt;span class="n">file&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">project&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;retriever&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="conda">Conda&lt;/h3>
&lt;ul>
&lt;li>Install the &lt;a href="https://www.anaconda.com/blog/get-python-package-download-statistics-with-condastats" target="_blank" rel="noopener">&lt;code>condastats&lt;/code> package&lt;/a> &lt;code>conda install -c conda-forge condastats&lt;/code>&lt;/li>
&lt;li>From the command line run a version of the following command&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">condastats overall retriever --start_month 2020-01 --end_month 2020-08
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div></description></item><item><title>SSH</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/ssh/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/ssh/</guid><description>&lt;h2 id="what-is-ssh">What is SSH?&lt;/h2>
&lt;p>SSH (short for &amp;ldquo;secure shell&amp;rdquo;) is a computer protocol for encrypting access to computers over a network. Typical use cases are:&lt;/p>
&lt;ul>
&lt;li>accessing HiPerGator and Serenity servers from a laptop or desktop machine, while on the UF network&lt;/li>
&lt;li>cloning, pulling, or pushing to GitHub repositories&lt;/li>
&lt;/ul>
&lt;h2 id="what-software-do-i-need">What software do I need?&lt;/h2>
&lt;p>SSH comes pre-installed on macOS, Windows 10, and Linux.
On other versions of Windows, you might want to download &lt;a href="https://gitforwindows.org/" target="_blank" rel="noopener">Git Bash&lt;/a> which also contains some other useful tools.&lt;/p>
&lt;h2 id="what-is-an-ssh-key">What is an SSH key?&lt;/h2>
&lt;p>Access to servers through SSH involves authenticating yourself with a username and password (the same as if you were logging in directly to the machine).&lt;/p>
&lt;p>Alternatively, you can generate a digital &amp;ldquo;key&amp;rdquo; that provides access to the keyholder. Typically, you generate an SSH key for your computer, and then set up the appropriate file on the server(s) you wish to access. Then, instead of authenticating yourself with your username and password for the server, you use the SSH key, which is verified by the paired file previously copied to the server. (You can also provide a passphrase in order to use your SSH key, but as it&amp;rsquo;s stored on your laptop or desktop and requires you to be logged in, this security step is optional; opinions vary on this.)&lt;/p>
&lt;h2 id="setup-instructions-github">Setup Instructions (GitHub)&lt;/h2>
&lt;p>GitHub has fairly detailed &lt;a href="https://help.github.com/en/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent" target="_blank" rel="noopener">instructions&lt;/a> on creating a new SSH key,
and further &lt;a href="https://help.github.com/en/github/authenticating-to-github/adding-a-new-ssh-key-to-your-github-account" target="_blank" rel="noopener">instructions&lt;/a> to enable it for use on GitHub.&lt;/p>
&lt;h2 id="setup-instructions-serenity--hipergator--generic">Setup Instructions (Serenity / HiPerGator / generic)&lt;/h2>
&lt;p>To setup an SSH key for use on Serenity, we assume you have gone through the steps described in the instructions for GitHub to create an SSH key. This creates both the key itself (usually located in &lt;code>~/.ssh/id_rsa&lt;/code>), and a paired file that verifies that the key is correct (usually located in &lt;code>~/.ssh/id_rsa.pub&lt;/code>).&lt;/p>
&lt;p>&lt;strong>If you are setting up an SSH key on HiPerGator, note that you probably already have one in the default location, which is used for communicating between different HiPerGator nodes. You may only need to follow the &lt;a href="https://help.github.com/en/github/authenticating-to-github/adding-a-new-ssh-key-to-your-github-account" target="_blank" rel="noopener">instructions&lt;/a> to enable its use for GitHub as well.&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>Log in to the server you wish to use the SSH key on.&lt;/li>
&lt;li>Edit the &lt;code>~/.ssh/authorized_keys&lt;/code> file &lt;em>on the server&lt;/em>. It may already have contents, in which case, go to a new, blank line.&lt;/li>
&lt;li>Copy over the contents of the &lt;code>~/.ssh/id_rsa.pub&lt;/code> from your local computer. It will look something like:&lt;/li>
&lt;/ol>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">ssh-rsa *************long-string of random characters************* &amp;lt;email address&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>(If you see something like: &lt;code>-----BEGIN RSA PRIVATE KEY-----&lt;/code>, you are using the wrong file!!)&lt;/strong>&lt;/p>
&lt;ol start="4">
&lt;li>Save the modified &lt;code>~/.ssh/authorized_keys&lt;/code> file on the server.&lt;/li>
&lt;li>Try to connect using ssh. In the command line on your local computer:&lt;/li>
&lt;/ol>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">ssh &amp;lt;username&amp;gt;@&amp;lt;server&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>You should authenticate without having to enter your password.&lt;/p></description></item><item><title>Workflow Tools - snakemake &amp; targets</title><link>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/workflow-management/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-83--weecology-wiki.netlify.app/docs/computers-and-programming/workflow-management/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Workflow tools let you automate the running and rerunning of code with multiple steps.
For example, we use them to manage the image processing workflow for our &lt;a href="https://everglades.weecology.org/" target="_blank" rel="noopener">Everglades research&lt;/a>.&lt;/p>
&lt;h2 id="python---snakemake">Python - snakemake&lt;/h2>
&lt;h3 id="getting-started">Getting Started&lt;/h3>
&lt;ul>
&lt;li>&lt;a href="https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html" target="_blank" rel="noopener">Official snakemake tutorial&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://farm.cse.ucdavis.edu/~ctbrown/2023-snakemake-book-draft/" target="_blank" rel="noopener">C. Titus Brown&amp;rsquo;s draft book on using snakemake for bioinformatics&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="handling-complex-inputs-with-input-functions">Handling complex inputs with input functions&lt;/h3>
&lt;p>In our workflows we deal with complex input-output structures, like having early phases of the pipeline work on one flight (file) at a time and later phases work on all of the files from a given site and year as a group.&lt;/p>
&lt;p>This can be accomplished by defining custom input functions.&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://www.embl.org/groups/bioinformatics-rome/blog/2022/10/guest-post-snakemake-input-functions-by-tim-booth/" target="_blank" rel="noopener">Introduction to Input Functions&lt;/a>&lt;/li>
&lt;li>See our &lt;a href="https://github.com/weecology/EvergladesTools/blob/main/Zooniverse/Snakefile" target="_blank" rel="noopener">Everglades workflow Snakefile&lt;/a>&lt;/li>
&lt;/ul>
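As a sketch, a hypothetical input function for a rule that aggregates per-flight outputs by site and year might look like the following (the names, paths, and lookup table are illustrative, not taken from our Snakefile):

```python
from types import SimpleNamespace

# Hypothetical lookup from (year, site) to the flights recorded there;
# in a real Snakefile this would be built from glob_wildcards output.
FLIGHTS_BY_GROUP = {
    ("2022", "StartMel"): ["flight01", "flight02"],
}

def site_year_inputs(wildcards):
    """Return every per-flight output belonging to one year-site group."""
    flights = FLIGHTS_BY_GROUP[(wildcards.year, wildcards.site)]
    return [
        f"processed/{wildcards.year}/{wildcards.site}/{flight}.csv"
        for flight in flights
    ]

# snakemake passes the rule's wildcards object; mimic that here.
print(site_year_inputs(SimpleNamespace(year="2022", site="StartMel")))
```

A rule then references the function by name (e.g. `input: site_year_inputs`), and snakemake calls it with the wildcards resolved for each job, so later aggregate phases can depend on all files from a group.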
&lt;h3 id="testing-snakemake-with-partial-wildcards">Testing snakemake with partial wildcards&lt;/h3>
&lt;p>When testing a big workflow it is often useful to run it on a subset of the data.
For example, our Everglades workflow runs on all years, sites, and flights at once, but we might want to test a single year-site combination when making a change.
To do this, replace your Wildcards object with the component lists for the main workflow. E.g.,&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">ORTHOMOSAICS&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">glob_wildcards&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;/&lt;/span>&lt;span class="si">{year}&lt;/span>&lt;span class="s2">/&lt;/span>&lt;span class="si">{site}&lt;/span>&lt;span class="s2">/&lt;/span>&lt;span class="si">{flight}&lt;/span>&lt;span class="s2">.tif&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">FLIGHTS&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">ORTHOMOSAICS&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">flight&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">SITES&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">ORTHOMOSAICS&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">site&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">YEARS&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">ORTHOMOSAICS&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">year&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The components are just lists, so you can then replace them with whatever pieces of the full workflow you want to test. E.g.:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">ORTHOMOSAICS&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">glob_wildcards&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;/&lt;/span>&lt;span class="si">{year}&lt;/span>&lt;span class="s2">/&lt;/span>&lt;span class="si">{site}&lt;/span>&lt;span class="s2">/&lt;/span>&lt;span class="si">{flight}&lt;/span>&lt;span class="s2">.tif&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">TEST&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">glob_wildcards&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;/blue/ewhite/everglades/orthomosaics/2022/StartMel/&lt;/span>&lt;span class="si">{flight}&lt;/span>&lt;span class="s2">.tif&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">FLIGHTS&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">TEST&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">flight&lt;/span> &lt;span class="c1"># ORTHOMOSAICS.flight&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">SITES&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;StartMel&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">FLIGHTS&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># ORTHOMOSAICS.site&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">YEARS&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;2022&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">FLIGHTS&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1">#ORTHOMOSAICS.year&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="r---targets">R - targets&lt;/h2></description></item></channel></rss>