This guide provides an overview of the services provided by the HPC. If you have any further questions about the HPC and its usage, please email us at digitalresearch@abdn.ac.uk.



Contents:


  1. Compute per Project
  2. Storage per Project
  3. Backup and Restore
  4. Scheduling/Prioritisation
  5. Availability, Maintenance, and Unplanned Disruptions
  6. Monitoring and Arbitration
  7. Costs
  8. Support and Documentation

Compute per Project



  • Access to up to 200 job slots/CPU cores at a time, subject to availability. This can be increased by arrangement.
  • Ability to run jobs with a RAM allocation of up to 200GB.

Storage per Project



  • 50GB of resilient, backed-up personal home space.
  • In addition, users are allocated a quota of 1TB scratch space for working storage.
  • Additional Vault storage may be available by negotiation. Vault provides longer-term storage on Maxwell and is useful when projects need to reuse the same data repeatedly over time.


Backup and Restore



HPC Backup and Restore Policy

Home space
Data is backed up as follows:
  • Daily backups kept for 14 days
  • Weekly backups kept for 1 week
Files can be restored as follows:
  • To the path the folder/files came from, or
  • To a different specified HPC file path, or
  • To a specified shared drive
Request process: contact digitalresearch@abdn.ac.uk.

Shared scratch
Not backed up.

Vault
Not backed up.


Scheduling/Prioritisation



The cluster runs the Slurm workload manager to automatically allocate jobs submitted by users onto available compute nodes. Projects that supply grants or other funds to support Maxwell's usage are given priority on the scheduler.

  • The scheduler balances the availability of slots among all users to permit fair access to the system.
  • It considers the specific requirements of each job (e.g., number of CPUs, amount of RAM, job duration, and node affinity) as well as its prioritisation.
  • The scheduler starts queued jobs as space becomes available and can be set to advise users of job status by email.
  • Larger jobs requiring more time and resources are more difficult to schedule, so it benefits all users to make resource requests as accurate as possible (see the example batch script after this list).
  • Smaller jobs will be backfilled into available space and may therefore start and complete earlier than larger jobs.
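As an illustration, the sketch below shows how these requests are typically expressed in a Slurm batch script submitted with the 'sbatch' command; the job name, resource values, notification address, and program name are placeholders rather than Maxwell-specific recommendations.

    #!/bin/bash
    #SBATCH --job-name=example_job          # placeholder job name
    #SBATCH --cpus-per-task=4               # CPU cores requested
    #SBATCH --mem=16G                       # RAM requested; blocked for this job even if not fully used
    #SBATCH --time=12:00:00                 # runtime requested (the default is 24 hours if omitted)
    #SBATCH --mail-type=END,FAIL            # email notification when the job ends or fails
    #SBATCH --mail-user=user@example.ac.uk  # placeholder notification address

    ./my_analysis                           # placeholder for the program the job runs

The script would then be submitted with, for example, 'sbatch myjob.sh'.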


An awareness of the following will help users provide accurate information when scheduling a job:

  • When insufficient memory is requested for a job, the job will not run and will need to be rescheduled with more memory requested.
  • Where more memory is requested than is used, the full requested amount is still attributed to the user's account, as this resource is blocked and is not usable elsewhere.
  • The default runtime of any job is 24 hours.
    • If more time is needed, it must be requested explicitly.
    • Less time can also be requested.
  • When insufficient time is allocated to a job, the job will be stopped when the allocated time has elapsed and will need to be rescheduled with more time requested.
  • Where the actual time used is less than the time requested, only the actual time will be attributed to a user’s account.
  • Interactive jobs can run only when the requested resources (e.g. CPUs and memory) are immediately available on Maxwell.
  • Once a job has been submitted, information on it is available from Maxwell using the ‘squeue’ command. This advises users of the status of a running job or the priority of a queued job (see the examples below).
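For example, standard Slurm commands can be used to check the queue or to request an interactive session; the job ID and resource values below are illustrative, and any partition or account options Maxwell may require are not shown.

    squeue -u $USER      # list your own queued and running jobs
    squeue -j 123456     # status of a specific job (illustrative job ID)
    srun --cpus-per-task=2 --mem=8G --time=02:00:00 --pty bash   # interactive session; starts only when the requested resources are free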

Availability, Maintenance, and Unplanned Disruptions



  • Maxwell is designed to ensure maximum availability and continued operation even if some cores or nodes stop functioning correctly. When this happens, the issues will be resolved, as far as possible, without additional disruption to the cluster.
  • Planned maintenance will be communicated in advance to all users and will be scheduled to cause minimum disruption.
  • Every effort will be made to ensure there are no unplanned disruptions to the service. Where events, either internal or external to the HPC, do cause disruption, every effort will be made by Digital Research to restore service as quickly as possible. This may involve work with our suppliers.


Monitoring and Arbitration



  • The Digital Research Services Team is responsible for monitoring the use of the system and should be contacted via digitalresearch@abdn.ac.uk to resolve any perceived scheduling or prioritisation issues.

Costs



Funded Projects


  • The HPC uses a queuing algorithm which prioritises funded projects.
  • Costs must be included in Worktribe grant applications:
    • £100 minimum per funded project (HPC account, set-up support, and up to 1000 CPU core hours)
    • 10p per core hour for compute (see the illustrative calculation after this list)
    • plus 10p per core hour per GPU (where a GPU is required)
    • £400 per day for additional support (e.g. installation and troubleshooting of bespoke applications)
    • Additional storage by negotiation.
  • For a quote, please contact digitalresearch@abdn.ac.uk.
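As an illustrative calculation only (please confirm actual pricing via the quote process above): a funded project using 5,000 CPU core hours of compute would be charged 5,000 × £0.10 = £500, while a project using 1,000 core hours or fewer would pay the £100 minimum.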


Unfunded Projects


Limited free use (up to 1000 core hours) for:

  • Small pilot projects
  • Unfunded PGR projects
  • Unfunded UG/MSc projects


Test the HPC


Free of Charge (up to 500 core hours). 


Support and Documentation




To see all HPC documentation, you must be logged in to Fresh Service.


For additional support, please contact digitalresearch@abdn.ac.uk.