HPC Systems Administrator

Location: Provo, UT
Job Type: Direct Hire
Date: May 07, 2019
Job ID: 2668667
The HPC Systems Administrator is an integral part of the operations of Research Computing. You will be responsible for architecting, installing, configuring, and maintaining the department infrastructure in cooperation with fellow systems administrators and other staff members. This includes High Performance Computing (HPC) clusters, high performance storage, Ethernet and InfiniBand networks, OS image deployment, batch job scheduling, infrastructure servers, and other ancillary services. You will have significant latitude to make technical decisions and guide the direction of Research Computing.
This position supports Research Computing in its mission to provide reliable, state-of-the-art HPC resources to researchers. Current resources include about 24,000 processor cores and petabytes of storage. All Research Computing systems use Linux.
To fulfill these responsibilities, you must be skilled in many IT-related fields, especially the administration of Linux systems, and already possess many of the skills listed below. High Performance Computing is a field that requires the combination of many specialties and skills. You need to be willing and able to proactively acquire any listed skills that you currently lack.
Skills and Experience
Minimum qualifications: Bachelor's degree, or four years of combined education and experience.
Required skills and experience:
  • Excellent Linux or Unix skills
  • Capability and desire to learn new skills
  • Good verbal and written communication skills
  • Systems programming skills (e.g. Python, Perl, bash)
Desired skills and experience:
  • Linux/Unix systems administration
  • Compiled languages (e.g. C, C++, Fortran)
  • Advanced Unix/Linux shell scripting (e.g. bash, tcsh)
  • Scripting languages (e.g. Perl, Python)
  • Administration of parallel file systems or enterprise-class storage (e.g. Lustre, SAN, NAS)
  • Installation, configuration, monitoring, and maintenance of Ethernet and InfiniBand networks
  • Various server types (e.g. web, DNS, database, mail servers)
  • Virtualization
  • Hardware monitoring (e.g. IPMI, SNMP)
  • Batch job scheduling systems (e.g. Slurm, Moab/Torque, LSF)
  • Backup systems (e.g. Bacula, TSM)
  • MySQL administration
To succeed in this position, you will need to be dedicated and be able to work well with others. You must be very proactive and pay great attention to detail. You will work with the director, the other system administrators, and user support staff to accomplish the mission of the department.
Responsibilities include:
  • Design, implementation, monitoring, and maintenance of:
    • HPC clusters
    • Networks: Ethernet and InfiniBand
    • Centralized storage and backups
    • Various infrastructure services that support the Office of Research Computing (virtualization, web, databases, DNS, etc.)
    • Availability of services
  • Evaluate the acquisition of hardware and software solutions
  • Hire and manage a student hardware technician
  • Assist user support staff as needed, especially to track down potential system issues
  • Automate routine tasks
  • Investigate and propose new methods to improve lab operations
This position has on-call responsibilities. However, after-hours outages have been rare and maintenance is typically performed during business hours.
Export-Control Regulations:
Research Computing supports projects with various export-control restrictions. Employment is restricted to US citizens and lawful permanent residents.