GPU Cluster System Administrator - Military Veterans
at HERC - Upper Midwest
Stanford Research Computing is looking for a talented system administrator to join our team of collaborative and innovative professionals helping Stanford's faculty and students use advanced computing and data tools to explore new frontiers in knowledge and solve some of humanity's most urgent problems. Our staff work directly with some of the world's top researchers in a broad range of disciplines, across all of Stanford's seven schools -- while also supporting and learning from each other in cross-project endeavors. We maintain and steadily improve an advanced research computing facility, and we support a variety of environments for Stanford research. In Stanford Research Computing, you'll have a rare opportunity to contribute to discoveries and inventions that have global reach and positive impact, and to share in the curiosity and commitment of the scholars and scientists who lead these projects.
This new position will support Stanford's world-class data science and AI-focused research by managing and administering an NVIDIA DGX SuperPOD instrument. You and another HPC administrator will partner closely with a team of data scientists from Stanford Data Science to ensure that the GPU cluster environment is configured and operated to maximize research productivity. We'd love to have you join us on this exciting journey.
RESPONSIBILITIES
This role is primarily systems-facing. In this position, you will put to daily use your in-depth knowledge of Slurm and Linux, your HPC cluster administration experience, and your passion for supporting ground-breaking research. You will play a crucial role in optimizing, improving, and sustaining our advanced computing infrastructure.
- HPC Infrastructure Maintenance: Help manage the day-to-day system administration of an NVIDIA DGX SuperPOD and its associated storage, management, and networking infrastructure, in alignment with applicable university, regulatory agency, and/or contractual security and privacy requirements and with instrument governance group decisions.
- Slurm: With a peer administrator, configure and manage Slurm for efficient resource allocation and job scheduling across the cluster, consistent with faculty guidance on system resource usage and utilization.
- GPU Resource Management: Manage GPU resources within the cluster, optimizing utilization for compute-intensive tasks while maintaining a balance between user requirements and system stability. Provide automated, easily accessible resource utilization metrics.
- User Support: Collaborate with Stanford Data Science team members and system users to understand their computing needs, provide technical assistance, and troubleshoot issues related to system performance and job execution. Provide user consultation and training in system use as needed.
- Performance monitoring: Monitor system performance, diagnose bottlenecks, and take necessary actions to improve system performance.
- Documentation: Maintain detailed documentation of system configurations, procedures, and troubleshooting guides to facilitate knowledge sharing and team collaboration. Develop user-facing documentation in coordination with colleagues from Stanford Data Science.
- Planning: Meet regularly with stakeholders to understand existing challenges, anticipated needs, and opportunities for closer collaboration.
- Vendor engagement: Liaise with system vendors and other external partners as needed to ensure system issues are triaged and resolved expeditiously and correctly.
MINIMUM REQUIREMENTS
Education and Experience:
- Bachelor's degree and eight years of relevant experience, or a combination of education and relevant experience. Eight years of increasingly technical work experience preferred.
- In-depth experience managing complex multi-user HPC clusters and storage environments is required, as is experience managing GPU-based infrastructure.
This position requires in-depth knowledge of and substantial hands-on experience with:
- HPC cluster system administration, preferably in an academic/research environment
- GPU technologies and their integration into HPC environments (driver management, software stack tools, monitoring)
- Infiniband (driver management, software stack tools, monitoring)
- Container platforms (e.g., Apptainer)
- Slurm configuration and management
- NFS-based storage management and configuration
- High-performance parallel filesystem (Lustre) management and configuration
- Scripting for system management, monitoring and task automation
- Installing and repairing servers and associated cluster hardware
- Complex technical problem-solving and troubleshooting, with a proactive approach to system optimization and issue resolution
- Security practices and compliance standards in a computing environment
- Collaborating effectively across teams and with researchers
- AI/ML software and frameworks, deep learning, and LLM training
- Bright Cluster Manager
- Pyxis/enroot
- CUDA
- System and storage benchmarking
- DataDirect Networks (DDN) SFA high-performance storage systems
This is a hybrid position, in which you will work on-site at the Stanford campus for a minimum of 3 days a week through the first 9 months of employment, and at least 2 days a week thereafter.
You will be expected to travel to the research data center (3 miles away, on the SLAC campus) as needed to meet with and escort vendor technicians; inspect and troubleshoot hardware; receive and install FRU components for the system; and manage any RMAs. Typically, Stanford service vehicles can be checked out and used for travel between the Stanford campus and the data center. Note that the availability and ability to travel to/from the data center are required on all work days (not only on-site days) and in emergency off-hours situations.
Our core work hours are 9 am - 5 pm Pacific. This role will occasionally require extended hours and weekend work, and you will participate in a rotation of on- and off-site responsibilities during the annual winter closure. Periodically, the data center is shut down for required maintenance. All team members with system responsibilities are expected to be physically on-site to return services to production status at the end of any planned facility outage.
The expected pay range for this position is $148,162 to $168,602 per annum.
Stanford University provides pay ranges representing its good faith estimate of what the university reasonably expects to pay for a position. The pay offered to a selected candidate will be determined based on factors such as (but not limited to) the scope and responsibilities of the position, the qualifications of the selected candidate, departmental budget availability, internal equity, geographic location and external market pay for comparable jobs.
At Stanford University, base pay represents only one aspect of the comprehensive rewards package. The Cardinal at Work website ( https://cardinalatwork.stanford.edu/benefits-rewards ) provides detailed information on Stanford's extensive range of benefits and rewards offered to employees. Specifics about the rewards package for this position may be discussed during the hiring process.
Why Stanford is for You:
Imagine a world without search engines or social platforms. Consider lives saved through first-ever organ transplants and research to cure illnesses. Stanford University has revolutionized the way we live and enriched the world. Supporting this mission is our diverse and dedicated staff of 17,000. We seek talent driven to impact the future of our legacy. Our culture and unique perks empower you with:
- Freedom to grow. We offer career development programs, tuition reimbursement, and course auditing. Join a TED Talk, watch a film screening, or listen to a renowned author or global leader speak.
- A caring culture. We provide superb retirement plans, generous time off, and family care resources.
- A healthier you. Choose from hundreds of health or fitness classes at our world-class exercise facilities. We provide excellent health care benefits.
- Discovery and fun. Stroll through historic sculptures, trails, and museums.
- Enviable resources. Enjoy free commuter programs, ridesharing incentives, discounts, and more.
We look forward to receiving your application and cover letter.
The job duties listed are typical examples of work performed by positions in this job classification and are not designed to contain or be interpreted as a comprehensive inventory of all duties, tasks, and responsibilities. Specific duties and responsibilities may vary depending on department or program needs without changing the general nature and scope of the job or level of responsibility. Employees may also perform other duties as assigned.
Consistent with its obligations under the law, the University will provide reasonable accommodations to applicants and employees with disabilities. Applicants requiring a reasonable accommodation for any part of the application or hiring process should contact Stanford University Human Resources by submitting a contact form.
Stanford is an equal employment opportunity and affirmative action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic protected by law.
Additional Information
- Schedule: Full-time
- Job Code: 4833
- Employee Status: Regular
- Grade: K
- Requisition ID: 105455
- Work Arrangement: Hybrid Eligible
Stanford, CA
The Higher Education Recruitment Consortium (HERC) is a national nonprofit network of higher education and affiliated employers, committed to institutional collaboration, creating diverse workplaces, and assisting dual career couples. Searching for a job in higher ed? Our job board hosts over 30,000 faculty and staff jobs at workplaces that value diversity, equity, and inclusion. Set up your job seeker account today at: http://www.hercjobs.org For our member institutions, we offer recruitment and retention resources, vibrant regional networks, and a new online community of practice, HERConnect. All of our resources can help you advance inclusive excellence at your institution.