HPC System Integration Architect (Scientist 3/4) in Los Alamos, New Mexico

Vacancy Name: IRC69749




Job Title: HPC System Integration Architect (Scientist 3/4)


Location: Los Alamos, NM, US

Organization Name: HPC-DES/HPC Design

What You Will Do


The High-Performance Computing (HPC) Division at Los Alamos National Laboratory provides scientific computing resources that include some of the largest HPC systems in the world. The systems team within the HPC Design Group (HPC-DES) is responsible for defining technical direction and for evaluating, developing, and deploying the tools and system software ultimately used in production support of LANL’s HPC resources. These resources currently include a large (19K+ node) Cray system called Trinity as well as numerous large commodity cluster systems.


We seek candidates who want to make significant contributions that impact the HPC technical direction at LANL and ultimately across the DOE and the nation.


For consideration, applicants should submit a resume along with a cover letter addressing how their knowledge, skills, and abilities meet the minimum requirements.


This position will be filled at the Scientist 3 or 4 level, as dictated by current programmatic needs and the skills of the selected candidate. Job responsibilities will be assigned in accordance with the level at which the selected candidate is hired.

What You Need

Minimum Job Requirements:


Computer Scientist (Scientist 3: $96,600-$161,300)


The successful candidate will be required to:

  • Demonstrate a record of accomplishment and expertise in high-performance, large-scale systems integration of diskless clusters and/or file systems.

  • Have knowledge and experience with HPC system hardware definition, characterization, specification, acquisition, deployment, and production readiness.

  • Have advanced practical experience in programming and scripting, such as Bash and other shell scripts, Perl, CFEngine, and Python.

  • Possess good oral and written communication skills.

  • Demonstrate the ability to work within a team environment.

  • Have a working knowledge of networking concepts and practices.

  • Have knowledge of or experience with hardware and software security practices.


  • Relocation benefits will cover your move

  • Top health insurance coverage, free dental and vision

  • Flexible work schedules aimed at work-life balance

  • 3 weeks of vacation and 2 weeks of sick time

  • Free wellness centers, including classes, bike trails, and gyms

  • Ergonomics evaluations ensuring comfortable workstations and increased energy

  • 9.5% company contribution toward 401(k)

Life at the Lab
  • 42 miles of trails around LANL

  • Rated number 1 county to live in by USA Today

  • Top 5 public schools in the country

  • 17 ski basins within 100 miles of the Lab

  • Casual dress environment

About the position:

The selected Scientist 3 will work closely with other HPC-DES systems team members as well as more production-focused team members in other groups in the HPC Division. Projects typically involve collaborations inside and outside the Laboratory, in line with the Laboratory’s history of leadership in HPC. The candidate will be tasked with identifying current and future challenges faced by large-scale HPC applications and working toward production HPC system solutions. In particular, this individual will help design, develop, deploy, and support system software to overcome these challenges. Areas of interest include distributed systems, configuration management, data-aware scheduling, resource allocation, metadata collection, parallel file systems, workflow management, and visualization. This individual will also set direction, goals, milestones, and deliverables for project tasks and establish the associated scope, schedule, and budgets. The person who fills this position is expected to provide Tier 3 support to system administration and help desk staff on various HPC production systems when user feature requests, bugs, or security vulnerabilities cannot be resolved by production teams.

Desired Skills:

  • Demonstrated in-depth experience with Lustre or GPFS.

  • Advanced practical experience in programming using C, C++, and/or Fortran.

  • Practical experience with proprietary interconnects such as the Cray Aries or Gemini network or other proprietary networks.

  • Experience deploying software-defined networks (SDN/NFV).

  • Practical experience with OpenHPC.

  • Practical experience with power-aware computing and scheduling.

  • Practical experience with deployments of NVRAM and other flash technologies.

  • Experience in anticipating needs for hardware and software environments.

  • Extensive experience in Linux, with a complete understanding of configuration files, building diskless nodes, modifying kernel parameters, building a new kernel, and using "Kickstart" files to automate installations.

  • Ability to create reliable, repeatable procedures for production use.

  • Practical experience and advanced knowledge of Ethernet switches, routing, TCP/IP, and configuration of NICs and routers.

  • Practical experience and advanced knowledge of system interconnects, especially InfiniBand, including how to configure them on hosts and switches.

  • Practical experience in taking a large cluster and making its OS and software "production" quality (that is, knowing how to "harden" a Linux system).

  • Practical experience with Slurm.

  • Demonstrated experience leading multi-person projects to meet scope, schedule, and budget.

  • Demonstrated experience in formulating and presenting results to technical audiences and readerships.

  • Experience managing computers in a DOE or DOD classified environment.

  • Active DOE Q Clearance.