Site Reliability Engineer - Vice President

Singapore, Singapore
16 Sep 2022
27 Sep 2022
Job Function
Industry Sector
Finance - General
Employment Type
Full Time
About Us

J.P. Morgan is a global leader in financial services, providing strategic advice and products to the world's most prominent corporations, governments, wealthy individuals and institutional investors. Our "first-class business in a first-class way" approach to serving clients drives everything we do. We strive to build trusted, long-term partnerships to help our clients achieve their business objectives.

We recognize that our people are our strength and the diverse talents they bring to our global workforce are directly linked to our success. We are an equal opportunity employer and place a high value on diversity and inclusion at our company. We do not discriminate on the basis of any protected attribute, including race, religion, color, national origin, gender, sexual orientation, gender identity, gender expression, age, marital or veteran status, pregnancy or disability, or any other basis protected under applicable law. In accordance with applicable law, we make reasonable accommodations for applicants' and employees' religious practices and beliefs, as well as any mental health or physical disability needs.

About the Team

The Corporate & Investment Bank (CIB) is a global leader across investment banking, wholesale payments, markets and securities services. The world's most important corporations, governments and institutions entrust us with their business in more than 100 countries. We provide strategic advice, raise capital, manage risk and extend liquidity in markets around the world.

Within the CIB, Athena is the strategic platform for risk, pricing, and trade management. As a platform, it is made up of a client, a job scheduling engine, an eventually consistent distributed database, and a software development lifecycle toolset. Underpinning this is a layer of configuration and processes that broadly abstracts the end user from the complexities of infrastructure lifecycle management and connectivity concerns.

As an SRE, your challenge is to help us build a telemetry platform, rationalizing many different current toolsets into a coherent whole. Our current strategy is based on a hierarchy of metrics, traces, logs and synthetic tests, primarily aligning with a subset of CNCF projects.

However, the role will require hands-on L3 support of issues and time spent using the current "eyes on glass" alerting tooling, both to better understand the problems we are solving and to help move operational concerns and resiliency patterns to the top of the whole department's priorities.

This is not an automation role. There are release, infrastructure and code deployment teams outside of SRE. SRE will automate their processes, but your role will not be to automate the unfinished code of others.

As a Site Reliability Engineer (SRE) you will help build a meaningful engineering discipline, combining software and systems to develop creative engineering solutions to operations problems. Much of our support and software development focuses on optimizing existing systems, building infrastructure and reducing work through automation. You'll join a team of curious problem solvers with a diverse set of perspectives who are thinking big and taking risks. In this environment you'll take the lead on relevant projects, supported by an organization that provides the support and mentorship you need to learn and grow. As an SRE you'll be focused on running better production applications and systems.

  • Troubleshoot priority incidents, facilitate blameless post-mortems and ensure permanent closure of incidents
  • Engage with the development team throughout the life cycle to help develop software for reliability and scale, ensuring minimal refactoring or changes
  • Identify application patterns and analytics in support of better service level objectives
  • Design self-healing and resiliency patterns
  • Coach or manage teams as applicable
  • Willingness to join incident calls
  • Design, implement and support a telemetry strategy across multiple platforms
  • Ability to work within boundaries set by a wider org, whilst also constructively challenging external constraints
This role requires a wide variety of strengths and capabilities, with a mixture of experience from the list below:
  • Bachelor's degree
  • 5 years' hands-on coding expertise in at least one current technology stack, designing, coding, testing, and delivering software
  • 3 years' hands-on experience supporting business-critical, large-scale production environments
  • 2 years' hands-on experience with Kubernetes in a public cloud environment, ideally EKS or GKE
  • 4 years' experience working with time series data for monitoring and observability (Prometheus, Influx, TSDB, KDB)
  • 2 years' experience working with any one of the CNCF distributed tracing projects
  • 3 years' experience working with any one of the CNCF logging projects
  • 3 years' working knowledge of infrastructure components (e.g. routers, load balancers, cloud products, container systems, compute, storage and networks)
  • Hands-on experience running chaos engineering tests, preferably in production
  • Hands-on experience using SLOs and SLIs to support a production environment