Site Reliability Engineer, Observability
Ripple · New York, NY, United States
$160–200k
mid
At Ripple, we’re building a world where value moves like information does today. It’s big, it’s bold, and we’re already doing it. Through our crypto solutions for financial institutions, businesses, governments and developers, we are improving the global financial system and creating greater economic fairness and opportunity for more people, in more places around the world. And we get to do the best work of our career and grow our skills surrounded by colleagues who have our backs.
If you’re ready to see your impact and unlock incredible career growth opportunities, join us and build real-world value.
Ripple acquired Ripple Treasury, now a Ripple solution, in 2025, marking a significant expansion into the multi-trillion-dollar corporate finance arena.
Ripple Treasury has more than 40 years of experience supporting some of the world’s largest and most sophisticated companies. Integrating its treasury command center into Ripple’s technology stack gives corporates the ability to move, manage, and optimize liquidity in real time, across traditional and digital assets, under one expanded umbrella.
Join us to build the future of corporate treasury and the infrastructure that powers the Internet of Value.
THE WORK:
As a Site Reliability Engineer, you will be a force multiplier, elevating engineering capabilities across observability and incident management. You will empower Ripple's stream-aligned engineering teams to detect, diagnose, and resolve production issues quickly and effectively, helping keep our products highly available, performant, and resilient at scale for customers managing trillions in annual payment volume. You will be part of Ripple's Technical Operations team, coaching teams to build comprehensive monitoring, effective alerting, and mature incident response practices. Through workshops, consultation, and hands-on guidance, you'll help teams achieve operational excellence and self-sufficiency. If you're passionate about building capabilities in others and creating lasting impact through observability and incident management, this is the opportunity for you.
WHAT YOU’LL DO:
Observability Enablement
Coach teams on instrumenting applications with structured logs, metrics, and distributed traces using New Relic and OpenTelemetry
Guide teams in creating effective dashboards, alerts, and SLOs/SLIs that provide actionable insights into system health and reduce Mean Time to Detection (MTTD)
Teach teams to define and track error budgets, using them to balance feature velocity with reliability
Provide hands-on guidance during production incidents to coach real-time troubleshooting using observability data
Develop golden path examples for instrumentation patterns, dashboard templates, and alert configurations that teams can adopt independently
Help teams optimize their use of New Relic (APM, Infrastructure, Logs, Synthetics) across Azure and AWS multi-cloud environments
Build team capability to identify and resolve performance bottlenecks, resource constraints, and degradation patterns
Incident Management Administration & Enablement
Administer and configure the Incident.IO platform, ensuring it supports effective incident response workflows across all engineering teams
Coach teams on incident response best practices: classification, escalation, communication, coordination, and resolution
Help teams establish on-call rotation schedules, runbooks, and escalation policies that ensure appropriate incident coverage
Facilitate post-incident review (PIR) processes, teaching teams to identify root causes, document learnings, and implement preventive measures
Guide teams in defining incident severity levels and response procedures aligned with business impact
Integrate observability tooling (New Relic) with incident management (Incident.IO) to enable rapid detection and diagnosis
Track and report on incident metrics (MTTR, MTTD, incident frequency) and help teams drive continuous improvement
Facilitate incident management simulations (game days, failure injection exercises) to build team readiness
Cross-Functional Impact
Enable 4-6 teams per quarter to successfully adopt improved observability or incident management practices through workshops, consultation, and hands-on guidance
Identify and remove operational bottlenecks in monitoring and incident response, helping teams reduce MTTR and improve reliability
Collaborate with the Subsystems Platform Team to translate common needs into self-service observability and incident management capabilities
Facilitate knowledge sharing through documentation, training materials, and communities of practice that build lasting team competence
Measure and track team progress on observability maturity and incident management effectiveness, demonstrating measurable improvement
Work across Azure (80%) and AWS (20%) environments, supporting teams operating on both Windows (80%) and Linux (20%) infrastructure
WHAT YOU'LL BRING:
Core SRE Experience
5+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering with a strong focus on observability and production operations
Proven ability to coach and mentor engineering teams with excellent communication and teaching skills across technical and non-technical audiences
Consultative mindset with the ability to influence and guide teams without direct authority
Experience working in Agile/Scrum environments and collaborating with cross-functional teams
Observability Expertise (Required)
Expert-level hands-on experience with New Relic (APM, Infrastructure Monitoring, Logs, Synthetics, Alerts) and strong proficiency writing NRQL queries for troubleshooting
Proven experience implementing instrumentation in application code (OpenTelemetry, Serilog, or similar frameworks)
Deep understanding of structured logging, metrics collection (RED/USE methods), distributed tracing, and creating effective dashboards and alerts
Expertise defining and implementing SLOs/SLIs and error budgets for reliability management
Demonstrated ability to troubleshoot complex production issues using observability data across distributed systems
Incident Management Expertise (Required)
Hands-on experience with incident management platforms (Incident.IO, PagerDuty, Opsgenie, or similar)
Proven track record managing and facilitating production incidents from detection through resolution
Experience designing and implementing incident response processes, escalation policies, and on-call rotations
Strong facilitation skills conducting post-incident reviews (PIRs/postmortems) that drive actionable improvements
Understanding of incident severity classification, SLA/SLO breach procedures, and customer impact assessment
Infrastructure & Tools Experience (Required)
Strong experience with Azure cloud platform (App Services, Virtual Machines, Azure SQL, networking, monitoring) and working knowledge of AWS services
Experience with both Windows and Linux server environments
Familiarity with Infrastructure as Code (Terraform) for provisioning monitoring resources
Experience with Azure DevOps, Octopus Deploy, and GitHub in the context of deployment visibility and change tracking
Understanding of how deployment practices impact observability and incident response
Additional Valued Experience
Experience measuring and improving key reliability metrics (MTTR, MTTD, availability, error budgets) across engineering organizations
Experience building and scaling on-call practices across multiple teams
Background facilitating chaos engineering or game day exercises to build team resilience
Experience with Jira for incident tracking and workflow automation
Knowledge of VM-hosted SQL Server monitoring and performance optimization
Familiarity with FinTech compliance requirements (SOC 2, ISO 27001) and audit evidence collection
Experience building communities of practice around observability and incident management
Industry certifications such as New Relic Programmability Certification, AWS/Azure certifications, or SRE/DevOps certifications
Experience with scripting languages (PowerShell, Python, Bash) for automation and observability instrumentation
Other common names for this role: Senior Site Reliability Engineer, Observability Engineer, Incident Management Engineer
For positions that will be based in NY, the annual salary range for this position is below. Actual salaries may vary based on numerous factors including, among other things, an individual applicant’s experience and qualifications for the position. This range does not include equity or additional compensation, such as bonuses or commissions.
NY Annual Base Salary Range
$160,000 — $200,000 USD
WHO WE ARE:
Do Your Best Work
The opportunity to build in a fast-paced start-up environment with experienced industry leaders
A learning environment where you can dive deep into the latest technologies and make an impact. A professional development budget to support other modes of learning.
Thrive in an environment where every employee, no matter what race, ethnicity, gender, origin, or culture they identify with, is a respected, valued, and empowered part of the team.
In-office collaboration for moments that matter is important to our culture, and we give managers and teams the flexibility to decide which 10+ days a month they come in.
Bi-weekly all-company meeting with business updates and an ask-me-anything-style discussion with our Leadership Team
We come together for moments that matter, which include team offsites, team bonding activities, happy hours, and more!
Take Control of Your Finances
Competitive salary, bonuses, and equity
Competitive benefits that cover physical and mental healthcare, retirement, family forming, and family support
Employee giving match
Mobile phone stipend
Take Care of Yourself
R&R days so you can rest and recharge
Generous wellness reimbursement and weekly onsite & virtual programming
Generous vacation policy - work with your manager to take time off when you need it
Industry-leading parental leave policies. Family planning benefits.
Catered lunches, fully-stocked kitchens with premium snacks/beverages, and plenty of fun events
Benefits listed above are for full-time employees.
Ripple is an Equal Opportunity Employer. We’re committed to building a diverse and inclusive team. We do not discriminate against qualified employees or applicants because of race, color, religion, gender identity, sex, sexual identity, pregnancy, national origin, ancestry, citizenship, age, marital status, physical disability, mental disability, medical condition, military status, or any other characteristic protected by local law or ordinance.
Please find our UK/EU Applicant Privacy Notice and our California Applicant Privacy Notice for reference.
Posted 2026-06-18