Senior Product Manager - Tech, Infrastructure Reliability
Company: Amazon
Location: Austin
Posted on: April 1, 2026
Job Description:
Join Amazon's Fulfillment Technologies & Robotics (FTR) team to
spearhead the product vision for a platform that ensures Amazon's
fulfillment network never stops — even as we move toward fully
self-governing, zero-touch operations. You'll own the roadmap for
an AI-powered infrastructure reliability platform that prevents,
detects, and resolves incidents across thousands of fulfillment
sites globally. This is a rare opportunity for a technically deep
product leader who can write code, deliver proof-of-concepts, and
engage as a peer with data scientists and engineers. You will shape
how LLMs, multi-agent systems, and machine learning are applied to
one of the most operationally critical platforms Amazon has ever
built — and your hands-on technical contributions will directly
accelerate the team's ability to move from idea to production.

Key job responsibilities
- Own and drive the multi-year product roadmap for the
  Infrastructure Reliability AI-Ops platform, spanning three
  strategic programs: zero-touch incident resolution,
  associate-directed work tooling, and predictive failure prevention.
  This means defining the vision, strategy, and success metrics for
  AI-powered progressive detection, incident consolidation,
  self-governing remediation orchestration, and cross-domain
  observability capabilities that serve thousands of fulfillment
  sites globally.
- Go beyond traditional product management by writing code and
  delivering working proof-of-concepts that validate technical
  hypotheses before committing engineering resources. Whether
  prototyping a multi-agent reasoning pipeline, exploring a new
  anomaly detection approach, or stress-testing an LLM prompt chain
  against real incident data, you will use your technical skills to
  compress the distance between idea and validated direction.
- Bring deep knowledge of machine learning fundamentals and apply
  that knowledge to shape how the platform detects, consolidates, and
  reasons about failures. You will engage meaningfully with data
  scientists on model architecture choices, feature engineering
  tradeoffs, and evaluation frameworks, understanding not just what a
  model produces but why, and whether that reasoning can be trusted
  in a production environment where self-governing remediation
  choices carry real operational risk.
- Apply your understanding of AI reasoning techniques (including
  chain-of-thought prompting, retrieval-augmented generation,
  confidence calibration, and evidence accumulation) to define how
  the platform builds progressive confidence about incident severity
  and failure origin rather than making binary decisions from rigid
  thresholds. You will shape how LLMs are applied to diagnostic
  summarization, resolution suggestion, and automated stakeholder
  communication.
- Define the multi-agent architecture that orchestrates detection,
  investigation, consolidation, diagnosis, and remediation as a
  coordinated system rather than isolated capabilities. You will work
  with engineering to define agent roles, communication protocols,
  handoff conditions, and safety boundaries, ensuring that
  self-governing agents act with appropriate confidence and escalate
  when uncertainty is high.
- Translate complex operational and technical requirements into a
  prioritized backlog, making clear tradeoffs between feature depth,
  platform scalability, and autonomous site readiness milestones. You
  will serve as the voice of Incident Managers, domain engineers, and
  Operations Control Center stakeholders, deeply understanding their
  daily workflows and advocating for their needs during
  executive-level planning and prioritization.
- Define and track the business case across all three programs
  (including mean time to resolve improvements, lost labor hour
  reduction, and first page resolution improvement) to secure
  continued investment. You will establish mechanisms to measure
  platform performance against key metrics including auto-detection
  rate, false positive rate, consolidation accuracy, and remediation
  success rate, iterating rapidly based on data.
- Drive cross-functional alignment across Fulfillment Technologies,
  Robotics, Network Engineering, Application teams, and Operations to
  ensure the platform's cross-domain orchestration model is well
  understood and adopted. You will lead executive-level reviews of
  program progress, risks, and investment cases, communicating
  clearly about the path from near-term detection improvements to
  longer-term autonomous site readiness.

A day in the life
You spend most of your time at the
intersection of product strategy and hands-on technical work. A
typical day might start by pulling incident data into a notebook to
test a new detection signal, then jumping into a whiteboard session
with engineers debating multi-agent handoff reasoning. You might
prototype a diagnostic flow in the afternoon just to prove a
concept is worth building. And occasionally you will find yourself
in the operations center watching real operators work through a
network failure, because staying grounded in how people actually
experience the platform is what separates good product decisions
from great ones.

Amazon offers a full range of benefits that support you and eligible
family members, including domestic partners and their children.
Benefits can vary by location, the number of regularly scheduled
hours you work, length of employment, and job status such as
seasonal or temporary employment. The benefits that generally apply
to regular, full-time employees include:
- Medical, Dental, and Vision Coverage
- Maternity and Parental Leave Options
- Paid Time Off (PTO)
- 401(k) Plan

If you are not sure that every qualification on the list above
describes you exactly, we'd still love to hear from you! At Amazon,
we value people with unique backgrounds, experiences, and skillsets.
If you’re passionate about this role and want to make an impact on a
global scale, please apply!

About the team
The Infrastructure
Reliability team sits within Amazon's Robotics organization,
operating as the cross-domain orchestration layer for a fulfillment
network that processes customer orders continuously across
thousands of sites. Our mission is simple and purposeful:
operations never stop, no matter what breaks. We do not own any
single domain — instead, we build the platform that sees across all
of them, identifying failures that cascade across team boundaries
and coordinating the capabilities that domain teams have built to
resolve those failures faster than any single team could alone. We
are now building the AI-powered platform that applies machine
learning, reasoning, and multi-agent orchestration to take our
results from promising to industry-defining. We value expert rigor,
customer obsession, and hands-on technical depth. The ideal
teammate is as comfortable writing a proof-of-concept as they are
writing a product strategy document. If you want to work on a
problem that is technically fascinating, operationally critical,
and commercially enormous, this is the team for you.

- Bachelor's degree
- Experience owning/driving roadmap strategy and definition
- Experience with feature delivery and tradeoffs of a product
- Experience contributing to engineering discussions around
  technology decisions and strategy related to a product
- Experience managing technical products or online services
- Experience in representing and advocating for a variety of critical
  customers and stakeholders during executive-level prioritization
  and planning
- Experience in using analytical tools, such as Tableau, Qlikview,
  QuickSight
- Experience in building and driving adoption of new tools

Amazon is an equal opportunity employer and does not
discriminate on the basis of protected veteran status, disability,
or other legally protected status. Our inclusive culture empowers
Amazonians to deliver the best results for our customers. If you
have a disability and need a workplace accommodation or adjustment
during the application and hiring process, including support for
the interview or onboarding process, please visit
https://amazon.jobs/content/en/how-we-hire/accommodations for more
information. If the country/region you’re applying in isn’t listed,
please contact your Recruiting Partner.

The base salary range for this position is listed below. Your Amazon
package will include sign-on payments and restricted stock units
(RSUs). Final compensation will be determined based on factors
including experience, qualifications, and location. Amazon also
offers comprehensive benefits including health insurance (medical,
dental, vision, prescription, Basic Life & AD&D insurance and option
for Supplemental life plans, EAP, Mental Health Support, Medical
Advice Line, Flexible Spending Accounts, Adoption and Surrogacy
Reimbursement coverage), 401(k) matching, paid time off, and
parental leave. Learn more about our benefits at
https://amazon.jobs/en/benefits.

USA, MA, North Reading - 151,200.00 - 204,600.00 USD annually
USA, TN, Nashville - 143,700.00 - 194,300.00 USD annually
USA, TX, Austin - 151,200.00 - 204,600.00 USD annually
USA, VA, Arlington - 151,200.00 - 204,600.00 USD annually