Principal Site Reliability Engineer
Company: Arm Treasure Data Ltd
Posted on: February 22, 2021
Treasure Data began by offering data warehousing and processing
services; since then we've moved further up the value chain with
our Customer Data Platform application (CDP), which is seeing a lot
of traction with customers new and old. This growth has prompted a
greater focus on Site Reliability Engineering as we're growing past
our current practices, and we're looking to add a 9th member to our
team. As such, you'll play an essential role in maturing the
company's approach to service reliability and continuity. The team
and you will be directly responsible for solutions for the platform
in these key areas: availability, latency, performance,
efficiency, change management, monitoring, emergency response, and
capacity planning. This will require working with engineering
teams on complex problems/projects where analysis of situations or
data requires an in-depth evaluation of multiple factors and wise
trade-offs between competing factors when arriving at a solution.
Success in this role requires a passion for helping others and
making their lives better. You do this by simplifying complex
systems to make them understandable and operable. You are able to
effectively communicate decisions, ideas, designs, and operation of
systems and services in a clear and concise manner. You are both a
generalist, capable of picking up and working with multiple,
disparate systems, and an expert, having an ability to dive deep
into specific topics and quickly master them. You comfortably move
between system, service, and instance level views. You have a love
of stateful systems containing Treasured data, ensuring we continue
to protect customer data from loss occurring from outages.
Things You Will Do
- Build and maintain services, automation, and tooling that will
positively impact key areas (see above). With our team, be
responsible for the systems you build.
- Drive continuous improvement by measuring and reducing the
amount of manual operational work.
- Help us measure and improve reliability and performance across
the product line by working with product owners and engineering.
- Make wise decisions balancing availability and delivery, and
communicate those decisions clearly.
- Be an active participant in, and internal evangelist for, our shared
processes, such as blameless post-mortems.
- Work with engineering teams as a subject matter expert on
operating software and systems at scale, teaching them from your
experience or know-how, and helping them reach their goals.
- Investigate system performance, errors, and problems.
Your Background and Skills Will Include
- A minimum of 5 years of relevant working experience.
- Experience building and maintaining software addressing key SRE
areas of responsibility (see above).
- Strong Software Engineering experience, with an ability to work
in multiple programming languages.
- Experience with Distributed Systems and operating them as they
- Experience operating services running in the cloud (AWS
primarily) or virtualized API-driven platforms.
- Articulate and personable with strong spoken and written
English language abilities.
- Knowledge and experience in Systems Engineering,
Administration, and Operations.
- Demonstrated ability to work independently and
collaboratively as part of a specialized team.
- Ability to slow down and communicate clearly and effectively
across language barriers.
We Would Be Thrilled If You
- Have experience automating datastore operations or datastores
as a service.
- Have crafted APIs and specifications that allow for future
non-breaking changes while remaining backwards compatible for as
long as possible.
- Have experience analyzing system-wide performance: latency,
throughput, and efficiency.
- Are a student of complex systems theory and how to build resilient
and adaptive systems.
- Are able to build services backed by BLOB, relational, and/or
document data stores, currently: S3, PostgreSQL, and DynamoDB.
- Have experience working as part of a distributed or partially
distributed team and thrive in a highly collaborative and
communicative work environment.
- Pride yourself on giving back to your community: open source
contributions, speaking, teaching, mentoring, or helping
- Have experience speaking and/or writing Japanese.
Working at Treasure Data
You can expect a work environment where the team is
collaborative and open to your ideas, while we keep our collective
eye on supporting our customers' needs. Our team is committed to
technical innovation in our product and in the world through
customer collaboration, open-source projects, and by continuing to
make our product an integral part of our customers' growth and
success. We are an equal opportunity employer dedicated to building
an inclusive and diverse workforce. We do not discriminate on the
basis of race, religion, color, national origin, gender, sexual
orientation, age, marital status, veteran status, or disability
status.
About Us
Treasure Data provides an end-to-end, fully
managed cloud service (data acquisition, storage and analysis
capability) for Big Data that is trusted and simple. As the
original developers of Fluentd, an advanced open-source log
collector specifically designed to solve the big data log
collection problem, Treasure Data solves problems for companies
that want to manage their big data needs.
Agencies and
recruiters, we cannot consider your candidate(s) without a contract
in place. Any resumes received without an active agreement
will be considered gratis referrals to us. Thank you for your
understanding and cooperation!
Keywords: Arm Treasure Data Ltd, Vancouver, Principal Site Reliability Engineer, Engineering, Vancouver, Washington