About the Team
The Delivery Engineering team is an essential part of The New York Times’ engineering organization. Its responsibilities are profoundly technical and include system cloud architecture, developer tooling, observability, and development process, to name a few.
You will be part of the Observability Platform team within Delivery Engineering. The team is responsible for standardizing monitoring and observability practices across engineering teams at The New York Times, providing insights that drive decisions toward a resilient and positive customer experience.
As a senior engineer on the Observability Platform team, you will build capabilities such as secure provisioning of tools, reliable data collection, and data correlation and visualization, which help derive insights that improve the visibility and reliability of The New York Times products.
These observability systems cover the management of logs, metrics, tracing, and profiling data; open source standards (OpenTelemetry, OpenMetrics, etc.); and techniques for data correlation and visualization. You will report to the Senior Manager of Observability Platforms and contribute to these capabilities, as well as evaluate the current observability practices and tooling and evolve them to be more efficient.
Improve NYT’s observability landscape by providing easy access to metrics, logs, tracing, and profiling through a reliable set of tools.
Determine and promote the right observability patterns and instrumentation/correlation standards to suit the needs of NYT applications.
Build high-quality, reliable insights that provide visibility and standards for key indicators of the health of our most critical systems.
Work within multiple areas of focus (e.g., reliability, edge platform, cloud runtime, secrets management, deployment pipeline, containerization) to research, strategize, and propose solutions that meet requirements, reduce friction for product engineers, and consolidate existing solutions.
Promote Observability and SRE best practices through Architecture Reviews, blameless postmortems, technical talks, and tooling.
Document best practices, prescribed solutions, and production support playbooks.
Provide production support by participating in on-call rotations for the systems we build and by providing expertise to users of our solutions.
Contribute to our mission of reaching 10+ million paid subscribers by 2025.
8+ years of backend software development experience, including 5+ years building high-quality secure provisioning automation, seamless self-service onboarding practices, and shared observability platforms for an entire organization, with a focus on reliability and operational visibility.
A high degree of passion for and interest in Observability and Reliability practices, and experience working with application teams to deliver their apps and services to high observability standards.
A good grasp of multi-tier application architecture, concepts of reliable system engineering, and open source observability standards and practices (e.g., OpenTelemetry, OpenMetrics).
Solid programming and troubleshooting skills. You may be called upon to help with systems written in Go, Python, Java, Scala, PHP, and Ruby, among many other programming languages. We don’t expect you to know everything, but we do expect you to learn and adapt quickly to these needs.
An understanding of cloud-based design and deployments on Amazon Web Services and/or Google Cloud Platform.
A passion for automation and proficiency with cloud-native app development (e.g., 12-factor apps, container orchestration technologies, and immutable cloud provisioning).
A bias towards helping people. Many teams will rely upon you for help to build their systems.
A high degree of empathy for existing solutions and issues. The New York Times is modern in many ways but is also prone to the issues that a 165-year-old organization may have, including legacy systems. There are many things to fix.
Nice To Have Experience
Site Reliability Engineering (SRE) practices and blameless incident management for large, system-wide issues
Configuring and deploying systems and software in production
Infrastructure as Code (IaC) practices, specifically using open source projects like Terraform, Vault, Consul, and OPA
Some of the tech we use:
Sumo Logic, Datadog, OpenTelemetry, Go, GCP, AWS, Docker, Kubernetes, Drone, Terraform, Vault, Consul, Fastly