This is a unique opportunity to join the creators of a modern streaming platform for mission-critical workloads.
Our customer is the Intelligent Data API company, founded in 2019 and headquartered in San Francisco. They have built an engine that is a drop-in replacement for Kafka, offering higher throughput, easier scaling, and consistently low latencies with a much lower hardware footprint. They are working on a family of products designed to reliably transform data streams into data products by unifying historical and real-time data and enabling inline Lambda transformations, all exposed under a drop-in Kafka-API replacement.
The engine is written in C++17 with DMA (no kernel page cache; their own write-behind, read-ahead, caches, allocators, etc.) and is built for predictable tail latencies. No Zookeeper, no JVM, no code changes required, and 10x faster. It hosts some of the world's largest streaming workloads (petabytes of data) for Fortune 2000 companies, from autonomous aircraft and cars to real-time ad bidding. No matter the scale, it will keep all data safe.
As an SRE you will be part of a cloud team, working with all of engineering on building new services, automating the infrastructure lifecycle on Kubernetes, and monitoring those services with the goal of offering a reliable, scalable, and high-performance SaaS. One of the company’s primary goals is to run a managed, cloud-based streaming-as-a-service with 99.5% uptime or better, and this role is critical to that goal. The team members are based both in the US (New York City, San Francisco, San Diego, Austin, Denver) and internationally, including Colombia, the United Kingdom, Russia, Poland, Israel, and the Czech Republic, and the team is growing!
Site Reliability Engineer
Location: 100% remote
Requirements
- Experience in an SRE-like role.
- Comfortable working with a 100% distributed engineering team, collaborating on GitHub, in the open.
- Strong experience with public cloud providers.
- Experience running highly scalable production workloads reliably on Kubernetes.
- Experience with monitoring at scale.
- Experience managing infrastructure predictably through GitOps and IaC.
- Solid programming skills.
- Excellent written communication skills.
Nice to have
- Strong understanding of Go and Kubernetes.
- Experience operating a SaaS platform.
- Fluency in a couple of programming languages (for example, Go or Python).
- Experience using or operating streaming platforms, as either a user or a provider.
- Experience with the Prometheus monitoring stack.
Responsibilities
- Build & design Vectorized’s cloud infrastructure with reliability and performance in mind.
- Build tools & services to allow automated infrastructure management and self-healing, including deployments and upgrades.
- Be in charge of end-to-end monitoring of our cloud. Layer observability into Kubernetes operators. Prioritize what metrics to collect, drive analysis of those metrics, and influence a roadmap based on that analysis.
- Participate in on-call rotations, working to keep customer workloads running and incident-free.
- Opportunity to work very closely with a group of exceptional engineers.
What we offer
- Working with a company pushing the state of the art in streaming.
- Equity in an early-stage startup.
- 100% remote role.