Google (2018-)
Team: Personalized Service Health
A GCP tool that provides fast, transparent, relevant and actionable communication
about Google Cloud service disruptions
[official page]
Role: Software Engineer III (2023-03 to current)
Notification Workflow for Scheduled Maintenance (Ongoing Project)
- Design and implement the processing system using a parallel, distributed model
- Design and implement the consumer APIs for viewing all maintenance events
- Co-lead the private preview launch
- Related skills: GCP, Distributed Systems, Parallel Computing, Cloud API, Golang
Team: Virtual Testing
A high-fidelity simulated production environment for development and release
integration testing built on top of Google Compute Engine (Virtual Machine).
This is an internal testbed.
Role: Site Reliability Engineer III (2019-11 to 2023-03)
Role: Software Engineer II (2018-04 to 2019-11)
Simulating Production Environment in VM-based Cluster
- Onboarded multiple Google production services to VM-based test environments.
This unblocked the critical path for testing cloud region turnup.
The onboarded services include:
- Spanner: A distributed SQL database management and storage service
- Name redacted: A remote debug logging system
- Name redacted: A dynamic data push system
- Name redacted: A package management system
- Related skills: Virtualization, Production, Database, Distributed Systems,
Containerized Development, C++, Python, Shell, Golang, Logging
Coverage Analysis for Turnup Tests
- Defined an RPC-based test coverage metric for cloud turnup tests to evaluate
test quality
- Defined a workflow-based test coverage metric for cloud turnup tests to
evaluate test fidelity
- Designed and implemented an instrumentation method for collecting coverage
during tests
- Created dashboards to visualize the coverage data
- Related skills: Software Testing, Distributed Systems, Monitoring,
Visualization, Bazel, Python, Dynamic Analysis
CPU Overcommit for Turnup Tests
- Enabled CPU overcommit in VM-based turnup tests, reducing CPU cost by ~50%
- Created dashboards and SLOs to monitor the performance and reliability
- Designed automatic fallback mechanisms during resource shortage to ensure
the performance and reliability of the test tool
- Improved resource usage efficiency by avoiding fragmentation with a
smarter bin-packing strategy
- Prevented resource contention by applying pessimistic concurrency control
- Related skills: Cloud API, Advanced Algorithms, Monitoring, Parallel Computing,
Distributed Systems, Python
Cluster Lifecycle Automation in Virtualized Environments
- Designed and implemented an automated workflow to bootstrap all the
dependencies of a test environment
- Designed and implemented a tool to update multiple package versions across
all VMs in a cluster statelessly
- Implemented tools to create, snapshot and delete a test cluster
- Related skills: Distributed Systems, Automation, Version Control, Linux,
Python, Golang, Shell
Other work (operational, community)
- Participate in the on-call rotation, following Google’s SRE best practices
- Hosted an intern in summer 2022
- Conduct interviews frequently (10+ interviews per year)
- Led several rounds of bug fix weeks and product excellence reviews
Amazon (2017)
Team: Customer Service Technology
Intern Project: Guided Workflow Card - New Model and Storage
- Initiated a new model of the guided workflow card (a tool used by Amazon
customer service agents during phone calls) to support contextual representation
- Simplified the configuration format. The new model is ~30% of the length of
the old model
- Migrated the backend to a different storage service (Name redacted)
- Implemented the end-to-end authorization workflow for creating a new card
configuration, from configuring the model to storage
- Related skills: AWS, UI design, Database, Java
Skills
Domain Expertise
- Distributed Systems
- Site Reliability Engineering
- Cloud Platforms
- Technical Infrastructure
- Algorithms
- Software Testing
- Databases
- Parallel Programming
- Programming Languages (Type System etc.)
- Google Cloud Platform
- Linux
- Mercurial
- Bazel
- Git
- AWS
- LLVM
- Django
- TensorFlow
- Kubernetes
- Docker
Programming Languages
- Python
- Golang
- C++
- Java
- JavaScript
- Shell
- SQL
- HTML
- MATLAB
- OCaml