Technical Areas
I am a Senior Director of Software Engineering at Meta/Facebook. I work on IaaS and PaaS in Meta's planetary-scale private cloud, which comprises millions of machines and offers capabilities similar to those of public clouds. Currently, I lead Meta's Compute Platform (PaaS) organization, which works on the areas listed below. Much of our work is publicly visible, and more information can be found through the hyperlinks highlighted below.
Platform as a Service (PaaS) & enablement for ML Inference (akin to Azure Resource Manager, Google App Engine, Amazon Elastic Inference)
Serverless (akin to AWS Lambda, Google Cloud Functions, Azure Functions)
Queuing service (akin to AWS SQS)
Workflow engine (akin to AWS Step Functions)
Continuous deployment tool (akin to AWS CodeDeploy, Spinnaker)
Application configuration management (akin to Azure App Configuration)
Distributed coordination and consensus (akin to ZooKeeper)
Data distribution for ML models, container images, etc. (akin to BitTorrent)
RPC (akin to gRPC)
Service mesh and load balancing (akin to Istio, AWS ELB/ALB)
Service discovery (akin to Consul, Eureka)
Key-value store and table storage for control plane (not for user data)
BLOB storage for control plane (not for user data)
Polyglot language support tools
Cloud console (akin to AWS Management Console)
Prior to 2021, I led Meta’s Compute Infra (IaaS) organization and, at different times, its database teams, working on the following foundational technologies:
Cluster management, containers, IaaS & enablement for ML training (akin to Kubernetes, AWS EC2/ECS/EKS)
Programming framework for sharding and stateful services (akin to Azure Service Fabric)
Key-value database for user data (akin to AWS DynamoDB)
Capacity and quota management (akin to AWS Service Quotas)
Caching system (akin to Amazon ElastiCache, Redis)
=================
My publications might help shed some light on my past work. See the full list of my 70+ papers and 97 patents. I run Meta’s systems research program and have helped elevate it to world-class status. In addition to my overall leadership, I have made significant hands-on contributions to the following Meta research papers as I was closely involved in the technical aspects of the work. These publications reflect the production systems at Meta.
[Engineering at Meta] How Meta built the infrastructure for Threads
[SOSP'23] XFaaS: Hyperscale and Low Cost Serverless Functions at Meta
[OSDI'23] Conveyor: One-Tool-Fits-All Continuous Software Deployment at Meta
[OSDI'23] ServiceRouter: Hyperscale and Minimal Cost Service Mesh at Meta
[OSDI'23] Global Capacity Management With Flux
[ISCA'23 Best Paper] Contiguitas: The Pursuit of Physical Memory Contiguity in Datacenters
[IEEE Micro Top Picks'23] IOCost: Block IO Control for Containers in Datacenters
[ASPLOS'22 Best Paper] TMO: Transparent Memory Offloading in Datacenters
[ASPLOS'22] IOCost: Block IO Control for Containers in Datacenters
[SOSP'21] Shard Manager: A Generic Shard Management Framework for Geo-distributed Applications
[SOSP'21] RAS: Continuously Optimized Region-Wide Datacenter Resource Allocation
[OSDI'20] Twine: A Unified Cluster Management System for Shared Infrastructure