Email: tangchq AT gmail. LinkedIn profile. Chinese name: 唐春强
I am a Senior Director of Engineering at Meta/Facebook. I currently work on AI and Systems Co-design, spanning AI, LLMs, GPUs and other accelerators, hardware, and systems software in general. Previously, I worked on Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and databases in Meta's planetary-scale private cloud, which comprises millions of machines and offers capabilities similar to those of public clouds. Prior to Facebook, I was a research scientist and manager at IBM T.J. Watson Research Center.
My research publications might help shed some light on my past work; see the full list at Google Scholar.
Selected publications from my work at Meta/Facebook:
[OSDI'24] MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale
[OSDI'24] ServiceLab: Preventing Tiny Performance Regressions at Hyperscale through Pre-Production Testing
[OSDI'24] Optimizing Resource Allocation in Hyperscale Datacenters: Scalability, Usability, and Experiences
[IEEE Micro Top Picks'24] Contiguitas: The Pursuit of Physical Memory Contiguity in Datacenters
[NSDI'24] MobileConfig: Remote Configuration Management for Mobile Apps at Hyperscale
[Meta Engineering Article] How Meta built the infrastructure for Threads
[SOSP'23] XFaaS: Hyperscale and Low Cost Serverless Functions at Meta
[OSDI'23] Conveyor: One-Tool-Fits-All Continuous Software Deployment at Meta
[OSDI'23] ServiceRouter: Hyperscale and Minimal Cost Service Mesh at Meta
[OSDI'23] Global Capacity Management With Flux
[ISCA'23 Best Paper] Contiguitas: The Pursuit of Physical Memory Contiguity in Datacenters
[IEEE Micro Top Picks'23] IOCost: Block IO Control for Containers in Datacenters
[ASPLOS'22 Best Paper] TMO: Transparent Memory Offloading in Datacenters
[ASPLOS'22] IOCost: Block IO Control for Containers in Datacenters
[SOSP'21] Shard Manager: A Generic Shard Management Framework for Geo-distributed Applications
[SOSP'21] RAS: Continuously Optimized Region-Wide Datacenter Resource Allocation
[OSDI'20] Twine: A Unified Cluster Management System for Shared Infrastructure
Selected publications from my work at IBM Research:
[USENIX ATC'11] FVD: A High-Performance Virtual Machine Image Format for Cloud. Download FVD code for QEMU
[USENIX ATC'09] vPath: Precise Discovery of Request Processing Paths from Black-Box Observations of Thread and Network Activities
[SIGIR'08] On Iterative Intelligent Medical Search
[SIGMOD'07] Resource-Adaptive Real-Time New Event Detection
[WWW'07] A Scalable Application Placement Controller for Enterprise Data Centers
[SCC'06 Best Paper] A Distributed Service Management Infrastructure for Enterprise Data Centers Based on Peer-to-Peer Technology
[SIGMETRICS'05] Low Traffic Overlay Networks with Large Routing Tables
Selected publications from my PhD work:
[SIGCOMM'03] Peer-to-Peer Information Retrieval Using Self-Organizing Semantic Overlay Networks
[NSDI'04] Hybrid Global-Local Indexing for Efficient Peer-to-Peer Information Retrieval
[SIGIR'04] On Scaling Latent Semantic Indexing for Large Peer-to-Peer Systems
[ICPP'02 Best Paper] Multi-level Shared State for Distributed Systems