Email: tangchq AT gmail. LinkedIn. Chinese name: 唐春强. GitHub. Google Scholar. We are hiring!
I am a Senior Director of Engineering at Meta/Facebook. I joined Facebook in 2013 and have worked on a wide range of production systems used by billions of users, encompassing AI, ASIC/GPU/Accelerator, LLM/Llama, hardware/software co-design, High Performance Computing (HPC), Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and databases. Prior to Facebook, I was a research scientist and manager at IBM T.J. Watson Research Center.
The publications below shed light on some aspects of my work. All of them describe hyperscale production systems we have built at Meta, not merely research prototypes.
Recent Best Paper Awards:
[SOSP'24 Best Paper] FBDetect: Catching Tiny Performance Regressions at Hyperscale through In-Production Monitoring
[OSDI'24 Best Paper] ServiceLab: Preventing Tiny Performance Regressions at Hyperscale through Pre-Production Testing
[ISCA'23 Best Paper] Contiguitas: The Pursuit of Physical Memory Contiguity in Datacenters
[ASPLOS'22 Best Paper] TMO: Transparent Memory Offloading in Datacenters
[IEEE Micro Top Picks'24] Contiguitas: The Pursuit of Physical Memory Contiguity in Datacenters
[IEEE Micro Top Picks'23] IOCost: Block IO Control for Containers in Datacenters
[SOSP'23, Best Serverless Paper of 2023] XFaaS: Hyperscale and Low Cost Serverless Functions at Meta. The 9th Workshop on Serverless Computing selected this paper as the best among all serverless papers published in 2023.
More papers:
[ISCA'25] Scaling Llama 3 Training with Efficient Parallelism Strategies
[ISCA'25] Meta's Second Generation AI Chip: Model-Chip Co-Design and Productionization Experiences
[ISCA'25] DCPerf: An Open-Source, Battle-Tested Performance Benchmark Suite for Datacenter Workloads
[Communications of the ACM] Meta's Hyperscale Infrastructure: Overview and Insights
[OSDI'24] MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale
[OSDI'24] Optimizing Resource Allocation in Hyperscale Datacenters: Scalability, Usability, and Experiences
[NSDI'24] MobileConfig: Remote Configuration Management for Mobile Apps at Hyperscale
[OSDI'23] Conveyor: One-Tool-Fits-All Continuous Software Deployment at Meta
[OSDI'23] ServiceRouter: Hyperscale and Minimal Cost Service Mesh at Meta
[OSDI'23] Global Capacity Management With Flux
[ASPLOS'22] IOCost: Block IO Control for Containers in Datacenters
[SOSP'21] Shard Manager: A Generic Shard Management Framework for Geo-distributed Applications
[SOSP'21] RAS: Continuously Optimized Region-Wide Datacenter Resource Allocation
[OSDI'20] Twine: A Unified Cluster Management System for Shared Infrastructure