“One of the things that always strikes me as pretty unusual and pretty special about Facebook is that dozens of people are working on something that potentially billions of people use. People really get strong ownership, to really work, and develop these features and services, operate and troubleshoot them, fix them, evolve them, rearchitect them.”
Facebook’s core platform, the one that services some of the most popular apps and services on the planet, handles 15 billion queries every second. 15 billion. Every second. In this interview, Jay Parikh, a longtime technical leader at the company, talks about building the infrastructure, from dirt to device, that makes it all possible, and about what it means to lead a global engineering community that’s responsible for developing everything from data centers to autonomous aircraft.
Below is an excerpt from Jay’s conversation with Steve Herrod and Quentin Clark from this episode.
Jay Parikh | Equivalent to Magic
Quentin Clark: Jay Parikh is Facebook’s VP of engineering. He’s the architect that’s behind Facebook’s data center infrastructure. He helped design and execute the physical layer that underpins the platform.
Steve Herrod: That’s right. Jay’s been with the company for over a decade, which means he has witnessed one of the most stunning growth stories in modern business. Today, Facebook has three billion users; at the turn of the last decade, users numbered just in the hundreds of millions.
Jay Parikh: When I joined back in 2009, we were growing really, really rapidly. There were about 300 million or so people using Facebook at the time.
“… we were just starting to feel the pressure and the stress and the strain of the infrastructure being able to keep up with this acceleration of growth. We bought servers from the likes of HP and Dell, we put them in typical colocation facilities, but these things weren’t keeping up with our growth and we really had to rethink our approach.”
And that became this journey where we started to invest and really build out our infrastructure layer by layer, component by component. Fast forward to today, we’re serving a community of three billion people using Facebook. And the platform, you can think of it as an internal cloud infrastructure: hardware and software that is custom built and tuned to support this underlying technology base that powers our family of apps, the Facebook app, Messenger, WhatsApp, Instagram, Oculus, Workplace, et cetera.
One of Jay’s biggest tests came a few years into the job when Hurricane Sandy smashed into the eastern seaboard in 2012. Over eight million homes across 17 states in the US lost power, sometimes for weeks, buildings were swept into the ocean, the electrical grid crumbled. And Facebook’s networking system was also under acute threat.
People needed to be able to communicate with loved ones, they needed to be able to share, they needed to be able to coordinate some of the response and crisis management activities.
“We became this essential service that was so critical for millions of people to be able to navigate that storm and then the cleanup efforts and the recovery efforts after that… it really sunk in deep for many of us in the team to say we were such a critical service at this time.”
We probably got pretty lucky at this time and probably needed to rethink how we were approaching just the overall reliability and resiliency of our software, of our hardware systems… So we started with the data centers, we built one type of server. It was a very simple computing server. From there, we expanded to all of the other server SKUs that we do for storage and other types of workloads that we optimize for internally. Then a year or two later, we said, “Okay, we’ve got to start peeling the onion on the networking stack.” So we built our own networking switch. At that point, we had a switch, but we needed to build a piece of software, we needed to run our own operating system on the switch to be able to give us the type of network management and the type of performance and the type of routing that we needed to have within the data center to handle this ever-increasing amount of data that’s flowing between our servers.
So, we started with our first custom-built data center, in Prineville, Oregon. We bought a big piece of land and had a really, really small team that just started off with a few people thinking through, “Hey, how do we build a really big data center that can give us a lot of capacity, a lot of performance and a lot of flexibility? But we want to do it in a hyper-efficient way.” And I think one of the neat approaches that we had here is that we brought systems-level thinking to this. And as I joke with you, Steve, we always think about our infrastructure going from the dirt to the device. And really thinking about the system, not just being, hey, there’s a big building and then here are some servers you put in it. And then there’s some other networking stuff that happens. What we really tried to do is think about all of that being one unit that has to operate super efficiently for us, it has to operate together and it has to give us flexibility for what’s coming down the road from a product or a workload perspective.
“I think we just have been iterating for the last 10 years… I think that’s another critical part of building this infrastructure is that we don’t want the blueprint to ever get stuck. We don’t want to be status quo on this physical and software infrastructure. We want all of these pieces to be evolving over time.”
Steve Herrod: That one touches on a technical challenge. I think it also touches on cultural challenges and we’ve spent a lot of time talking about technical leadership in general, but as you’re reflecting on your own career, how do you think the leadership role itself has changed and the way that you spent your time as the group grew and as the responsibilities grew?
Jay Parikh: So, my role originally started off supporting our infrastructure team. And we’ve talked a lot about that. And that’s the hardware and the software and the data centers and the storage systems and the security systems and data networking. I’d say probably around seven years ago, I picked up another role, which was to head up all of engineering at Facebook. And really, that role was meant to support the evolution and scale of our engineering community worldwide. And I’d say that part of my job has been pretty consistent for the last 10 years, is how do you scale the team, the people, the leadership, continuing to make sure that our recruiting practices and our development practices internally are scaling and consistent and are really helping people grow, learn, but also making sure these teams are being as effective and as successful as possible.
One of the things that always strikes me as pretty unusual and pretty special about Facebook, one of the many things, is that there are a lot of teams where, when you look at them size-wise, you would be surprised at how small they are. Dozens of people are working on something that potentially billions of people use. And I think that type of responsibility, or that type of opportunity to learn and to have that type of impact and to not just be “Hey, here’s my little method that I’m responsible for.” And I’m responsible for this one method out of 10,000 methods in the team.
The other thing, just as an engineer and as a geek, is that the scale is exceptional. It is so fun to see the scale. I thought things were really, really big when I got to Facebook and boy, things looked really small then compared to now. But we have this one system that is really at the core of the family of apps in terms of these social relationships that happen in our family of apps. That one system today handles somewhere around 15 billion queries every second. 15 billion queries every second. This is just one system and we have thousands of these services across Facebook. And that scale is just phenomenal. You think about how that translates across the stack in terms of the number of servers that we have to manage and deal with, the number of data centers, the amount of fiber backbone, networking switches, disk drives, flash drives, it’s pretty profound. And one of the really cool things is that we’ve taken this entire stack that I mentioned to you, the hardware, the software, but we’ve really wrapped in the element of how do we operate this really efficiently too.
Equivalent to Magic
Quentin Clark: We like to bookend every interview with this question. What in your career at Facebook, but even outside, do you look back at and feel that the technology delivered was the most equivalent to magic?
Jay Parikh: So, I will say that when I first got to Facebook in 2009, I never imagined that I would be part of a team that had to design, build and fly an airplane. Several years ago, as part of our connectivity efforts, one of the research projects that we had was to see if we could build this autonomous drone. And this isn’t like your little quadcopter that you might fly around in your backyard. This drone that we designed, built and flew had a wingspan the size of a 737’s. And it was an autonomous flying plane and we had a couple of very successful test flights of that plane. That is one of those moments where, having been at the first flight, when that thing took off and hooked itself into the air and flew around for I think about an hour and a half, it was pretty magical to see such a hard-working team, a small team, take this sketch of an idea and put it all together. And we did this in a matter of a couple of years, whereas typically, this may take 50 to 100% longer.
Jay Parikh: But I will say that there are so many magical moments over the years. And even the stuff that I have talked about in this show is just a small sliver of what the team has actually accomplished. And it’s kind of funny because I always tell the team that our main job is to actually build and to provide the services in a way that it is magical to our product teams so that they can dream up all these amazing experiences to provide features and products to this community of billions of people. And I’m really proud of what the team has done and it has absolutely been this incredible ride of a lifetime for me.