Monday, April 29, 2024

An Engineer Who Keeps Meta’s AI Infrastructure Humming




Making breakthroughs in artificial intelligence these days requires huge amounts of computing power. In January, Meta CEO Mark Zuckerberg announced that by the end of this year, the company will have installed 350,000 Nvidia GPUs—the specialized computer chips used to train AI models—to power its AI research.

As a data-center network engineer with Meta’s network infrastructure team, Susana Contrera is playing a leading role in this unprecedented technology rollout. Her job is about “bringing designs to life,” she says. Contrera and her colleagues take high-level plans for the company’s AI infrastructure and turn those blueprints into reality by working out how to wire, power, cool, and house the GPUs in the company’s data centers.

Susana Contrera


Employer: Meta

Occupation: Data-center network engineer

Education: Bachelor’s degree in telecommunications engineering, Andrés Bello Catholic University in Caracas, Venezuela

Contrera, who now works remotely from Florida, has been at Meta since 2013, spending most of that time helping to build the computer systems that support its social media networks, including Facebook and Instagram. But she says that AI infrastructure has become a growing priority, particularly in the past two years, and represents an entirely new challenge. Not only is Meta building some of the world’s first AI supercomputers, it is racing against other companies like Google and OpenAI to be the first to make breakthroughs.

“We are sitting right at the forefront of the technology,” Contrera says. “It’s super challenging, but it’s also super interesting, because you see all these people pushing the boundaries of what we thought we could do.”

Cisco Certification Opened Doors

Growing up in Caracas, Venezuela, Contrera says her first introduction to technology came from playing video games with her older brother. But she decided to pursue a career in engineering because of her parents, who were small-business owners.

“They were always telling me how technology was going to be a game changer in the future, and how a career in engineering could open many doors,” she says.

She enrolled at Andrés Bello Catholic University in Caracas in 2001 to study telecommunications engineering. In her final year, she signed up for the training and certification program to become a Cisco Certified Network Associate. The program covered topics such as the fundamentals of networking and security, IP services, and automation and programmability.

The certificate opened the door to her first job in 2006—managing the computer network of a business-process outsourcing company, Atento, in Caracas.

“It was a very large enterprise network that had just the right amount of complexity for a very small team,” she says. “That gave me a lot of freedom to put my knowledge into practice.”

At the time, Venezuela was going through a period of political unrest. Contrera says she didn’t see a future for herself in the country, so she decided to leave for Europe.

She enrolled in a master’s degree program in project management in 2009 at Spain’s Pontifical University of Salamanca, continuing to collect additional certifications through Cisco in her free time. In 2010, partway through the program, she left for a job as a support engineer at the Madrid-based law firm Ecija, which provides legal advice to technology, media, and telecommunications companies. She followed that with a stint as a network engineer at Amazon’s facility in Dublin from 2011 to 2013. She then joined Meta, and “the rest is history,” she says.

Starting From the Edge Network

Contrera first joined Meta as a network deployment engineer, helping build the company’s “edge” network. In this type of network design, user requests go out to small edge servers dotted around the world instead of to Meta’s main data centers. Edge systems can deal with requests faster and reduce the load on the company’s main computers.

After several years traveling around Europe setting up this infrastructure, she took a managerial position in 2016. But after a couple of years she decided to return to a hands-on role at the company.

“I missed the satisfaction that you get when you’re part of a project, and you can clearly see the impact of solving a complex technical problem,” she says.

Because of the rapid growth of Meta’s services, her work primarily involved scaling up the capacity of its data centers as quickly as possible and boosting the efficiency with which data flowed through the network. But the work she is doing today to build out Meta’s AI infrastructure presents very different challenges, she says.

Designing Data Centers for AI

Training Meta’s largest AI models involves coordinating computation over large numbers of GPUs split into clusters. These clusters are frequently housed in different facilities, often in distant cities. It’s crucial that messages passing back and forth have very low latency and are lossless; in other words, they move fast and don’t drop any information.

Building data centers that can meet these requirements first involves Meta’s network engineering team deciding what kind of hardware should be used and how it needs to be connected.

“They have to think about how those clusters look from a logical perspective,” Contrera says.

Then Contrera and other members of the network infrastructure team take this plan and figure out how to fit it into Meta’s existing data centers. They consider how much space the hardware needs, how much power and cooling it will require, and how to adapt the communications systems to support the additional data traffic it will generate. Crucially, this AI hardware sits in the same facilities as the rest of Meta’s computing hardware, so the engineers have to make sure it doesn’t take resources away from other important services.

“We help translate these ideas into the real world,” Contrera says. “And we have to make sure they fit not only today, but they also make sense for the long-term plans of how we are scaling our infrastructure.”

Working on a Transformative Technology

Planning for the future is particularly challenging when it comes to AI, Contrera says, because the field is moving so quickly.

“It’s not like there is a road map of how AI is going to look in the next five years,” she says. “So we sometimes have to adapt quickly to changes.”

With today’s heated competition among companies to be the first to make AI advances, there is a lot of pressure to get the AI computing infrastructure up and running. This makes the work much more demanding, she says, but it’s also energizing to see the entire company rallying around this goal.

While she sometimes gets lost in the day-to-day of the job, she loves working on a potentially transformative technology. “It’s pretty exciting to see the possibilities and to know that we are a tiny piece of that big puzzle,” she says.

Hands-on Data Center Experience

For those interested in becoming a network engineer, Contrera says the certification programs run by companies like Cisco are useful. But she says it’s also important not to focus on simply ticking boxes or rushing through courses to earn credentials. “Take your time to understand the topics because that’s where the value is,” she says.

It’s good to get some experience working in data centers on infrastructure deployment, she says, because “getting your hands dirty can give you a lot of perspective.” And increasingly, coding can be another useful skill to develop to complement more traditional network engineering capabilities.

Mainly, she says, just “enjoy the ride” because networking can be a truly fascinating topic once you delve in. “There’s this orchestra of protocols and different technologies playing together and interacting,” she says. “I think that’s beautiful.”

Reference: https://ift.tt/Fxk5hLy
