The lessons learned at Facebook | Cloud Networking Meetup 2018


While the slides come up — I'm Najam Ahmad. How many people here are actually part of building and running large-scale services? It's about 50%. That's cool. One of the things I often say is you can't have cloud without network — try building clouds if you don't have any pipes. So network is what I do; I've been passionate about networks for pretty much my entire career. Fundamentally, networks are what made cloud possible: they deliver the scale that allows a cloud to actually happen.

A few weeks ago I was skimming LinkedIn, just looking for somebody, and found a blog post from an engineer asking what skills network engineers will need in the future — what the implications are in a cloud sort of environment. That got me thinking: what do we actually do, and what do engineers really need going forward? So I started doing some research, searching for what people think, even looking at job descriptions for what people thought the skill sets were. There were just a ton of things — all sorts of different things about technology, a lot of people skills as well. None of it really connected with me. Yes, we all need the technology; we all need to know about containers and things like that. But there's a fundamental philosophy I felt was missing, which is sort of what inspired this talk.
One of the articles had this picture in it. At first blush I thought: wait, what are they trying to tell me? What is this thing? I got it after staring at it for about 30 seconds: oh, you don't want to be the puppet, you want to be the puppeteer. Okay — but what does that mean? How do you translate that into actual things people want to do? So I started really thinking about how we at Facebook approach building large-scale infrastructure, and that's where the "operations first" conversation started and inspired this talk.

To give you first a very quick sense of the scale we deal with from an infrastructure perspective: even though Facebook is not really a cloud provider — we're not selling cloud services per se — we're solving very similar problems, at pretty much similar scales. We do have one advantage: we don't have third-party services running within our infrastructure, which makes things a little simpler, because we deal with internal product groups rather than external customers running code in our environment.
You've all seen these numbers. Instagram, WhatsApp, Messenger — there are several services with a billion-plus users all over the place. The more interesting part is that we're ingesting 300 million photos a day; try replicating that across the network while making sure you don't lose any. The number of posts, the number of events happening — it's millions per minute. Multiply that by the number of users and the numbers get very, very large.

On top of that, the other interesting thing about Facebook is that it's a microservices architecture, so we have a bunch of services split across datacenters, or split within the same datacenter. When a request comes in from your mobile phone and lands at one of our facilities around the world, internally a lot of microservices come together to generate the home page you see on your phone. A lot of this is computed on the fly: when you log in or click on something, much of it is generated at that moment, across a lot of microservices running in our datacenters.

So what is "operations first"? That's the mindset we use, and I'll use our disaster recovery strategy as an example of how we think about operations first.
We started this journey a while ago, maybe six years ago, when the big thing was automation. Automate, automate, automate. Everybody wanted to automate the network infrastructure, from deployments to orchestration to traffic management to all sorts of mitigation capabilities. We used to have this slogan that "engineer" is spelled R-O-B-O-T-S: robots run the network. That's still very much true, very much what we do, and fundamentally how we think about it.

We started out automating CLI-type interfaces. It was pretty arcane, pretty painful. If you look at some parts of the industry, like the optical space, you still have that problem — the optical industry hasn't grown up yet, and that's one of the things we're working to change. We did get better: we could program a lot, and a lot of boxes now have APIs exposed that let you actually do things. We want to get further, soon. But automation alone doesn't cut it anymore at this scale, primarily because of the complexity of the environment and the dependencies you have in running a large-scale service that is also distributed across several facilities. Understanding that level of complexity and that level of dependency is really, really tough.
So one of the principles we use is: think operations first, features second. What that really means is, if you can't operate a particular feature at scale — meaning you don't have enough instrumentation, you don't have the ability to detect failures, you don't have the ability to mitigate entirely through software — don't build it. Or at least don't deploy it. That's one of the things we are steadfast on: if you can't detect and mitigate through software, we're just not going to do it.

Fundamentally, that gets us to detecting failures. One of our goals, which is still way out there even for us, is to detect and mitigate failures within one second. That's really, really hard; even detection within one second is really hard. But fundamentally, if you want to detect failures within seconds and mitigate them, you can't do it with people, period. Any time you've got a human being involved, it's a minimum of 20 minutes. That's your floor, and you go up from there.
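The detect-and-mitigate loop described above can be sketched in a few lines. This is a hypothetical illustration, not Facebook's actual tooling; every name here (`detect`, `mitigate`, `control_loop`, the `error_rate` metric, the 5% threshold) is invented for the sketch.

```python
# Hypothetical sketch of a software-only detect-and-mitigate loop.
# Because both steps are software, reaction time is bounded by the
# sampling interval, not by paging a human (a ~20-minute floor).

def detect(sample, threshold=0.05):
    """Flag a failure when the error rate crosses a threshold."""
    return sample["error_rate"] > threshold

def mitigate(state):
    """Mitigation here is simply draining traffic away from the
    unhealthy element; no human in the loop."""
    state["drained"] = True
    return state

def control_loop(samples, state=None):
    """One pass over a stream of metric samples: detect, then mitigate."""
    state = state or {"drained": False}
    for sample in samples:
        if detect(sample) and not state["drained"]:
            state = mitigate(state)
    return state

# The second sample trips detection, so traffic is drained:
state = control_loop([{"error_rate": 0.01}, {"error_rate": 0.20}])
```

The point of the sketch is the shape, not the numbers: detection and mitigation are one closed loop in software, which is the only way to get reaction times anywhere near a second.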
So philosophically, what we do differently is that we don't try to build stuff that's failproof. I grew up in the ISP world, the traditional network engineering world, which is fundamentally systems integration. We used to take technology from a bunch of different vendors, bring it into the labs, integrate it, test it, and then deploy it. As recently as my last stint at Microsoft, eight or nine years ago, the industry was still in that space. In that environment, what you're really trying to do is find all the integration issues, find all the bugs, and make sure the system is failproof and does not fail.

At this scale, that doesn't work. We're more about failing fast. We're actually happier deploying something that's sort of 80% ready. If you have enough detection and mitigation capabilities built in — you can detect failures and mitigate them, or take the thing offline — you're much better off. So philosophically, we don't try to failproof things; we fail fast.
The other thing this feeds into is that you can't really sit down and architect for very long. Yes, you want to figure out the basic architecture and how things are going to work, but we don't spend a ton of time in the waterfall model — months and months and months on architecture, then development, then testing, then deployment. We build things very quickly, prototype, deploy, and test in an environment that has fewer dependencies on production — and a lot of the time we're able to test in production itself. We iterate over the solutions we build and iron out the problems as we go.
This next one hits most networks all the time. If you talk to people running large networks, they'll say: well, you guys can do it, but we can't, because we have legacy. Well, we're technically 14 years old ourselves, so it's not as if we haven't been running networks for a while. One of the things we do all the time is worry about not creating legacy. We have this thing called tech debt week: all of the teams run a tech debt week every six to eight weeks. They take that time to pay off the debt we've created — the things that may be broken, the things that may not scale, the things that are hindering us. We stop all project work. Everybody gets in a conference room and puts the items on a board — these are the things we're going to do — and crosses them off one by one until we get through it. Don't create legacy.

That also goes into how we handle hardware: we will replace boxes in about three years; we do not hold on to technology if we don't have to. There are differences between optical and IP, but creating legacy, or perpetuating legacy, is one of the things we try to avoid.

So let me dive into how we think about disaster recovery overall.
I'm sure you've heard of Hurricane Sandy; some of you may have lived through it, especially people in New York. When Sandy came through, a bunch of buildings, especially in New York, were flooded. A couple of the POPs in New York went out as well, and there was significant impact to the internet. We started looking at this and realized we had two facilities that were not very far from Sandy's path. So the question for us was: what if the path had been different and our datacenter got flooded — what would we do? In any given facility, you have hundreds of terabits of traffic and thousands of services running on tens of thousands of machines. Can we really afford to lose that? We were scratching our heads saying, hmm, I'm not sure we can actually do this — take tens of thousands of machines and just turn them off.

So we started an exercise called Sandstorm, in October of 2012. The goal was to turn off a datacenter for 24 hours — just disconnect it. It took us about 18 months of preparation. This slide was the task list; there are tons of tools in it.
Months and months of work went in so people could figure out whether we could actually turn one facility off. We planned the event. Even after 18 months, we had to give exceptions to a bunch of services that said: sorry, we can't survive this datacenter actually going out. Most of them were IT services, because our corporate side lives in the same datacenters, and they said: no way, we can't handle this, we're not going to do it. So we gave a bunch of people the day off.

The day itself was really fun. We had a little war room set up, about 23 people in there, everybody sitting with their screens on, watching all the tools we had developed to monitor what was happening. My boss Jay, who runs all of our infrastructure, was there, and we were just watching. Then someone walked by and said: so, you guys have been planning this, this is awesome, great — but you're really not going to do it, are you? And Jay said: no, we're going to. We're just going to do it. We planned it; we might as well do this thing. The guy was horrified: is this going to be a press event?

Anyway, we did it, and it actually worked, without a whole lot of pain. It took hours to take the datacenter down and hours to bring it back up, and for 24 hours one datacenter was out. Still, even after that much planning, you could see all of the choppiness — that needed to go. We've since managed to get to the point where a facility goes out really cleanly, the rest of the datacenters pick up the load, and we're fine.
So the question really is: how do you get there? For us, first you have to commit. Eighteen months is a lot of work, a lot of commitment, and plenty of times people felt, eh, maybe we shouldn't do this. But you have to commit; you have to go do it. That's the operations-first mindset.

Then: tooling, tooling, tooling. Simple things like data collection — do not rely on human beings to collect data. The before and after snapshots of events should all be automated. You have to be able to do that, because it collects the right data so you can actually figure out what happened. In an environment this complex, you cannot do detection by hand, with human beings. You need machine learning to figure out which events are related and in what sequence they occurred, and for that you need accurate data. So tooling is really, really important.

And then: embrace failure. Again, don't try to build failproof stuff — build stuff that fails fast and that you're able to detect.
This is one of the posters we use all the time: "What would you do if you weren't afraid?" We were like: great, let's do it again. We started doing a bunch of storm exercises. If you look at the rhythm, every six months we would run another storm exercise, which was essentially: pick a datacenter, plan for six months, turn it off, keep it off for a day, and come back. All of that was good. So — what do we do next? We built a tool: basically, Jay would roll a die, and whichever datacenter came up, we had 48 hours to turn it off. The only problem was that we very quickly realized this is not a scalable solution. The die has six sides; we have more datacenters. So we said, okay, fine, we'll just do it again. Now Jay essentially picks a datacenter at random and gives us 48 hours to turn it off. And we do that all the time. And when we say a datacenter, we're talking about megawatts of capacity — tens of thousands of machines that just go away.
So what do we gain by doing this? First, learnings — lots of them. You really don't understand your dependencies until you actually lose something. You can do the paper exercises all you want; you will still find conditions you did not understand — things failing and then struggling to recover in another location. There are sometimes conditions you can't recover from, or where it just takes too long to get there. So: tons of learnings about how services depend on each other, how they depend on the network, and what transient events happen when things go up and down. You learn that over and over again. And again — I don't want to beat up on automation too much, but you can't live without it.

The more interesting thing we saw is that this drives better app developer behavior. Because now there are no exceptions: we pick a datacenter, we turn it off in 48 hours, and all you get is a notification. If you're a service owner living in that datacenter, you have at best 48 hours to get out, period. There are no exceptions. And when there are no exceptions — again, the commit part, that you're committed to doing this — you start getting better app behaviors, where services start worrying about detecting failures themselves. I'll give you a pretty fun example.
One of our network maintenances was going on and it took down ads. As you might imagine, we kind of live on ads; that pays our bills. Ads went down for 20 minutes and came back, and we do outage reviews, much like everybody else. We were talking to the ads team and they said: hey, yeah, we didn't detect it, here's what happened — and they explained it. Oh, and by the way, this was caused by a network event. Those of us on the network side were sitting there thinking, okay, we'll explain what happened — and Jay goes: no, no, no. Guys, you had the same problem six months ago. You said you were going to build detection capabilities — what happened? They explained: no, this was a corner case, we didn't anticipate this particular scenario, it cost us, we've fixed it, and it's not going to happen again. We said thank you and moved on. The network was never part of the conversation in that whole outage review.

In the end, if something fails and you get hurt — shame on you, not on whatever broke. Because things will break. That's what the better behavior really is: you worry about what your dependencies are and how they're going to fail, and this whole exercise has made that front and center for a lot of services. That is the really key benefit we get out of this whole exercise.
And this one is sort of obvious: faster, more predictable recovery. That's relevant to anybody who runs infrastructure. One specific example: we name all our outages, and this one was called "tree down", because it was a tree that took us down — I won't explain why. It took out enough fiber; we had a couple of other fiber outages at the same time, the third path was aerial, and a tree took it down. Our Forest City datacenter was running on a very skinny thread at that point. Fun part: the Forest City datacenter also holds our master databases, which just means the whole site is in trouble if it goes.

So the fun part of tree down was that when we detected it, we got to the point of saying: okay, we have massive congestion, we're not going to be able to get into the Forest City datacenter. We notified services, and within 20 minutes all of them were out. Yes, we were down for 20 minutes, but within 20 minutes everybody was out. If that had not been the case, we couldn't have coped — it actually took us eight hours to bring back enough capacity, to fix enough fibers, to bring the datacenter back. Without the ability to drain, we would have been down for eight hours. I'll take 20 minutes over 8 hours any day — but still.
We're starting to do the same thing with our POPs — points of presence in about 68 countries around the world, where we connect to different ISPs. The interesting thing about POPs is that if you take one down, it actually affects the ISPs: the pattern of the traffic we send them changes, so you can cause a lot of havoc on the internet itself by turning POPs off. But we're starting to do it. We have a bunch of partners signed up to test with us; we take one POP out at a time and see what happens. The ultimate goal is to understand the user experience. If you're sitting in Indonesia on a particular subnet and a POP goes out, what's your experience? Today it's really hard to say. Yes, the traffic will move somewhere else, but we want to be able to really characterize it, really be able to mitigate it, and find ways to make that user experience better. That's the next step for this exercise.

All right — to summarize. Automation, automation, automation: that's a given. Understand your dependencies.
As I said, unless you turn things off, you're actually not going to know your dependencies to the extent you need to. So understand your dependencies; understand your services. The interesting thing people often forget is that the tools you use to monitor and troubleshoot also live in the same datacenters — so a service may understand its dependencies but not have the tools available to act on them. We use IRC during outages, and sometimes it's hosted in a datacenter that goes out. That stuff happens. And validation: validation is actually testing it — you need to test and validate what you have running. That's what we usually call "think wrong".

That's all I have. Happy to take questions if you have any.
QUESTIONS IF YOU GUYS HAVE ANY QUESTIONS. SO THIS PLANNING THAT
YOU HAD FOR 48 HOURS, YOU PLANNED
TO TAKE DOWN A DATACENTER, THAT
MEANS WHEN YOU ACTUALLY HAVE A HURRICANE, YOU GUYS
PLAN WITHIN 48 HOURS TO SHUT DOWN
THE DAT E$! A$! E$!ACENTER IS IN THE PATH. >>I’LL JUST REPEAT THE THE QUESTION IS
THAT IF THERE’S A HURRICANE COMING, DO
YOU ACTUALLY PLAN TO TAKE DOWN A
DATACENTER IN 48 HOURS. WE ACTUALLY
HAVEN’T HAD TO LIVE THROUGH A MAJOR
HURRICANE SINCE THAT WAS IN OUR PATH, BUT WE DO THESE EVENTS
SEVERAL TIMES WHERE WE’VE HAD INSTANCES
WHERE TWO CABLES IN THE PACIFIC ARE OUT. THERE’S NOT ENOUGH
CAPACITY AND WE WOULD DRAIN A DATACENTER AND SAY THIS IS NOT
WORTH TAKING THE RISK OF REALLY
TAKING IT OUT SO WE’LL JUST DRAIN
IT AND PUSH IT SOMEWHERE ELSE. SO WE DO DO THOSE
PLANNED EVENTS IN RESPONSE TO OTHER
EVENTS THAT ARE HAPPENING. CAN I REQUEST FOLKS TO
>> Yeah, just one quick one: multi-site failover — is that automated?

>> The failover is per service. Each service worries about its own failover, and different services have somewhat different environments, but it's mostly automated. Some services are easier: if you look at the front end, the web services, it's fully automated — if a front-end cluster goes out, nobody does anything; traffic just moves away from it and things go on. If you're talking about databases, it's a little more carefully planned, or at least a watched event.

>> I'm not sure how you would decide whether you want to fail over locally versus geographically?

>> In this particular scenario, when your whole datacenter is going down, you're not failing locally — you have to fail to another geographic location. So you have both scenarios in this case, but most of the time you're failing over to a different geography.
>> So fail-fast is clearly the right way to go. I'm curious about two things and how you do them at Facebook. One: is the rollback process totally automated, or are operations people doing it?

>> I'll speak for the network. The network philosophy is essentially to drain things and take them out — we build a lot of parallel capacity. We don't try to do rollbacks, because rollbacks tend not to be consistent. I worry about the 1% failure: it works 99% of the time, and it's the 1% pathological case you get stuck with. So we have a lot of tools built that are designed just to drain. There are auto-drains: when certain conditions are met, they'll drain a bunch of devices. Services do a mixture — some services have people go in and move things out, and, as I said, the web-services-type stuff is all automated.

>> And the other part was: let's say you're rolling out a feature and the feature is already out there. Now you're in a bind, because if you roll back you can't suddenly take the feature away. How do you handle that?

>> Feature gating is the way we do it most of the time; that's the safest way. On the infrastructure side, we have a slow deployment methodology: we'll deploy in very small places first, in a few places, much like every other large infrastructure, and then go exponential — so if you have to turn it off, it's a small incident. And at the service level, where users see it — which is part of why we have the microservices architecture — we will take individual services and turn them off at times. You'll sometimes see that Facebook is fine but chat is not; that's because we took chat out.
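The feature gating described in that answer is commonly implemented by deterministically bucketing users and comparing against a rollout percentage. This is a generic sketch of that pattern, not Facebook's actual gating system; the function name, feature name, and hashing scheme are illustrative.

```python
import hashlib

# Generic sketch of percentage-based feature gating: "rolling back" a
# user-visible feature means dialing the percentage to 0, with no code
# rollback at all.

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically bucket a user into 0-99 and compare to the dial.

    Hashing feature+user together gives each feature an independent,
    stable bucketing: the same user always gets the same answer for a
    given dial setting, so the experience doesn't flicker.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# Dial up exponentially (1% → 2% → 4% → 8% ...), watching for problems
# at each step; any step can be undone by lowering the dial.
for percent in (1, 2, 4, 8):
    enabled = in_rollout("user-42", "new_chat_ui", percent)
```

Because the bucketing is monotone in `percent`, every user enabled at 2% stays enabled at 4% — the exponential ramp only ever adds users, which keeps each step a small, reversible increment.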
>> Najam, thank you for the talk. I was wondering: are you thinking about taking more than one datacenter down?

>> Yeah, that's something we've talked about. We do look at dual failures. The conversation really becomes about efficiency, because one region for us is already fairly large — 150-megawatt-class. If you take two out, you're taking a huge chunk of capacity out at the same time. So we haven't really tried to take two out at a time; that just seems like a huge impact. We build instead on a placement strategy: facilities are not so close to each other that they might get hit by the same event. We also worry about their power grids — where the power is coming from, which grid we're on, things like that. The power grid is not the cleanest answer, because grids are pretty bad and arcane in a bunch of places, but we do worry about it. So the dual-failure scenarios are more a paper exercise, or a thought exercise, than a tested one. Thank you.

>> Najam, thanks for the insights today. How does the network react to a security event — an attack, say? How do you react to it?

>> Security is a whole different ball game. There are a bunch of things we've done in the security space; if you go to one of the @Scale conferences, you'll find a bunch of presentations around that. But fundamentally, what we're trying to do is build capabilities to block things at the edge, more and more, and at multiple levels. At the service level: all requests come in through our global load balancer — load-balancing software we developed ourselves — and there's a lot of ability built into that system to detect anomalies and filter things out. So we can filter at the service level, and then we have tools that can filter at the network level as well. That's really the primary way you want to protect yourself. The other thing that happens is that attacks usually don't make it to the datacenters: if they hurt, they hurt the POPs at the edge, because that's where they enter our network and where we usually stop them. Worse comes to worst, we'll shut the POP down.
>> Coming back to the same example you gave: what guidance does the network team give to the application teams? There's new network coming, new security policies, new growth policies — how do you interact with the application teams? Is it on an ongoing basis, or do you have a weekly conference?

>> It's actually not as often as you might think. I often say that most of the time, all the services know about the network is that they put a packet on the net and hope to God it gets to the other end — that's about all they know. The other thing they know about the network now is that if the packet doesn't get there, it is their problem to detect and mitigate. That's sort of how we leave it. I'm not a huge fan of trying to stay in sync with the service groups at a more granular level, primarily because we add 300 to 400 engineers a month, and trying to educate that many engineers all the time is an impossible task. If you try to build that, you're just not going to get there. The less granular, more abstract concept — this is what the network is, and this is what it gives you — is usually better.
>> You mentioned how Facebook has WhatsApp, Instagram, and Messenger — different apps, some of which you acquired and didn't develop in-house. Is all of that on the same infrastructure now?

>> For the most part, yes. Our strategy usually is that we don't force integration at the start. I'll give you Instagram as an example. When Instagram was first acquired, they started living in our buildings but weren't using our infrastructure. Over time they started seeing the need to actually use it, and we ran a program called "Instagration" and moved them over to our infrastructure. If we don't build an infrastructure that a third party thinks it can use and benefit from, they won't come. WhatsApp was another one. Culturally they were like: hey, we do it our way. So they took a long time getting to the point of using the services Facebook had, but now they're all on it. We try to convince them by showing them what gains they can get out of it. For example, WhatsApp — I'll give you a simple one. For voice calls in India, they were using a service from a company based in London, so the voice calls were going all the way to the UK and coming back to India. The quality obviously suffered. We said: hey guys, we have caches, we have boxes we can deploy, not a problem — we'll deploy ten of them. They tried a few and said: wait, can you get us more? Pretty soon we had hundreds of caches deployed to handle voice calls in India for WhatsApp. So that's our strategy to win them over.
>> I'm sure for the broader infrastructure there were probably a lot of platform changes you had to make to offer the features they wanted to use. From the network perspective, were most of the changes you had to make about scale, or also about features — things they couldn't do on the Facebook network infrastructure?

>> Yeah — we try not to go too custom. It's the complexity-versus-simplicity conversation: the more complexity you create, the harder it is to manage at large scale. We try to build abstractions — these are the services — and then we'll work with individual product groups to distill their particular requirements down into a service the network can provide, and keep that interface. As I said, we try not to micro-interface with apps or product groups very often, because to me that just doesn't scale, and you end up in all sorts of weird scenarios you don't understand at that level. Thank you.

>> Thanks. Awesome — thank you. Thank you, Najam. We now have a 15-minute break, until 10:45, and we'll resume after that.
