#SheCodes – How Big Data Fuels LinkedIn’s “People You May Know”

On September 26, 2017, nearly 200 data professionals gathered at LinkedIn headquarters in Sunnyvale to hear a presentation by seven of the teams that work in a massive, coordinated effort to bring us the “People You May Know” (PYMK) section of the LinkedIn web page.


Kapil Surlaker, Senior Director of Engineering at LinkedIn, began the presentation with an overview of the functional areas from the product side, data side, and infrastructure and platform space that contribute to the PYMK product. On the product side, there are product managers, developer teams and test engineers at work to create the applications for LinkedIn’s 500+ million members.

The data infrastructure and analytics platform space is represented by teams of systems and infrastructure developers, SREs, and operations teams, all of whom contribute to solving the enormous challenge of scaling the platforms. Finally, there are data analysts, data scientists, and relevance engineers who work to make each member’s user experience on the LinkedIn web page more compelling.

Hema Raghavan, Senior Manager and Head of Machine Learning for Growth at LinkedIn, introduced us to the PYMK product. Its mission: “To connect members to the people who matter most to them professionally, enabling them to access opportunities within the LinkedIn ecosystem.” PYMK gives members a nearly effortless way to grow their networks, thereby creating more opportunities and access to industry knowledge.


Mina Doroud is a Staff Data Scientist on the Analytics team at LinkedIn. The Analytics team ensures that any changes to PYMK are data driven, true to the values of LinkedIn and, most of all, create a good member experience. Higher-quality connections create a better member experience, and the better the experience, the more frequently members return to the LinkedIn site. The goal is to present each member with the highest-quality PYMK candidates, so they are more likely to request a connection. Two important metrics used to study this are “Invitations sent and accepted” and “Invitations received and accepted.” In both cases, the team pays particular attention to members who send requests that are not accepted, or who receive invitations but accept few of them. This matters because every new connection could introduce 1 to 32 jobs, 12 companies and 748 people.

Heloise Logan is a Staff Software Engineer working on machine learning and recommendation systems. She discussed how the PYMK framework was built. Candidate generation is a big part of PYMK. The objective is to generate a list of candidates ranked by the probability that the member will click to send a connection request. Learning is achieved by looking at the member profile, network and activity data. Heloise then went on to describe the offline and online systems that comprise the PYMK architecture. Offline systems are time and resource intensive. A social graph is used to identify possible 2nd degree candidates, and an economic graph is explored to find candidates that are outside of the social graph. Models are applied and then all pairs are scored and ranked. When a member adds a new connection, or has explored some other aspect of the LinkedIn site before coming to the PYMK page, those data are used to generate contextual candidates that are processed by the online system. Offline and online candidates are merged, re-scored and re-ranked, and the resulting list is presented to the member on their People You May Know tab. If a member likes a candidate, they are more likely to request a connection, which in turn feeds the “Invitations sent and accepted” metric.
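
The candidate flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not LinkedIn's implementation: the toy graph, the function names second_degree_candidates and merge_and_rank, the "share of common connections" score, and the rule of keeping the higher of the offline and online scores are all hypothetical stand-ins for the real models.

```python
from collections import Counter

def second_degree_candidates(graph, member):
    """Count common connections for every 2nd-degree contact of `member`."""
    first_degree = graph.get(member, set())
    common = Counter()
    for friend in first_degree:
        for candidate in graph.get(friend, set()):
            # Skip the member and anyone they are already connected to.
            if candidate != member and candidate not in first_degree:
                common[candidate] += 1
    return common

def merge_and_rank(offline_scores, online_scores, top_k=3):
    """Merge both candidate sets, keep the higher score per candidate, re-rank."""
    merged = dict(offline_scores)
    for candidate, score in online_scores.items():
        merged[candidate] = max(score, merged.get(candidate, 0.0))
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]

# Hypothetical toy social graph: member -> set of 1st-degree connections.
graph = {
    "ana": {"bo", "cy"},
    "bo": {"ana", "dee", "eli"},
    "cy": {"ana", "dee"},
    "dee": {"bo", "cy"},
    "eli": {"bo"},
}

common = second_degree_candidates(graph, "ana")
# Toy "offline" score: share of ana's connections who also know the candidate.
offline = {c: n / len(graph["ana"]) for c, n in common.items()}
# Toy "online" contextual candidates (assumed scores, e.g. from recent activity).
online = {"eli": 0.9, "fay": 0.4}

ranking = merge_and_rank(offline, online)
print(ranking)
```

In a real system the scoring step would be a trained model over profile, network and activity features; the sketch only shows the shape of the pipeline: generate, score, merge, re-rank.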

The next presenter, Min Liu, a Senior Software Engineer, discussed A/B testing at LinkedIn. She described her role as half statistician and half engineer: the statistician side develops the methodologies used in A/B testing, and the engineering side implements them. LinkedIn measures success based on how many connections users are making. A/B testing is used to measure and quantify feature impact and to improve the user experience. Not only is each experiment evaluated automatically against more than 2,000 metrics, the results are also sliced by dimensions such as user segment and geolocation for more granular visibility. The “test everything” culture at LinkedIn is evident in the 300+ A/B tests that run concurrently every day. Trillions of tracking events are sent through the platform every day. Despite this scale, the first A/B report is generally available within two hours and is updated hourly after that.
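
To make "measure and quantify feature impact" concrete, here is a minimal sketch of one common way to compare a single metric between control and treatment: a two-proportion z-test. The talk did not say which statistical methods LinkedIn's platform uses, so the choice of test, the function name, and the counts below are all assumptions for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (a) and treatment (b).

    Returns (absolute lift, z statistic, two-sided p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups behave the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value

# Hypothetical counts: members who sent at least one invitation, per group.
lift, z, p_value = two_proportion_z_test(1000, 10_000, 1100, 10_000)
print(f"lift={lift:.3f} z={z:.2f} p={p_value:.4f}")
```

Slicing by user segment or geolocation amounts to running the same comparison on each subset of members, which is also why a platform evaluating thousands of metrics has to be careful about multiple comparisons.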

Navina Ramesh is a Senior Software Engineer on the data infrastructure team, which is responsible for the collection, processing and accessibility of data. Data must be processed as soon as it arrives and also made available to analysts for offline processing. It flows through the infrastructure as follows. More than ten thousand tracking events (page views, clicks, etc.) and metrics (behavior after exposure to an A/B test) are generated per second. They need to be gathered from points around the globe and transported with low latency and high reliability. Events must then be processed with high reliability and fault tolerance, and moved seamlessly from offline processing systems to real-time online services, where they must be easy to query.

Two key systems were developed in-house and are part of the massive infrastructure machine. Apache Kafka is an open-source stream processing platform. It is the distributed backbone that transports data throughout the system. All tracking events are ingested to Kafka. Apache Samza is a stream processing framework that was developed at LinkedIn. Data comes into Samza from Kafka and other sources. Over 220 applications rely on Samza, and most of those require stateful processing.
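
"Stateful processing" means a task remembers something between events instead of handling each one in isolation. The sketch below shows the idea in plain Python with an in-memory list standing in for a Kafka topic. It does not use the Kafka or Samza APIs; the InvitationCounter class, the run helper, and the event fields are hypothetical names chosen for the example.

```python
from collections import defaultdict

class InvitationCounter:
    """Keeps per-member counts across events, as a stateful stream task would."""

    def __init__(self):
        # Local state, keyed by member. A real framework would also persist
        # this state so it survives restarts; here it lives only in memory.
        self.state = defaultdict(lambda: {"sent": 0, "accepted": 0})

    def process(self, event):
        counts = self.state[event["member"]]
        if event["type"] == "invite_sent":
            counts["sent"] += 1
        elif event["type"] == "invite_accepted":
            counts["accepted"] += 1
        # Other event types (page views, clicks) leave the counts unchanged.
        return event["member"], dict(counts)

def run(stream, task):
    """Feed events to the task one at a time, in arrival order."""
    for event in stream:
        yield task.process(event)

# Hypothetical tracking events, standing in for messages read from a topic.
events = [
    {"member": "ana", "type": "invite_sent"},
    {"member": "ana", "type": "invite_sent"},
    {"member": "bo", "type": "invite_sent"},
    {"member": "ana", "type": "invite_accepted"},
    {"member": "bo", "type": "page_view"},
]

task = InvitationCounter()
outputs = list(run(events, task))
print(outputs[-2])  # ana's running counts after her acceptance event
```

The running counts are exactly the kind of state that cannot be computed from a single event, which is why most applications built on a stream processor end up needing it.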

Suja Viswesan, Senior Engineering Manager for the big data platform, discussed how analytics are done at LinkedIn. She began by sharing the history of the People You May Know product. Nine years ago, LinkedIn had 40-50 million members. It took six months to do the computations necessary for PYMK. They were able to scrape together the resources to purchase a twelve-node Hadoop cluster that enabled them to run the same computations in two weeks.

Apache Gobblin is the platform that ingests data from all sources and stores it in HDFS. The data is made available for analytics through Dali, a logical data access layer that gives users a seamless experience no matter what changes are being made under the hood. Three additional systems have been developed in-house to help manage the 2.3 trillion messages that the LinkedIn machine processes daily.

Cubert is a computational engine developed in-house and used along with Spark, Hive, Pig and Presto. Azkaban is the LinkedIn-developed, open-source workflow management engine that keeps all of the plates spinning. And finally, Dr. Elephant provides visibility into how to tune 200,000+ jobs that run daily in the Big Data ecosystem. This tool helps LinkedIn scale people and systems.

The final presenters for the evening were Sandhya Ramu, Director of Big Data SREs, and Savitha Ravikrishnan, Senior SRE. They co-presented on the role of system reliability engineering at LinkedIn.

More than 200,000 jobs run on 10+ clusters every day. As the member base continues to grow, core principles must be in place. The number one priority, which guides all other decisions, is that the LinkedIn site must be up and running. In support of that priority, developers are given self-serve tools with guard rails so that they can keep working at a fast pace. Any issues must be isolated and debugged as quickly as possible.

Savitha described the competencies needed to keep the Hadoop infrastructure healthy. Break/fix automation checks that the more than 9,000 hosts are healthy and reports any issues. The configuration management system helps manage the configurations of HDFS, YARN and MapReduce. Software is upgraded and restarted in a rolling fashion so the site never has to be taken down. Monitoring checks are in place for all functionality 24×7.
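
A rolling restart can be sketched as a loop that touches only a few hosts at a time and stops the moment something looks wrong. The code below is a generic illustration, not LinkedIn's tooling: the rolling_restart function, the batch size, and the restart and is_healthy callbacks are hypothetical placeholders for whatever a real deployment system provides.

```python
def rolling_restart(hosts, restart, is_healthy, batch_size=2):
    """Restart hosts batch by batch; halt as soon as a restarted host is unhealthy.

    Returns (hosts restarted and verified healthy, hosts that failed the check).
    """
    done = []
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        for host in batch:
            restart(host)
        failed = [host for host in batch if not is_healthy(host)]
        if failed:
            # Stop here so the remaining hosts keep serving on the old version.
            return done, failed
        done.extend(batch)
    return done, []

# Hypothetical stand-ins for real restart and health-check calls.
restarted = []
hosts = ["h1", "h2", "h3", "h4", "h5"]
done, failed = rolling_restart(
    hosts,
    restart=restarted.append,
    is_healthy=lambda host: host != "h4",  # pretend h4 fails after its restart
)
print(done, failed, restarted)
```

Because the loop stops at the first failed batch, the untouched hosts keep serving traffic, which is what allows upgrades without taking the site down.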

Security is taken very seriously at LinkedIn. New user onboarding and access revocation are managed through LDAP (Lightweight Directory Access Protocol), and anyone accessing Hadoop (or any of the services on its stack) must be authenticated by Kerberos.

The evening concluded with a Q&A session with all presenters.

Women in Big Data would like to thank LinkedIn and all of the presenters and organizers for hosting this amazing evening. LinkedIn is a model of excellence that we can all learn from.
