Tuesday, 8 September, 2020 UTC


Summary

Having built the early prototype for the Proxycurl API, which turns LinkedIn profiles into JSON, I learnt a little about how one might scrape public LinkedIn profiles at scale. In this tutorial, I will share my experience building a LinkedIn profile scraper that works in 2022, and I hope you will find it useful.
PS: You can turn Linkedin profiles into JSON with Proxycurl API.
To put this tutorial in context, we will preface it with the problem of:
How to scrape 1 million Linkedin profiles, and then parse the HTML content into structured data?
Breaking down the problem:
  1. How to crawl a million Linkedin profiles and fetch their on-page HTML content
  2. How to parse the HTML content from a public Linkedin profile to structured data
Part 1: How to scrape 1M public Linkedin profiles for HTML code
Before we embark on the quest to scrape a million profiles, let's start with crawling ten profiles. There are only two ways to crawl ten Linkedin profiles for scraping:
  1. As a user logged into Linkedin. (A "logged in user")
  2. Or, as a user that is not logged into Linkedin. (An "anonymous user.")

1A: Accessing Linkedin profiles as an anonymous user

It requires luck to access a Linkedin profile without being logged into Linkedin.
In my experience, you might be able to access the first profile as an anonymous user if you have not recently clicked into any Linkedin profiles.
Even if you succeed in viewing a public profile anonymously on your first attempt, more likely than not you will be greeted with the dreaded Authwall on your second profile visit.
What is the Authwall and how do you circumvent it?
The Authwall exists to block web scraping from users who are not logged into Linkedin.
  • If you visit a public profile from a non-residential IP address, such as from a data center IP address, you will get the Authwall.
  • If you visit a public profile without any cookies in your browser session (aka incognito mode), you will get the Authwall.
  • If you are visiting a public profile from a non-major browser, you will get the Authwall.
  • If you are visiting a public profile multiple times, you will get the Authwall.
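Whatever the trigger, a crawler has to notice that it was served the Authwall instead of a profile. Here is a minimal sketch, assuming the Authwall shows up as a redirect to a linkedin.com URL whose path starts with /authwall. That is an assumption about LinkedIn's behaviour that may change, and `is_authwall_url` is a hypothetical helper name:

```python
from urllib.parse import urlparse


def is_authwall_url(final_url: str) -> bool:
    """Return True if the URL we ended up on looks like the Authwall.

    Assumption: the Authwall is served from a path beginning with
    "/authwall" on a linkedin.com host. Check this against real responses.
    """
    parsed = urlparse(final_url)
    host = (parsed.hostname or "").lower()
    on_linkedin = host == "linkedin.com" or host.endswith(".linkedin.com")
    return on_linkedin and parsed.path.startswith("/authwall")
```

You would call it with the final URL after redirects, and treat a True result as "this fetch did not return a profile".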
There are many reasons that you will be greeted with the Authwall when you are crawling anonymously. But there is one way you can reliably bypass it -- crawl Linkedin as Googlebot. If you can access a Linkedin public profile page from an IP address that belongs to Google, you can consistently fetch an available Linkedin profile without the Authwall.
What does an IP address from Google mean?
It is an IP address whose reverse DNS lookup resolves to a hostname under googlebot.com (*.googlebot.com). See this Google support page for a clear definition. And no, IP addresses from Google Cloud instances do not work.
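To make "resolves reversely" concrete, here is a sketch of the check a server could run on a visitor's IP address: a reverse DNS lookup, a test of the hostname suffix, then a forward lookup to confirm that the hostname maps back to the same IP. The forward-confirmation step and the function names are my own additions for illustration. The lookups are injectable so the logic can be exercised without network access:

```python
import socket


def is_googlebot_hostname(hostname: str) -> bool:
    """True if hostname is googlebot.com or a subdomain of it."""
    name = hostname.rstrip(".").lower()
    return name == "googlebot.com" or name.endswith(".googlebot.com")


def _reverse_lookup(ip: str) -> str:
    # gethostbyaddr returns (hostname, aliases, addresses)
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(hostname: str) -> list:
    # gethostbyname_ex returns (hostname, aliases, IPv4 addresses)
    return socket.gethostbyname_ex(hostname)[2]


def resolves_to_googlebot(ip, reverse_lookup=_reverse_lookup,
                          forward_lookup=_forward_lookup) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-confirm."""
    try:
        hostname = reverse_lookup(ip)
        if not is_googlebot_hostname(hostname):
            return False
        return ip in forward_lookup(hostname)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

Note that the suffix check compares against ".googlebot.com" with the leading dot, so a lookalike such as "evilgooglebot.com" does not pass.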
But, there is one page on Linkedin that you can crawl without restrictions
Put yourself in the shoes of a LinkedIn executive. What makes you money? Profile data. Which is why the Authwall is used to lock up profile data.
What else makes Linkedin money? Jobs! Linkedin makes money when companies list jobs on Linkedin. These companies will return to Linkedin again and again if Linkedin succeeds at matching great candidates to their job postings.
To maximize page views, job listings on LinkedIn are not blocked by the Authwall.

1B: Accessing Linkedin profiles logged into Linkedin

You and I are probably not Googlers, which means we do not have access to the range of addresses belonging to Googlebot. But there is respite.
You can log into Linkedin to reliably access Linkedin profiles. However, as tempting as it may be, I highly recommend that you not use your personal Linkedin profile to perform a bulk profile crawl for scraping purposes. You do not want your personal Linkedin profile to be blocked.
And it will be blocked should you scrape past a certain threshold or when Linkedin detects abnormal (automated) behavior in your account.
But yes, log into your Linkedin profile, and you can crawl ten profiles with no problems. And that brings me to the next section -- getting from 10 profiles to 1M profiles.
Can I crawl 1M Linkedin profiles to scrape by creating many Linkedin accounts?
It is only natural to veer towards the belief that you can build a Linkedin scraper if you manage a pool of disposable Linkedin accounts. You are not wrong. Building a pool of workers with disposable Linkedin accounts is indeed a feasible method if and only if humans meticulously manage each Linkedin account.
Once you begin automated crawls on any LinkedIn account, you will start encountering random reCAPTCHA challenges, and an account stays locked until its challenge is solved.
Each Linkedin account in your scraping pool will also require a unique residential IP address.
The short answer is yes. You can crawl 1M Linkedin profiles with many Linkedin accounts with residential IP addresses.
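The bookkeeping behind such a pool can be sketched as a small data structure. This is only an illustration of the constraints described above (one residential IP per account, accounts that lock until a human solves a challenge). The class names, fields, and round-robin policy are hypothetical choices of mine, not a description of any production system:

```python
from dataclasses import dataclass


@dataclass
class Worker:
    account: str          # hypothetical account identifier
    residential_ip: str   # the one IP this account is pinned to
    locked: bool = False  # True while a challenge is waiting for a human


class WorkerPool:
    def __init__(self, workers):
        ips = [w.residential_ip for w in workers]
        if len(set(ips)) != len(ips):
            raise ValueError("each account needs its own residential IP")
        self.workers = list(workers)
        self._cursor = 0

    def set_locked(self, account: str, locked: bool) -> None:
        """Mark an account as locked (challenge hit) or usable again."""
        for worker in self.workers:
            if worker.account == account:
                worker.locked = locked

    def next_worker(self):
        """Round-robin over unlocked workers; None if every one is locked."""
        for _ in range(len(self.workers)):
            worker = self.workers[self._cursor]
            self._cursor = (self._cursor + 1) % len(self.workers)
            if not worker.locked:
                return worker
        return None
```

When `next_worker()` returns None, every account is waiting on a human, which is exactly the management burden described above.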

Recap: What you need to do to crawl 1M profiles

The first step to scraping is to get the HTML code of profiles at scale. In this article, we put a number to "scale": one million profiles. There are only a few ways to crawl 1M LinkedIn profiles, and they are:
  1. Access LinkedIn from an IP address that resolves as Googlebot
  2. Manage a large pool of workers, each logged in as an individual LinkedIn account, with each account sitting on its own residential IP address
  3. Use Proxycurl API -- see the next section.

Using Proxycurl API to enrich 1M Linkedin profiles

Proxycurl is an offering we built that provides a managed service to turn LinkedIn profile URLs into structured JSON data.
If you ask me which is the best way to scrape Linkedin profiles, then I will tell you in a very biased way to use Proxycurl's API. Specifically, the Person Profile Endpoint. Our Person Profile Endpoint takes a LinkedIn profile URL and returns you the structured data of the public profile.
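As a rough sketch of what a call to the Person Profile Endpoint might look like: the endpoint URL, the `url` query parameter, and the Bearer-token header below are assumptions written from memory, so please confirm them against the current Proxycurl documentation before relying on them. The function only builds the request and does not send it:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: endpoint URL as I remember it; verify against the API docs.
PERSON_PROFILE_ENDPOINT = "https://nubela.co/proxycurl/api/v2/linkedin"


def build_profile_request(profile_url: str, api_key: str,
                          endpoint: str = PERSON_PROFILE_ENDPOINT) -> Request:
    """Build (but do not send) a GET request for one LinkedIn profile URL."""
    query = urlencode({"url": profile_url})
    return Request(
        f"{endpoint}?{query}",
        # Assumption: the API key is passed as a Bearer token.
        headers={"Authorization": f"Bearer {api_key}"},
    )
```

Sending it would then be a matter of passing the request to `urllib.request.urlopen` and decoding the JSON body.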
Part 2: I have HTML code of a profile page, how do I scrape content off it?
Now that you have 1M profiles, it is time to get the content out of the HTML code into structured data. To convert HTML pages to structured data is what I define as "parsing." Crawling profiles gets you a bunch of pages as HTML code. Parsing turns pages of HTML code into machine-readable structured data, like this:
{
'accomplishment_courses': [],
'accomplishment_honors_awards': [{'description': 'Nanyang Scholarship '
                                                'recognizes students who '
                                                'excel academically, '
                                                'demonstrate strong '
                                                'leadership potential, and '
                                                'possess outstanding '
                                                'co-curricular records.\n',
                                 'issued_on': {'day': None,
                                               'month': None,
                                               'year': 2015},
                                 'issuer': 'Nanyang Technological University',
                                 'title': 'NANYANG Scholarship'},
                                {'description': 'Awarded to students with '
                                                'exceptional results in '
                                                'Physics and Mathematics',
                                 'issued_on': {'day': None,
                                               'month': None,
                                               'year': 2015},
                                 'issuer': 'Defence Science & Technology '
                                           'Agency',
                                 'title': 'Young Defence Scientist Programme '
                                          '(YDSP) Academic Award'},
                                {'description': 'An annual competition to '
                                                'encourage the study and '
                                                'appreciation of Physics as '
                                                'well as highlight Physics '
                                                'talent.',
                                 'issued_on': {'day': None,
                                               'month': None,
                                               'year': 2012},
                                 'issuer': 'Institute of Physics Singapore',
                                 'title': 'Singapore Junior Physics Olympiad '
                                          '(Main Category) Honourable '
                                          'Mention'},
                                {'description': 'Certificate awarded to '
                                                'student who topped the '
                                                'cohort in all aspects of '
                                                'Science.',
                                 'issued_on': {'day': None,
                                               'month': None,
                                               'year': 2010},
                                 'issuer': 'Xinmin Secondary School',
                                 'title': 'Certificate of Excellence - Top '
                                          'in Science'},
                                {'description': None,
                                 'issued_on': {'day': 1,
                                               'month': 9,
                                               'year': 2018},
                                 'issuer': 'Nanyang Technological University',
                                 'title': "Dean's List FY17/18"},
...
'volunteer_work': []}
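Notice that the dates in this structure can be partial: some awards only carry a year. Here is a small sketch of how an application might consume that shape. `format_partial_date` is a hypothetical helper, and the two records are abbreviated from the output above:

```python
def format_partial_date(issued_on):
    """Render a {'day', 'month', 'year'} dict as YYYY, YYYY-MM or YYYY-MM-DD."""
    if not issued_on or issued_on.get("year") is None:
        return None
    parts = [f"{issued_on['year']:04d}"]
    if issued_on.get("month") is not None:
        parts.append(f"{issued_on['month']:02d}")
        # A day is only meaningful when the month is known too.
        if issued_on.get("day") is not None:
            parts.append(f"{issued_on['day']:02d}")
    return "-".join(parts)


# Two records abbreviated from the parsed profile shown above.
awards = [
    {"title": "NANYANG Scholarship",
     "issued_on": {"day": None, "month": None, "year": 2015}},
    {"title": "Dean's List FY17/18",
     "issued_on": {"day": 1, "month": 9, "year": 2018}},
]

for award in awards:
    print(award["title"], format_partial_date(award["issued_on"]))
```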

Two ways to parse content from HTML code

There are two ways to scrape content from the HTML page, and the approach to take depends entirely on how the page is crawled.
Two factors decide which is the best method to use:
  • Is on-page javascript parsed before the HTML code of the profile page is collected?
  • Is the profile viewed as an anonymous user or as a user logged into Linkedin?
Method matrix for your reference
                        | Anonymous user | Logged into Linkedin
Javascript not rendered | Dom Scraping   | Code Chunk Scraping
Javascript is rendered  | Dom Scraping   | Dom Scraping
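The matrix boils down to a single condition, which can be written as a tiny function. The function name and the returned labels are my own:

```python
def choose_parsing_method(javascript_rendered: bool, logged_in: bool) -> str:
    """Pick a parsing method from how the page was crawled.

    Code Chunk Scraping only applies to pages fetched while logged in
    and before javascript has been rendered; everything else is Dom Scraping.
    """
    if logged_in and not javascript_rendered:
        return "code_chunk"
    return "dom"
```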

Dom parsing

Dom parsing is the standard method that most developers use for web scraping. You can find the data within fixed HTML tags on a page that is loaded and rendered. You can fetch most content of a profile page by traversing HTML tags via either CSS selectors or XPath.
The problem is that the layout of the HTML pages is updated often. Layout also varies according to locale: a profile loaded in an Arabic locale will differ in layout from a profile loaded in English. Every time something changes, expect your scraper to break. Dom scraping is easy to implement but high maintenance.
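To illustrate both the approach and its fragility, here is a dependency-free sketch using Python's built-in `html.parser` (in practice a library such as beautifulsoup4 is more convenient). The sample markup and the class name `top-card__name` are made up for the example and are not LinkedIn's real markup:

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while inside a matching element
        self.results = []   # one string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def extract_by_class(html, css_class):
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return [" ".join(text.split()) for text in parser.results]


SAMPLE = ('<main><h1 class="top-card__name">Jane <b>Doe</b></h1>'
          '<h2 class="top-card__headline">Engineer</h2></main>')
print(extract_by_class(SAMPLE, "top-card__name"))
```

If the site renames that class, the same call silently returns an empty list, which is the maintenance cost described above.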

Code Chunk Scraping

Code Chunk Scraping is a superior method reserved for profile pages fetched as a logged-in user, before javascript is rendered. It is a better method because it does not depend on the HTML dom structure -- and that means that page layout changes on Linkedin will not break this scraping method. Instead, it looks at the data placed in-page within <code></code> tags. These blobs of JSON data are used by Linkedin's javascript code to populate the page's dom elements. With the Code Chunk scraping method, you traverse JSON objects instead of Dom elements.
Because the JSON blob data is already stored in a structured manner, we do not have to tokenize strings to re-structure the data; we can return it as it is. That means you do not need to parse "12th March 2020" into a machine-readable Date object.
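Pulling those blobs out of a page can be sketched with the standard library alone. I am assuming here that each blob is the full text content of a <code> element and that it may be HTML-escaped (for example `&quot;` for quotes); the sample page is invented for the illustration:

```python
import json
from html.parser import HTMLParser


class CodeBlobParser(HTMLParser):
    """Collect the JSON objects found inside <code> elements."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities such as &quot;
        # arrive in handle_data already unescaped.
        super().__init__()
        self.in_code = False
        self.buffer = []
        self.blobs = []

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self.in_code = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_code:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "code" and self.in_code:
            self.in_code = False
            try:
                blob = json.loads("".join(self.buffer).strip())
            except ValueError:
                return  # this <code> element did not hold JSON
            if isinstance(blob, dict):
                self.blobs.append(blob)


def extract_code_blobs(html):
    parser = CodeBlobParser()
    parser.feed(html)
    parser.close()
    return parser.blobs


SAMPLE_PAGE = ('<code style="display: none">{&quot;included&quot;: '
               '[{&quot;title&quot;: &quot;Widget&quot;}]}</code>'
               '<code>not json</code>')
blobs = extract_code_blobs(SAMPLE_PAGE)
```

Each entry in `blobs` is an ordinary Python dict, so from here on there are no selectors involved at all.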
To recap: the Code Chunk scraping method
  • is faster to crawl because you can skip Javascript parsing
  • breaks less due to on-page layout changes
  • but, requires you to be logged into Linkedin when fetching profiles
Here is an example of data traversal with the Code Chunk Scraping method to return the Patents achievements from a user profile:
    # Person, Date and Patent are classes that are not shown in this article.
    def get_patents(data):
        """Collect every Patent entry from a parsed code-chunk JSON blob."""
        patents = []
        patent_type = 'com.linkedin.voyager.dash.identity.profile.Patent'
        for dic in Person._type_in_include_rows(data, patent_type):
            # 'issuedOn' already arrives as separate day/month/year fields,
            # so no date-string parsing is needed to build the Date.
            issued_on = None
            issued_on_dic = dic.get('issuedOn') or {}
            if issued_on_dic:
                issued_on = Date(month=issued_on_dic.get('month'),
                                 day=issued_on_dic.get('day'),
                                 year=issued_on_dic.get('year'))
            patents.append(Patent(description=dic.get('description'),
                                  application_number=dic.get('applicationNumber'),
                                  issuer=dic.get('issuer'),
                                  issued_on=issued_on,
                                  patent_number=dic.get('patentNumber'),
                                  title=dic.get('title'),
                                  url=dic.get('url')))
        return patents
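The snippet above leans on a helper, `Person._type_in_include_rows`, whose body is not shown. As a guess at what such a helper might do, here is a standalone sketch. It assumes the JSON blob keeps its records in a top-level `included` list and tags each record with a `$type` key; both are assumptions, and the sample data is invented:

```python
def type_in_include_rows(data, type_name):
    """Yield every row in data["included"] whose "$type" equals type_name."""
    for row in data.get("included", []):
        if isinstance(row, dict) and row.get("$type") == type_name:
            yield row


PATENT_TYPE = "com.linkedin.voyager.dash.identity.profile.Patent"

# Invented sample blob with one patent row and one unrelated row.
sample = {
    "included": [
        {"$type": PATENT_TYPE, "title": "Widget",
         "issuedOn": {"day": 12, "month": 3, "year": 2020}},
        {"$type": "com.example.SomethingElse", "title": "ignored"},
    ]
}

patent_titles = [row["title"] for row in type_in_include_rows(sample, PATENT_TYPE)]
```

Filtering on a type name rather than on position in the page is what keeps this method stable when the layout changes.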
So you want to build your own Linkedin Profile Scraper
In this article, I explained that scraping Linkedin profiles is a two-step process.
The first step is to crawl Linkedin profiles and save the HTML code for further processing in the second step. The second step is to process the HTML code and turn raw HTML code into structured data that you can use in your application.
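Keeping the two steps separate means the saved HTML can be re-parsed whenever the parser changes, without crawling again. Here is a minimal sketch of that hand-off through files on disk. The function names and the one-file-per-profile layout are my own choices, and `parse` stands in for whichever parsing method you use:

```python
from pathlib import Path
from typing import Callable, Dict


def save_html(out_dir: Path, profile_id: str, html: str) -> Path:
    """Step 1 output: store one crawled page as <profile_id>.html."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{profile_id}.html"
    path.write_text(html, encoding="utf-8")
    return path


def parse_saved(out_dir: Path, parse: Callable[[str], dict]) -> Dict[str, dict]:
    """Step 2: run a parser over every saved page, keyed by profile id."""
    return {
        path.stem: parse(path.read_text(encoding="utf-8"))
        for path in sorted(out_dir.glob("*.html"))
    }
```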
There are only two methods to crawl LinkedIn profiles at scale -- anonymously as Googlebot, or via a pool of workers logged into LinkedIn with unique residential IP addresses. It is not easy, but you can get yourself 1M HTML files if you work around these limitations.
The next step is to process these 1M HTML files and turn them into structured data for your application. If you crawled the page without rendering javascript but with an account logged into LinkedIn, you should use the Code Chunk Scraping method, which is superior because it breaks far less often. Otherwise, you can perform regular scraping with your favorite Dom traversal library using the Dom Parsing method. (I recommend beautifulsoup4 if you are using Python.)
Even if you are a well-funded startup, it is not trivial to crawl LinkedIn data at scale. You need a secret weapon.

Proxycurl is a managed enrichment service for LinkedIn profile URLs.

Just like how you have chosen AWS instead of building and colocating your server farms, dataset acquisition is a menial task best left as a managed service. I can only write this article in such detail because of the combined expertise of our entire development team and learned experience over the years.
Why crawl Linkedin, when you can purchase an exhaustive LinkedIn (public) profile dataset loaded with data of Linkedin profiles in the US?
Why manage a LinkedIn profile scraper when you can use our API and get a LinkedIn Profile in structured data for $0.01 per profile?
I would love to help your business integrate data at the core of your product. Send an email to [email protected] and let me know how I can help you with your data needs! Let Proxycurl be your secret weapon.
The tutorial is not complete without code samples.
In this article, I shared at a high level how you might scrape LinkedIn profiles at scale. In the follow-up article, I will release fully working code samples to complement this article. Please subscribe to Proxycurl's mailing list here to be notified of the next article with code samples!