Wednesday, 22 May, 2019 UTC


Summary

This is the second article in a two-part series on crawling LinkedIn at scale. In the earlier article, we studied why LinkedIn is a hard target to crawl. In this follow-up, I will walk through a technical tutorial, with demo code, on how you can crawl LinkedIn at scale.
Update 17th June 2020: Proxycurl has released an API for crawling LinkedIn profiles for $0.01 per profile. I highly recommend you take a look at their API: https://nubela.co/proxycurl/linkedin
In this tutorial, I will walk you through code that extracts the full name from a person's LinkedIn profile. While this tutorial focuses on only one profile, the same method can be scaled out to as many asynchronous nodes as you want.
Setting up prerequisites
  1. Python 3
  2. requests
  3. beautifulsoup4 (used in step 2 to parse the HTML)
  4. A Proxycurl credential (username and password)

How to get a Proxycurl credential

You can request a free trial Proxycurl credential at Proxycurl's website. However, with the trial credential, you are rate limited to 1 request every minute.
If you require a credential with higher rate limits, please send an email to [email protected]. You will be required to pay a trial fee for a trial key with higher rate limits.
1. Start with a LinkedIn profile and make a Proxycurl request
Let's start with a LinkedIn profile, say Bill Gates's LinkedIn profile: https://www.linkedin.com/in/williamhgates/
We will use Proxycurl's browser crawl because LinkedIn's page requires JavaScript to be rendered. Let's go into the Python code:
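import json

import requests
from requests.auth import HTTPBasicAuth

# Placeholder: replace with the Proxycurl endpoint you were given
API_HOSTNAME = 'https://replace.me.with.proxycurl.hostname.com/some_endpoint'
payload = {
    'id': 'bill-gates-crawl-id',
    'url': 'https://www.linkedin.com/in/williamhgates/',
    'type': 'browser',  # browser crawl, so JavaScript is rendered
    'headers': {'LANG': 'en'},  # a dict (header name -> value), not a set
}
# Send the JSON-encoded payload with HTTP Basic authentication
r = requests.post(API_HOSTNAME, auth=HTTPBasicAuth('USER', 'PASSWD'), data=json.dumps(payload))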

Let's break this down

In the code snippet above, you are making a Proxycurl request of type browser which means it is a browser crawl request. A browser crawl request simulates opening the page in a real browser, with real user sessions.
The headers parameter in the payload dictionary ensures that the page is always returned in English, because not all of our nodes are located in English-speaking countries. It overwrites the LANG header in the request with the value en.
Now that we have crafted the payload, we send the request off by calling requests.post(). This makes an API request with the HTTP POST method to Proxycurl's servers, and Proxycurl then forwards the request to a randomly selected node.
All you have to do now is wait for a response.

I tried this, but the response is not a proper LinkedIn profile page

Not all nodes are logged into LinkedIn. Please retry a few times until you get a positive result.
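The retry advice above can be sketched as a small loop. This is only a minimal sketch, not part of any Proxycurl client: fetch_with_retry, looks_like_profile and the fetch callable are hypothetical names I am introducing here, and checking for the pv-top-card-section__name class is an assumed signal that a logged-in profile page came back. The pause defaults to 60 seconds so that it stays within the trial limit of 1 request every minute.

```python
import time
from typing import Callable, Optional


def looks_like_profile(html: str) -> bool:
    # Assumption: a rendered, logged-in profile page contains the
    # CSS class of the name heading somewhere in its HTML.
    return "pv-top-card-section__name" in html


def fetch_with_retry(
    fetch: Callable[[], str],
    is_valid: Callable[[str], bool] = looks_like_profile,
    max_attempts: int = 5,
    pause_seconds: float = 60.0,
) -> Optional[str]:
    """Call fetch() until is_valid(html) is true or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        html = fetch()
        if is_valid(html):
            return html
        # Do not sleep after the final failed attempt
        if attempt < max_attempts:
            time.sleep(pause_seconds)
    return None
```

Here fetch would wrap the requests.post() call from the snippet above and return the HTML string from the response. Passing it in as a callable keeps the retry logic separate from the network code.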

The page loads, but the page is not rendered

On slower computers or internet connections, the AJAX calls made by the page's JavaScript when the page loads take longer to complete. When the page only has 500 ms (half a second) to
  • Make AJAX requests to populate the page
  • Render the UI elements from those AJAX requests
you should expect that results might be incomplete. To solve this problem, we increase the value of dom_read_delay_ms from its default of 500 (ms) to 30000 (ms). This asks the browser to wait 30 seconds after the page has loaded (similar to the point at which jQuery's $(document).ready() fires).
Modifying the payload to include the dom_read_delay_ms parameter
payload = {
    'id': 'bill-gates-crawl-id',
    'url': 'https://www.linkedin.com/in/williamhgates/',
    'type': 'browser',
    'headers': {'LANG': 'en'},  # a dict (header name -> value), not a set
    'dom_read_delay_ms': 30000,  # wait 30 seconds after the page has loaded
}
(Adding dom_read_delay_ms to the payload)
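If you plan to crawl more than one profile, it may help to build the payload with a small function instead of copying the dictionary around. This is a minimal sketch: build_payload and to_request_body are hypothetical helper names, and the only fields used are the ones already shown in this tutorial. Note that headers is written as a dictionary so that json.dumps() can serialize it.

```python
import json
from typing import Any, Dict


def build_payload(
    profile_url: str,
    crawl_id: str,
    dom_read_delay_ms: int = 30000,
) -> Dict[str, Any]:
    """Build a browser crawl payload for one LinkedIn profile URL."""
    return {
        'id': crawl_id,
        'url': profile_url,
        'type': 'browser',
        'headers': {'LANG': 'en'},
        'dom_read_delay_ms': dom_read_delay_ms,
    }


def to_request_body(payload: Dict[str, Any]) -> str:
    # The request body is the JSON-encoded payload
    return json.dumps(payload)
```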
2. Use BeautifulSoup to extract full name from the HTML
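from bs4 import BeautifulSoup

response_dic = r.json()  # decode the JSON response body into a dictionary
soup = BeautifulSoup(response_dic['data'], 'html.parser')  # explicit parser avoids a bs4 warning
h1 = soup.find_all("h1", class_="pv-top-card-section__name")[0]  # first (and only) match
print(h1.text)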
Let's break down the code.
In line 1, we import the BeautifulSoup module. We use BeautifulSoup to parse the HTML document retrieved from the Proxycurl request, and to navigate the DOM elements to extract the relevant data.
In line 3, we decode the JSON response body from requests into a dictionary. The HTML document is contained in the data key of the dictionary, so we pass that value in to initialize the BeautifulSoup object in line 4.
With the BeautifulSoup object initialized, in line 5 we search the HTML document for an h1 element with the class pv-top-card-section__name. Because the .find_all() method returns a list, we assign the first result in the list to the h1 variable. (There should only be one.)
Finally, Bill Gates's full name is printed out in line 6.
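One thing to watch out for: indexing with [0] raises an IndexError when the heading is missing, which is exactly what happens when a node is not logged in or the page has not finished rendering. Below is a slightly more defensive sketch of the same extraction. extract_full_name is a hypothetical helper name; it uses BeautifulSoup's find(), which returns None when nothing matches.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_full_name(html: str) -> Optional[str]:
    """Return the profile's full name, or None if the heading is absent."""
    soup = BeautifulSoup(html, "html.parser")
    # find() returns the first matching element, or None if there is none
    h1 = soup.find("h1", class_="pv-top-card-section__name")
    if h1 is None:
        return None
    # strip=True trims the whitespace around the name
    return h1.get_text(strip=True)
```

A None result is a reasonable signal that the request should be retried.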
3. Scaling it up (an exercise for the reader)
In steps 1 and 2, we built a prototype that extracts a user's full name from their LinkedIn profile. But there is a lot more to do before you have a full-fledged, scalable crawler. Here are some suggestions:
  • Consider using asyncio to launch multiple requests
  • Notice that each Proxycurl request takes quite a bit of time, especially after you increase dom_read_delay_ms to 30 seconds, which means each request takes at least 30 seconds. You do not want to keep waiting for responses to return. Instead, you can have Proxycurl call back to a web endpoint that you have set up with the result. This is what the id in the payload is for. See the asynchronous browser crawl documentation page for more information.
  • Check for errors and retry! Possible errors include, but are not limited to:
      • Page isn't rendered completely or properly
      • LinkedIn is not logged in
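To make the asyncio suggestion a little more concrete, here is one possible sketch. It assumes you already have a blocking function, called fetch here, that takes a profile URL and returns the page HTML (for example by wrapping the requests.post() call from step 1). crawl_all, crawl_one and max_concurrency are hypothetical names of my own, and asyncio.to_thread() needs Python 3.9 or newer.

```python
import asyncio
from typing import Callable, Dict, List, Optional


async def crawl_all(
    urls: List[str],
    fetch: Callable[[str], Optional[str]],
    max_concurrency: int = 5,
) -> Dict[str, Optional[str]]:
    """Run the blocking fetch(url) for every URL, a few at a time."""
    # The semaphore caps how many requests are in flight at once
    semaphore = asyncio.Semaphore(max_concurrency)

    async def crawl_one(url: str) -> Optional[str]:
        async with semaphore:
            try:
                # Run the blocking call in a worker thread so the
                # event loop stays free for the other requests.
                return await asyncio.to_thread(fetch, url)
            except Exception:
                # A real crawler would log the error and schedule a retry
                return None

    # gather() returns results in the same order as the input URLs
    results = await asyncio.gather(*(crawl_one(u) for u in urls))
    return dict(zip(urls, results))
```

You would start it with asyncio.run(crawl_all(urls, fetch)). Any URL that maps to None in the result is a candidate for a retry. Keep your credential's rate limit in mind when choosing max_concurrency.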