This is a two-part series on crawling LinkedIn at scale. In an earlier article, we studied why LinkedIn is a hard target to crawl. In this follow-up, I will dive deep into a technical tutorial, with demo code, on how you can crawl LinkedIn at scale.
Update 17th June 2020: Proxycurl has released an API for crawling LinkedIn profiles for $0.01 per profile. I highly recommend you take a look at their API: https://nubela.co/proxycurl/linkedin
In this tutorial, I will walk you through code that extracts the full name from a person's LinkedIn profile. While this tutorial focuses on only one profile, the same method can be scaled to as many asynchronous nodes as you want.
Setting up prerequisites
- Python 3
- The requests library
- The beautifulsoup4 library (used in step 2)
- A Proxycurl credential (username and password)
How to get a Proxycurl credential
You can request a free trial Proxycurl credential at Proxycurl's website. However, with the trial credential, you are rate-limited to one request per minute.
If you require a credential with higher rate limits, please send an email to [email protected]. You will be required to pay a trial fee for a trial key with higher rate limits.
1. Start with a LinkedIn profile and make a Proxycurl request
Let's start with a LinkedIn profile, say Bill Gates's LinkedIn profile: https://www.linkedin.com/in/williamhgates/
We will use Proxycurl's browser crawl because LinkedIn's pages require JavaScript to render. Let's go into the Python code:
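import json

import requests
from requests.auth import HTTPBasicAuth

# Placeholder: replace with the Proxycurl endpoint URL you were given.
API_HOSTNAME = 'https://replace.me.with.proxycurl.hostname.com/some_endpoint'

payload = {
    'id': 'bill-gates-crawl-id',
    'url': 'https://www.linkedin.com/in/williamhgates/',
    'type': 'browser',  # browser crawl, so JavaScript is executed
    'headers': {'LANG': 'en'},  # must be a dict, not a set
}

# 'USER' and 'PASSWD' stand in for your Proxycurl credential.
r = requests.post(API_HOSTNAME, auth=HTTPBasicAuth('USER', 'PASSWD'), data=json.dumps(payload))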
Let's break this down
In the code snippet above, you are making a Proxycurl request of type browser, which means it is a browser crawl request. A browser crawl request simulates opening the page in a real browser, with real user sessions.
The headers parameter in the payload dictionary ensures that the page is always returned in English, because not all of our nodes are located in English-speaking countries. It overwrites the LANG header in the request with the value en.
Now that we have crafted the payload, we send the request off by calling requests.post(). This makes an API request with the HTTP POST method to the Proxycurl servers, which then forward the request to a randomly selected node.
All you have to do now is wait for a response.
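If you plan to crawl more than one profile, it may help to wrap the payload construction in a small function. The following is a minimal sketch, not part of Proxycurl's API: the function name build_payload and its parameters are hypothetical, and the field names are simply the ones used in the snippet above. Note that headers has to be a dict (key: value), because json.dumps cannot serialize a Python set.

```python
import json


def build_payload(url, crawl_id, lang="en"):
    """Return the JSON body for one browser crawl request (hypothetical helper)."""
    payload = {
        "id": crawl_id,             # your own identifier for this crawl
        "url": url,                 # the profile page to open
        "type": "browser",          # browser crawl, so JavaScript runs
        "headers": {"LANG": lang},  # a dict, so it serializes to a JSON object
    }
    return json.dumps(payload)


body = build_payload("https://www.linkedin.com/in/williamhgates/", "bill-gates-crawl-id")
print(body)
```

The returned string can be passed as the data argument of requests.post(), exactly as in the snippet above.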
I tried this, but the response is not a proper LinkedIn profile page
Not all nodes are logged into LinkedIn. Please retry a few times until you get a positive result.
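The retry advice can be sketched as a small loop. This is only an illustration under a few assumptions: send_request is a hypothetical zero-argument function that performs one Proxycurl request and returns the HTML as a string, and "logged in" is detected by looking for the pv-top-card-section__name class that step 2 of this tutorial relies on. The default wait of 60 seconds mirrors the trial credential's limit of one request per minute.

```python
import time

# Class name that step 2 of this tutorial searches for; used here as a
# rough signal that a real profile page came back (an assumption).
PROFILE_MARKER = "pv-top-card-section__name"


def crawl_with_retry(send_request, max_attempts=5, wait_seconds=60.0):
    """Call send_request() until the returned HTML looks like a profile page."""
    for attempt in range(1, max_attempts + 1):
        html = send_request()
        if PROFILE_MARKER in html:
            return html
        if attempt < max_attempts:
            # Respect the rate limit before asking another node.
            time.sleep(wait_seconds)
    raise RuntimeError(f"no profile page after {max_attempts} attempts")
```

With a paid credential you can lower wait_seconds; the marker check is deliberately crude and can be swapped for whatever test suits your pages.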
The page loads, but the page is not rendered
On slower computers or internet connections, the AJAX calls made by the page's JavaScript when the page loads take longer to complete. When the page only has 500ms (half a second) to
- Make AJAX requests to populate the page
- Render the UI elements from those AJAX requests
you should expect that the results might be incomplete. To solve this problem, we have to increase the value of dom_read_delay_ms from its default of 500 (ms) to 30000 (ms). This asks the browser to wait 30 seconds after the page has loaded (like jQuery's $(document).ready()).
Modifying payload to include the dom_read_delay_ms parameter:
payload = {
    'id': 'bill-gates-crawl-id',
    'url': 'https://www.linkedin.com/in/williamhgates/',
    'type': 'browser',
    'headers': {'LANG': 'en'},
    'dom_read_delay_ms': 30000,  # wait 30 seconds after page load
}
(Adding dom_read_delay_ms to the payload)
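Waiting a full 30 seconds on every request is expensive, so one possible refinement is to start with a short delay and only increase it when the page comes back unrendered. The sketch below is my own suggestion rather than anything Proxycurl prescribes: send_payload is a hypothetical function that posts one payload and returns the HTML string, the delay schedule is arbitrary, and "rendered" is approximated by the presence of the class name used in step 2.

```python
# Class name that step 2 searches for; treated here as a sign of a rendered page.
PROFILE_MARKER = "pv-top-card-section__name"


def crawl_with_growing_delay(send_payload, base_payload, delays_ms=(500, 5000, 30000)):
    """Try increasing dom_read_delay_ms values until the page looks rendered.

    Returns (html, delay_ms) on success, or (None, None) if every delay failed.
    """
    for delay_ms in delays_ms:
        # Copy the payload so the caller's dictionary is left untouched.
        payload = dict(base_payload, dom_read_delay_ms=delay_ms)
        html = send_payload(payload)
        if PROFILE_MARKER in html:
            return html, delay_ms
    return None, None
```

If most of your pages only render at the longest delay, this costs more requests than it saves, in which case setting 30000 up front as shown above is the simpler choice.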
2. Use BeautifulSoup to extract the full name from the HTML
from bs4 import BeautifulSoup

response_dic = r.json()
soup = BeautifulSoup(response_dic['data'], 'html.parser')
h1 = soup.find_all("h1", class_="pv-top-card-section__name")[0]
print(h1.text)
Let's break down the code.
First, we import the BeautifulSoup module. We use BeautifulSoup to parse the HTML document retrieved from the Proxycurl request, and to navigate the DOM elements to extract relevant data.
Next, r.json() unpacks the JSON response from requests into a dictionary. The HTML document is contained in the data key of the dictionary, so we pass that value to the BeautifulSoup constructor to initialize the soup object.
With the BeautifulSoup object initialized, we search the HTML document for an h1 element with the class pv-top-card-section__name. Because the .find_all() method returns a list, we assign the first result in the list to the h1 variable. (There should only be one.)
Finally, print(h1.text) prints the full name of Bill Gates.
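Indexing the first result with [0] raises an IndexError when the node was not logged in or the page did not render, so it is worth having a lookup that returns None instead. The sketch below deliberately swaps BeautifulSoup for html.parser from Python's standard library so that it runs without extra packages; it is an alternative illustration of the same lookup, not the tutorial's method. The names NameExtractor and extract_full_name are hypothetical.

```python
from html.parser import HTMLParser

TARGET_CLASS = "pv-top-card-section__name"


class NameExtractor(HTMLParser):
    """Collect the text of the first <h1> carrying TARGET_CLASS."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._chunks = []
        self.name = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and self.name is None:
            # The class attribute may hold several space-separated classes.
            classes = (dict(attrs).get("class") or "").split()
            if TARGET_CLASS in classes:
                self._inside = True

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._inside:
            self._inside = False
            self.name = "".join(self._chunks).strip()


def extract_full_name(html):
    """Return the full name found in html, or None if the element is missing."""
    parser = NameExtractor()
    parser.feed(html)
    parser.close()
    return parser.name
```

A None result is a convenient signal to retry the request. With BeautifulSoup itself, soup.find() gives the same behavior, since it returns None when nothing matches.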
3. Scaling it up (an exercise for the reader)
In steps 1 and 2, we built a prototype that extracts a user's full name from their LinkedIn profile. But there is a lot more to do before you have a full-fledged, scalable crawler. Here are some suggestions:
- Consider using asyncio to launch multiple requests.
- Notice that each Proxycurl request takes quite a bit of time, especially after you increase dom_read_delay_ms to 30 seconds, which means each request takes at least 30 seconds. You do not want to keep waiting for responses to return. Instead, you can have Proxycurl call back a web endpoint that you have set up with the result. This is what the id in the payload is for. See the asynchronous browser crawl documentation page for more information.
- Check for errors and retry! Possible errors include, but are not limited to:
  - Page isn't rendered completely or properly
  - LinkedIn is not logged in
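As a starting point for the asyncio suggestion, here is a minimal sketch that runs a blocking crawl function for many URLs concurrently while capping the number of requests in flight. It assumes Python 3.9 or later (for asyncio.to_thread), and crawl_one is a hypothetical function that takes a profile URL, performs steps 1 and 2, and returns the result. It does not cover the callback approach.

```python
import asyncio


async def crawl_many(urls, crawl_one, max_concurrency=5):
    """Run crawl_one(url) for every URL, at most max_concurrency at a time.

    Returns a dict mapping each URL to its result, or to the exception raised.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url):
        async with semaphore:
            try:
                # crawl_one is blocking (it uses requests), so run it in a thread.
                return url, await asyncio.to_thread(crawl_one, url)
            except Exception as exc:
                # Keep the error so the caller can decide what to retry.
                return url, exc

    pairs = await asyncio.gather(*(worker(url) for url in urls))
    return dict(pairs)
```

You would call it with asyncio.run(crawl_many(profile_urls, crawl_one)). Keep max_concurrency within your credential's rate limit; with the trial credential's one request per minute, concurrency brings no benefit.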