Wednesday, 21 November, 2018 UTC


Introduction

By definition, web scraping means getting useful information from web pages. The process should be automated, remove the hassle of browsing pages manually, and let you gather and classify the information you're interested in programmatically.
Node.js is a great tool for web scraping. It lets you implement a web scraping routine in a couple of lines of code using open source modules provided by npm - the Node Package Manager.

The Main Steps of Web Scraping

As we have already defined, web scraping is nothing more than automating the manual browsing and collecting of information from specific websites in your preferred web browser.
This process consists of three main steps:
  • Getting the HTML source code from the website
  • Making sense of the HTML content, finding the information we're interested in, and extracting it
  • Moving the discovered information to a storage of your choice (text file, database, etc.)
The first and final steps are usually much the same from project to project, varying only with your application's requirements. Making sense of the HTML content, however, requires you to write code specific to every website you'd like to scrape.
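As a rough illustration, here's what that three-step skeleton could look like in Node.js; the helper names fetchHtml, extractData, and saveData are hypothetical placeholders that we'll replace with real modules later in this article:
const scrape = async () => {
  // Step 1: get the HTML source code from the website
  const html = await fetchHtml('https://example.com/some-listing');
  // Step 2: make sense of the HTML and extract the information we're interested in
  const records = extractData(html);
  // Step 3: move the discovered information to a storage of your choice
  await saveData(records);
};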

Caution

Depending on how you use these techniques and technologies, your application could be performing illegal actions.
There are a few cases you should be cautious about:
  • DoS - A Denial of Service attack practically relies on sending so many requests to the server that it simply can't handle any more. All new incoming requests would then be denied. If you're scraping a website too often, it can be considered a DoS attack (a simple way to throttle your requests is sketched after this list).
  • Terms of Service - Many websites, and almost all bigger ones, clearly state that web scraping on their platforms is prohibited. If many people scraped these websites, it would pretty much end up being a DDoS attack, which is itself illegal.
  • Abusive Software - Many tools and frameworks online offer a large variety of tools and functionalities. Some allow users to fill in forms, submit data, and upload and download files. CAPTCHAs are used to combat this; however, even they can be bypassed by bots.
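To avoid overwhelming a server, it's good practice to space your requests out. Here's a minimal sketch of a delay helper you could await between requests; the helper name and the one-second pause are arbitrary choices for illustration:
// Resolves after the given number of milliseconds, so we can pause between requests
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

// Inside an async function, pause for roughly a second before the next request:
// await delay(1000);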

Manual Web Scraping Algorithm

As an example, we'll be searching the Yellow Pages for companies providing printing services in New York.
First of all, let's define our task and the desired result:
Produce a list of the companies providing printing services in New York in the form of a ".CSV" file with name, email, phone, and link columns describing each company.
Here's how we would do that manually:
  1. Navigate our browser to the appropriate link, enter "printing" and "New York" into the search fields, and execute the search
  2. Select and store the name of the first company in the list
  3. Store the link to the company page and follow it
  4. Find the phone number and an email address and store them
  5. Write stored values to a ".CSV" file either by using a text editor or another tool for editing tables like Excel or Google Sheets
Repeating these steps multiple times will give us a table filled with company details.

Automating the Process with Web Scraping

To automate the process, we should follow the same steps programmatically.

Setting up the Development Environment

We'll be using Node.js and npm to develop this sample project, so make sure those tools are installed on your machine. Let's start by running the following command in an empty directory of your choice, and then create an empty index.js file that will contain our code:
$ npm init
The next step is to install the required modules from npm.
From the manual algorithm described above, we can see that we'll need something to get the HTML source code, something to parse the content and make sense of it, and something to write the resulting array of JavaScript objects into a ".CSV" file:
$ npm install --save request request-promise cheerio objects-to-csv
Executing the preceding line in the terminal will install the required modules in the node_modules directory and save them as dependencies in the package.json file.

Retrieving Information

Once all the preparation is done, you can open the index.js file in your favorite text editor and require the modules we've just installed:
const rp = require('request-promise');  
const otcsv = require('objects-to-csv');  
const cheerio = require('cheerio');  
Completing this step of the manual algorithm gives us the link, which we'll split into two parts and add to index.js right after the module imports:
const baseURL = 'https://www.yellowpages.com';  
const searchURL = '/search?search_terms=printing&geo_location_terms=New+York%2C+NY';
Then we should write a function that will return the array of JavaScript objects representing the companies from the task description.
In order to be able to translate the manual algorithm into code, we'll first have to do some manual work using the inspector tool of our web browser.
We'll need to find the specific HTML elements that hold the information we're interested in. In our case, the business name can be found in an element like:
<a class="business-name" href="/new-york-ny/mip/global-copy-2988131?lid=1000106255864" itemprop="name">Global Copy</a>  
This is an <a> tag with the class business-name that holds the name of the company, and its href attribute holds a link to the individual company page. Both of these will be useful to us later.
Following the link to the company page, we'll need to find the remaining two pieces of data: the phone and the email. They are located in the <p> tag with the class phone and the <a> tag with the class email-business.
Note that to get the phone we'll need the text stored inside the tag, while to get the email we'll need the href attribute of the <a> tag.
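Just to illustrate that difference, here's a minimal cheerio sketch, assuming innerHtml already holds the HTML of a company page (we'll fetch it properly in the full function below):
// The phone number is the text inside the <p class="phone"> tag
const phone = cheerio('p.phone', innerHtml).text();
// The email lives in the href attribute of the <a class="email-business"> tag,
// so it comes back with a "mailto:" prefix
const email = cheerio('a.email-business', innerHtml).attr('href');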
Let's look at what the manual algorithm looks like when we try to implement it programmatically:
  1. Get the HTML source of the page that we're looking to scrape by using the request-promise module and feeding it the link we got in the first step of our manual algorithm
  2. Map the array of the company names in the original HTML into an array of objects with the properties name, link, phone, and email
  3. Convert the resulting array into a CSV file
Here's what this algorithm looks like when implemented:
const getCompanies = async () => {
  // Get the HTML source of the search results page
  const html = await rp(baseURL + searchURL);
  // Map every business name link into an object with the data we need
  const businessMap = cheerio('a.business-name', html).map(async (i, e) => {
    const link = baseURL + e.attribs.href;
    // Fetch the individual company page that holds the email and the phone
    const innerHtml = await rp(link);
    const emailAddress = cheerio('a.email-business', innerHtml).prop('href');
    const name = e.children[0].data;
    const phone = cheerio('p.phone', innerHtml).text();

    return {
      emailAddress,
      link,
      name,
      phone,
    };
  }).get();
  // The map callback is async, so wait for every company page to be processed
  return Promise.all(businessMap);
};
It follows the rules we've set up earlier and returns a Promise that resolves into an array of JavaScript objects:
{
  "emailAddress": "mailto:[email protected]",
  "link": "https://www.yellowpages.com/new-york-ny/mip/global-copy-2988131?lid=1000106255864",
  "name": "Global Copy",
  "phone": "(212) 222-2679"
}

Promise

Just in case you're unfamiliar with the core concepts of asynchronous programming in modern JavaScript, here's a short introduction to Promises.
A Promise is a special type that acts as a placeholder for a value. It can be in one of three states:
  • pending
  • fulfilled
  • rejected
Without getting too detailed, you should just know that functions that return Promises don't return the actual values right away. To access the result of a Promise, or its resolution value, you write another function and pass it into the then block; it receives the value the Promise is resolved with. If an error occurs inside the Promise while it's pending, it transitions into the rejected state, and the error can be handled in a catch block.
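As a minimal illustration, using the getCompanies function from this article:
getCompanies()
  .then(companies => {
    // Called once the Promise is fulfilled with its resolution value
    console.log(`Scraped ${companies.length} companies`);
  })
  .catch(err => {
    // Called if the Promise is rejected with an error
    console.error('Scraping failed:', err);
  });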

Storing Data and Finishing Touches

Although the result seems fairly good, taking a closer look at the array of companies shows us that the email has an unnecessary mailto: prefix, and that the emails and company names are sometimes missing:
{
  "emailAddress": undefined,
  "link": "https://www.yellowpages.com/new-york-ny/mip/apm-451909424?lid=451909424",
  "name": undefined,
  "phone": "(212) 856-9800"
}
Going back to the inspector tool, we can see that the name of the company can also be found on the individual company page, inside an <h1> tag.
The emails are simply missing for some companies, and we can't do anything about that.
The mailto: prefix can be removed with a replace call.
Here are the adjustments you should make to your code:
...
    const name = e.children[0].data || cheerio('h1', innerHtml).text();
...
    emailAddress: emailAddress ? emailAddress.replace('mailto:', '') : '',
...
Now that we have extracted all the necessary data from the web page and have a clean array of JavaScript objects, we can prepare the ".CSV" file required by our task definition:
getCompanies()  
  .then(result => {
    const transformed = new otcsv(result);
    return transformed.toDisk('./output.csv');
  })
  .then(() => console.log('SUCCESSFULLY COMPLETED THE WEB SCRAPING SAMPLE'));
The getCompanies function returns a promise which resolves into an array of objects prepared to be written to the CSV file, which is done in the first then block. The second one resolves when the CSV file is successfully written to the file system, thus completing the task.
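If you want network or file system errors to be reported instead of failing silently, you could optionally append a catch block to the end of this chain; this is an addition for robustness, not part of the original snippet:
  // ...
  .then(() => console.log('SUCCESSFULLY COMPLETED THE WEB SCRAPING SAMPLE'))
  .catch(err => console.error('Something went wrong:', err));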
After adding this code chain to our index.js file, we can run it with:
node index.js  
We should get an output.csv file right in our working directory. The table represented by this .CSV file contains four columns (emailAddress, link, name, and phone), and each row describes a single company.
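For reference, the first lines of output.csv could look roughly like this; the email address is a placeholder since the real one isn't reproduced here, and the actual rows depend on the live search results:
emailAddress,link,name,phone
info@example.com,https://www.yellowpages.com/new-york-ny/mip/global-copy-2988131?lid=1000106255864,Global Copy,(212) 222-2679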

Conclusion

In essence, web scraping is browsing web pages, picking up useful information according to the task, and storing it somewhere, all done programmatically. To be able to do this with code, the process should first be done manually, using the browser's inspector tool or by analyzing the raw HTML content of the target web page.
Node.js provides reliable and simple tools to make web scraping a straightforward task that could save you a lot of time when compared to processing the links manually.
Although it might seem tempting to automate all of your everyday tasks like this, be careful when it comes to making these tools, because they can easily break the law if you don't fully understand the source website's terms or the amount of traffic your web scraper is generating.