Simple Web Scraping with Node.js Using a Shell Script


Recently, a relative of mine had been struggling with the tedious work of constantly screenshotting product images from a website for her job. I thought to myself, "wouldn't a web scraping script solve this problem easily?" With that in mind, I quickly put together a web scraping program in JavaScript.

I had only done web scraping in Java before, but since exporting a JAR file for her to run wasn't an option, I decided to try it in JavaScript. After a little bit of research, I decided to use a module called "puppeteer".

const puppeteer = require("puppeteer"); // headless browser automation
const fs = require("fs");

(async () => {
  const browser = await puppeteer.launch(); // launch a headless browser
  const page = await browser.newPage(); // open a new tab
  await page.goto(process.argv[2]); // go to the site
  // process.argv[2] is the first command-line argument: argv[0] is the node
  // binary and argv[1] is the script path, so the URL typed in the terminal
  // lands at index 2.

  await page.waitForSelector("#largeImageParentBox > div.large-image-clear"); // wait for the image to load
  const element = await page.$("#largeImageParentBox > div.large-image-clear"); // ElementHandle for the image
  const name_element = await page.$(
    "#product-detail-wrap > div.product-setinfo-wrap > div > div.product-name-wrap > div > dl:nth-child(1) > dd > a"
  );
  const name = await page.evaluate((el) => el.textContent, name_element); // read the product name
  if (!fs.existsSync("./" + name)) fs.mkdirSync("./" + name); // make a folder named after the product
  await element.screenshot({ path: "./" + name + "/image.png" }); // screenshot just that element
  await browser.close(); // close the browser
})();

What the script basically does is take a screenshot of the image at

"#largeImageParentBox > div.large-image-clear"

and then save that image in a folder named after the product, which is read from

"#product-detail-wrap > div.product-setinfo-wrap > div > div.product-name-wrap > div > dl:nth-child(1) > dd > a"

These strings identifying the locations of the elements are CSS selectors. Some people use XPath instead, but in this case it's easier to use CSS selectors (the same syntax jQuery uses). In addition, web scraping is usually written inside an asynchronous function: browser operations like navigation and rendering take time, so await lets the script pause for each step, while independent operations can run concurrently, much like multiple threads.
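To illustrate the concurrency point, here is a minimal sketch (the timings and names are made up, not from the scraper) showing how Promise.all lets two slow operations run at the same time instead of one after another:

```javascript
// Simulate a slow browser operation that resolves after `ms` milliseconds.
const delay = (ms, value) =>
  new Promise((resolve) => setTimeout(() => resolve(value), ms));

// Both "operations" start immediately; Promise.all waits for them together,
// so the total time is roughly max(50, 50) ms, not 50 + 50 ms.
async function fetchBoth() {
  return Promise.all([delay(50, "image"), delay(50, "name")]);
}
```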

Normally, to run this script, I would enter "node scraper.js" followed by the URL. However, to make it more usable, I created a shell script so the user can pass in the URL as an argument directly from the terminal.

#!/bin/bash
echo "$1"             # print the URL passed in
node scraper.js "$1"  # quoting "$1" keeps URLs with special characters intact

In the terminal, the user can obtain images from the website by running the shell script directly, like this: "./command.sh www.thewebsite.com". The "$1" refers to the first argument after the script name ("$0" is the script itself).
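As a quick standalone demonstration of how bash numbers its arguments (separate from the scraper's command.sh), `set --` can simulate the positional parameters:

```shell
#!/bin/bash
# Simulate running: ./command.sh www.thewebsite.com
set -- "www.thewebsite.com"
echo "first argument: $1"   # $1 is the first argument after the script name
echo "argument count: $#"
```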
