Recently, a relative of mine has been struggling with the tedious work of constantly screenshotting product images from a website for her job. I thought to myself, "wouldn't a web scraping script solve this problem easily?" With that in mind, I quickly put together a web scraping program in JavaScript.
I've only done web scraping in Java before, but since handing her an exported jar file wasn't really an option, I decided to try it with JavaScript instead. After doing a little bit of research, I decided to use a module called "puppeteer".
```javascript
const puppeteer = require("puppeteer"); // include lib
const fs = require("fs");

(async () => { // declare and immediately invoke an async function
  const browser = await puppeteer.launch(); // run browser
  const page = await browser.newPage(); // open new tab
  // process.argv[2] is the third element of the terminal command (index 2),
  // so the argument passed in from the terminal goes straight to the script.
  await page.goto(process.argv[2]); // go to site
  await page.waitForSelector("#largeImageParentBox > div.large-image-clear"); // wait for the selector to load
  const element = await page.$("#largeImageParentBox > div.large-image-clear"); // ElementHandle for the product image
  const name_element = await page.$(
    "#product-detail-wrap > div.product-setinfo-wrap > div > div.product-name-wrap > div > dl:nth-child(1) > dd > a"
  );
  const name = await page.evaluate((el) => el.textContent, name_element); // read the product name
  if (!fs.existsSync("./" + name)) fs.mkdirSync("./" + name); // make a folder named after the product
  await element.screenshot({ path: "./" + name + "/image.png" }); // screenshot just that element
  await browser.close(); // close browser
})();
```
What I did was basically screenshotting the image at
"#largeImageParentBox > div.large-image-clear"
and then putting that image in a folder with the name of the product at
"#product-detail-wrap > div.product-setinfo-wrap > div > div.product-name-wrap > div > dl:nth-child(1) > dd > a"
These strings indicating the locations of the elements are CSS selectors, sometimes called "selector paths". Some people use XPath instead, but in this case it's easier to use CSS selectors, the same syntax jQuery uses, since browser dev tools will copy them for you. In addition, web scraping is usually done inside asynchronous functions: most of the work is waiting on the network (loading a page, waiting for a selector), and async/await lets the script wait without blocking, somewhat like multiple threads, so independent steps can even run in parallel.
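To make the parallelism point concrete, here is a minimal sketch of how several pages could be scraped concurrently with Promise.all. The function `scrapeProduct` is a hypothetical stand-in for the per-page logic in the script above; here it just simulates asynchronous work so the pattern is clear.

```javascript
// Hypothetical stand-in for the real per-page scraping logic
// (open a tab, wait for selectors, screenshot, etc.).
const scrapeProduct = async (url) => {
  return `scraped ${url}`;
};

const main = async () => {
  const urls = ["https://example.com/a", "https://example.com/b"];
  // Promise.all starts every scrape at once and waits until all finish,
  // instead of awaiting each page one after another.
  return await Promise.all(urls.map(scrapeProduct));
};
```

With real Puppeteer calls inside `scrapeProduct`, each URL would get its own tab and the waiting would overlap instead of adding up.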
Normally, to run this script, I would enter "node scraper.js www.thewebsite.com" in the terminal, but to make it simpler I wrapped the command in a shell script:
```bash
#!/bin/bash
echo $1
sudo node scraper.js $1
```
In the terminal, the user can obtain images from the website by running the shell script directly, like this: "./command.sh www.thewebsite.com". The "$1" represents the first argument passed to the script (the URL here). More information about passing arguments in the terminal can be found here.
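On the Node side, the URL the shell script forwards ends up in `process.argv`. A small sketch of its layout, using a simulated argv array so it can run standalone:

```javascript
// process.argv layout in Node: [0] is the node binary, [1] is the script
// path, and [2] is the first user-supplied argument, the URL in our case.
// Simulated here; in the real script you would read process.argv directly.
const argv = ["/usr/bin/node", "/home/user/scraper.js", "www.thewebsite.com"];
const url = argv[2];
console.log(url); // prints "www.thewebsite.com"
```

This is why the scraper reads `process.argv[2]`: indices 0 and 1 are always taken by the runtime and the script path.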