Node.js Web Scraping
In this tutorial, you will learn how to retrieve data from websites and web pages using Node.js and cheerio.
Aug 01, 2017
What is web scraping?
Web scraping is a technique for retrieving data from websites using a script. It automates the laborious work of copying data from various websites by hand.
Web scraping is generally performed when the desired website doesn't expose an external API for public consumption. Some common web scraping scenarios are:
- Fetch trending posts on social media sites.
- Fetch email addresses from various websites for sales leads.
- Fetch news headlines from news websites.
For example, suppose you want to scrape medium.com blog posts using the following URL: https://medium.com/search?q=node.js
After that, open the Inspector in Chrome DevTools and examine the page's DOM elements.
If you look carefully, the markup has a pattern: we can scrape the posts using their element class names.
Web Scraping with Node.js and Cheerio
Follow the steps below to retrieve (scrape) blog post data from medium.com using Node.js and cheerio:
Step 1: Set Up the Node.js Project
Let's set up the project to scrape medium blog posts. Create a project directory.
mkdir nodewebscraper
cd nodewebscraper
npm init --yes
Install the required dependencies:
npm install express request cheerio express-handlebars
Step 2: Making the HTTP Request
First, make the HTTP request to get the webpage's HTML:
request(`https://medium.com/search?q=${tag}`, (err, response, html) => {
  // `html` contains the full markup of the page
})
Step 3: Extract Data From Blog Posts
Once you retrieve the search results page from medium.com, you can use cheerio to scrape the data you need:
const $ = cheerio.load(html)
This loads the markup into the `$` variable. If you have used jQuery before, you know why we use `$` here (it follows the old-school jQuery naming convention).
Now, you can traverse through the DOM tree.
Since you only need the title and link of each scraped blog post on your web page, you can select elements in the HTML using either their own class name or the class name of a parent element.
First, we select every blog post element; each has .js-block as a class name.
$('.js-block').each((i, el) => {
  // .js-block is the class name of every blog post DIV.
})
The `each` method iterates over every element that has the class name `js-block`.
Next, you scrape the title and link of each blog post from medium.com.
$('.js-block').each((i, el) => {
  const title = $(el)
    .find('h3')
    .text()
  const article = $(el)
    .find('.postArticle-content')
    .find('a')
    .attr('href')
  let data = {
    title,
    article,
  }
  console.log(data)
})
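The per-post extraction above can be factored into a small pure helper, which also guards against elements that lack a title or link (not every `.js-block` element is guaranteed to contain both). The `buildResults` helper is our own illustration, not part of the tutorial:

```javascript
// Hypothetical helper: turn raw [title, href] pairs (as extracted by
// cheerio) into the { title, article } objects the tutorial collects.
// Pairs with a missing title or link are skipped.
function buildResults(pairs) {
  return pairs
    .filter(([title, href]) => Boolean(title) && Boolean(href))
    .map(([title, href]) => ({ title: title.trim(), article: href }));
}

// Example with stand-in data (real values come from cheerio):
const sample = [
  ['Getting Started with Node.js ', 'https://medium.com/p/abc123'],
  ['', undefined], // an element without an <h3> or a link
];
console.log(buildResults(sample));
// Only the first pair survives, with its title trimmed.
```

Keeping the extraction logic in a pure function like this also makes it easy to unit-test without any network access.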
This will scrape the blog posts for a given tag.
The full source code of the Node.js web scraper:
app.js
const cheerio = require('cheerio');
const express = require('express');
const exphbs = require('express-handlebars');
const bodyParser = require('body-parser');
const request = require('request');

const app = express();
app.use(bodyParser.json());
app.use(bodyParser.urlencoded({ extended: false }));
app.engine('handlebars', exphbs({ defaultLayout: 'main' }));
app.set('view engine', 'handlebars');

app.get('/', (req, res) => res.render('index', { layout: 'main' }));

app.get('/search', (req, res) => {
  const { tag } = req.query;
  let datas = [];
  request(`https://medium.com/search?q=${tag}`, (err, response, html) => {
    if (!err && response.statusCode === 200) {
      const $ = cheerio.load(html);
      $('.js-block').each((i, el) => {
        const title = $(el).find('h3').text();
        const article = $(el).find('.postArticle-content').find('a').attr('href');
        let data = {
          title,
          article,
        };
        datas.push(data);
      });
    }
    console.log(datas);
    res.render('list', { datas });
  });
});

app.listen(3000, () => {
  console.log('server is running on port 3000');
});
Step 4: Create Views
Next, you need to create the layouts folder: go to your nodewebscraper app, create a views folder, and inside it create a new folder named layouts.
Inside the layouts folder, create a view file named main.handlebars and add the following code to your views/layouts/main.handlebars file:
<!DOCTYPE html>
<html>
<head>
  <meta charset='UTF-8'>
  <meta name='viewport' content='width=device-width, initial-scale=1'>
  <meta http-equiv='X-UA-Compatible' content='IE=edge'>
  <link rel='stylesheet' href='https://cdnjs.cloudflare.com/ajax/libs/semantic-ui/2.4.1/semantic.min.css'>
  <title>Scraper</title>
</head>
<body>
  <div>
    {{{body}}}
  </div>
</body>
</html>
After that, create a new view file named index.handlebars in the views folder (outside the layouts folder).
nodewebscraper/views/index.handlebars
Add the following code to your index.handlebars file:
<form action='/search'>
<input type='text' name='tag' placeholder='Search...'>
<input type='submit' value='Search'>
</form>
After that, create a new view file named list.handlebars in the views folder (outside the layouts folder).
nodewebscraper/views/list.handlebars
Add the following code to your list.handlebars file:
<div>
{{#each datas}}
<a href='{{article}}'>{{title}}</a>
{{/each}}
</div>
<a href='/'>Back</a>
Step 5: Run the Development Server
npm install
npm run dev
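Note that `npm init --yes` does not create a `dev` script, so `npm run dev` will fail until you add one. A minimal sketch of the `scripts` section to add to your package.json (running the app directly with `node`; swap in nodemon if you want auto-restart on changes):

```json
{
  "scripts": {
    "dev": "node app.js"
  }
}
```

Then open http://localhost:3000 in your browser, enter a search term, and submit the form.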
Important Note:
Depending on how you use these techniques and technologies, your application could be performing illegal actions. Always check a site's terms of service before scraping it.