Scrape multiple domains with axios, cheerio and handlebars on Node.js - javascript

I am trying to make a web scraper that outputs certain data from Node.js into the JavaScript or HTML file I'm working on. It's important that the data of multiple sub-pages (which I have no code access to) can be scraped and displayed in the same HTML or JS file. The problem is that I can't output the results I get from the axios function into the global scope. If I could, my problem would be solved.
So far I have been trying to use axios to get the data I need and cheerio to modify it. I created a const named "articles" into which I push every title I need from the website I'm scraping.
const axios = require('axios')
const cheerio = require('cheerio')
const express = require('express')
const hbs = require('hbs')

const url = 'https://www.google.com/'
const articles = []

axios(url)
    .then(response => {
        const html = response.data
        const $ = cheerio.load(html)
        $('.sprite', html).parent().children('a').each(function () {
            const text = $(this).attr('title')
            articles.push({
                text
            })
        })
        console.log(articles)
        const finalArray = articles.map(a => a.text);
        console.log(finalArray)
    }).catch(err => console.log(err))
That works well so far. If I output finalArray there, I get the array I want. But once I'm outside of the axios function, the array is empty. The only way it worked for me was when I put the following code inside the axios function, but in that case I won't be able to scrape multiple websites.
console.log(finalArray) // outputs empty array

// with this function I want to get the array displayed in my home.hbs file.
app.get('/', function (req, res) {
    res.render('views/home', {
        array: finalArray
    })
})
Basically, all I need is to get finalArray into the global scope so I can use it in the app.get function to render the website with the scraped data.

There are two cases here. Either you want to re-run your scraping code on each request, or you want to run the scraping code once when the app starts and re-use the cached result.
A new scrape per incoming request:
const axios = require("axios");
const cheerio = require("cheerio");
const express = require("express");
const scrape = () =>
axios
.get("https://www.example.com")
.then(({data}) => cheerio.load(data)("h1").text());
express()
.get("/", (req, res) => {
scrape().then(text => res.json({text}));
})
.listen(3000);
Up-front, one-off request:
const scrapingResultP = axios
  .get("https://www.example.com")
  .then(({data}) => cheerio.load(data)("h1").text());

express()
  .get("/", (req, res) => {
    scrapingResultP.then(text => res.json({text}));
  })
  .listen(3000);
Result:
$ curl localhost:3000
{"text":"Example Domain"}
It's also possible to do a one-off request without a callback or promise, using a race condition to populate a variable that's in scope of both the request handlers and the scraping response handler. Realistically, the scrape will usually have resolved by the time the first request comes in, so it's common to see this:
let result;

axios
  .get("https://www.example.com")
  .then(({data}) => (result = cheerio.load(data)("h1").text()));

express()
  .get("/", (req, res) => {
    res.json({text: result});
  })
  .listen(3000);
Eliminating the race by chaining your Express routes and listener from the axios response handler:
axios.get("https://www.example.com").then(({data}) => {
const text = cheerio.load(data)("h1").text();
express()
.get("/", (req, res) => {
res.json({text});
})
.listen(3000);
});
If you have multiple requests you need to complete before you start the server, try Promise.all. Top-level await or an async IIFE can work too.
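For example, a rough sketch with placeholder URLs and a placeholder selector (swap in your own pages and extraction logic):
const urls = ["https://www.example.com", "https://www.example.org"];

Promise.all(
  urls.map(url =>
    axios.get(url).then(({data}) => cheerio.load(data)("h1").text())
  )
).then(titles => {
  express()
    .get("/", (req, res) => {
      res.json({titles});
    })
    .listen(3000);
});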
Error handling has been left as an exercise.

The problem has been resolved. I used this code instead of the plain axios.get(url) call:
axios.all(urls.map((endpoint) => axios.get(endpoint))).then(
  axios.spread(({data: user}, {data: repos}) => {
with "user", and "repos" I am now able to enter both URL data and can execute code regarding the URL i like to chose in that one function.

Related

Read data from url in javascript

I'm trying to read data stored in Google Drive, or on other cloud storage, using JavaScript.
let btn = document.getElementById('action-btn');
btn.addEventListener('click', () => {
    // let baseurl = document.getElementById('inputurl').value;
    // let guidDcoument = baseurl.split('#guid=')[1];
    // const query = encodeURIComponent('Select *')
    // fetch('', {
    //     mode: "no-cors",
    //     method: "get",
    // }).then(Response => console.log(Response))
    fetch('https://docs.google.com/spreadsheets/d/1kwfv6L2lBrPw8OjHGyhO7YHOXFNwHYyPI_noM5TUMLw/edit?pli=1#gid=1097732605', {
        mode: "no-cors"
    })
        .then((response) => { console.log(Response.error); })
        .catch(console.error())
})
What I need is that, given a URL, I can read the data from the file and then show it on my website.
I think that when I try to access any other cloud storage I will have the same problem: it will be necessary to log in to the account to be able to read the data, which would be a CSV document.
First of all, what you're trying to achieve is called 'web scraping', and you can't use fetch for that. Instead you should use Puppeteer (on the server side), which is the most popular library for web scraping.
Run npm init -y to initialize an npm project, install Puppeteer with npm i puppeteer, and install Express and CORS with npm i express cors in order to create a server that scrapes the data and sends it back to your client. So, instead of trying to scrape the information directly from the client, you do it from the server with Puppeteer.
Try the following .js server code:
const express = require('express')
const cors = require('cors')
const puppeteer = require('puppeteer')

const app = express()
app.use(express.json())
app.use(cors())

;(async () => {
    const browser = await puppeteer.launch({ headless: false })
    const page = await browser.newPage()

    app.use('/', async (req, res) => {
        await page.goto('url-to-the-website')
        const data = {} // extract whatever you need from the page here
        return res.json(data)
    })
})()

app.listen(3000)
And learn more about puppeteer: https://pptr.dev/.
And your client code should connect to this server to send scrape requests to it like this:
;(async () => {
    const res = await fetch('http://localhost:3000')
    const json = await res.json() // res.json() already parses the JSON body
    console.log(json)
})()
Note: we wrap the code in anonymous async functions (async IIFEs) in order to use the async/await syntax. More information: https://javascript.info/async-await.

Web scraping an HTML page, but need to repeat it for lots of links?

I wrote the following code to parse some part of the HTML for one URL. I mean, it parses the page const URL = 'https://www.example.com/1'.
Now I want to parse the next page, 'https://www.example.com/2', and so on, so I want to implement a for-loop of sorts here.
What is the easiest way to iterate so that the URL changes automatically (covering pages 1, 2, 3, ...) and this code runs repeatedly to parse the other pages? How can I use a for loop here?
const PORT = 8000
const axios = require('axios')
const cheerio = require('cheerio')
const express = require('express')
const app = express()
const cors = require('cors')
app.use(cors())

const url = 'https://www.example.com/1'

app.get('/', function (req, res) {
    res.json('This is my parser')
})

app.get('/results', (req, res) => {
    axios(url)
        .then(response => {
            const html = response.data
            const $ = cheerio.load(html)
            const articles = []
            $('.fc-item__title', html).each(function () {
                const title = $(this).text()
                const url = $(this).find('a').attr('href')
                articles.push({
                    title,
                    url
                })
            })
            res.json(articles)
        }).catch(err => console.log(err))
})

app.listen(PORT, () => console.log(`server running on PORT ${PORT}`))
Some considerations: if you added CORS to your app so that you can GET the data, it's useless. You add CORS when you want to SEND data, i.e. when your app is going to receive requests; CORS enables other people to use your app, so it doesn't help when you are the one trying to use other people's apps. Also, CORS problems happen only in the browser; since Node runs on the server, it will never get a CORS error.
The first problem with your code is that https://www.example.com/1, even though it works in the browser, returns a 404 Not Found error to axios, because that page doesn't really exist; only https://www.example.com would work.
I added an example using the comic site https://xkcd.com/, which accepts page numbers.
I added each axios request to an array of promises, then used Promise.all to wait for all of them.
The code gets the image link of each page:
const PORT = 8000;
const axios = require("axios");
const cheerio = require("cheerio");
const express = require("express");
const app = express();

const url = "https://xkcd.com/";

app.get("/", function (req, res) {
  res.json("This is my parser");
});

let pagesToScrap = 50;

app.get("/results", (req, res) => {
  const promisesArray = [];

  for (let pageNumber = 1; pageNumber <= pagesToScrap; pageNumber++) {
    let promise = new Promise((resolve, reject) => {
      axios(url + pageNumber)
        .then((response) => {
          const $ = cheerio.load(response.data);
          let result = $("#transcript").prev().html();
          resolve(result);
        })
        .catch((err) => reject(err));
    });
    promisesArray.push(promise);
  }

  Promise.all(promisesArray)
    .then((result) => res.json(result))
    .catch((err) => {
      res.json(err);
    });
});

app.listen(PORT, () => console.log(`server running on PORT ${PORT}`));

Why does downloading a file from a node.js server multiple times result in empty files?

I am glad to get some help.
Here is my problem:
I have built a web server with node.js that should send a csv file to the client when requested through a certain route. The csv file is created from json using the fast-csv package. The json data comes from a mongoDB and is processed with mongoose.
When I request this route once, it works fine. However, when it is requested a second time, an empty file is sent to the client. By the way, the headers reach the client correctly.
I have to restart the server to download my file again.
What did I try:
Basically, I have now lost track of everything I have tried. This behavior occurs both when using Postman and when querying via the browser.
I've tried implementing promises in my handler function.
I've tried to somehow "unsubscribe" res (but yes, that was a stupid approach).
I've tried to write the file to the fs and to send it on a second request. ...
Maybe one of you can tell at first glance what's going wrong here:
const { format } = require("@fast-csv/format");

const csvStream = format({ delimiter: ";", headers: true });

const router = express.Router();
router.route("/:collection/csv").get(requireModel, createCsv);

const csvFromDatabase = (data, res) => {
    csvStream.pipe(res);

    const processData = (data) => {
        data.forEach((row) => {
            const { _id, __v, ...newRow } = row._doc;
            csvStream.write({ ...newRow });
        });
        csvStream.end();
    };

    processData(data);
};

const createCsv = async (req, res) => {
    const { model } = req;
    const items = await model.find({});
    res.setHeader("Content-disposition", "attachment; filename=file.csv");
    res.setHeader("Content-type", "text/html; charset=UTF-8");
    csvFromDatabase(items, res);
};
Thank you very much for your help. I hope I didn't bore you with too stupid questions.
You need to recreate csvStream for each new request:
const csvFromDatabase = (data, res) => {
    const csvStream = format({ delimiter: ";", headers: true });
    csvStream.pipe(res);
    …
};
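To make the fix concrete, here is a sketch of the full helper with the stream created per call, based on the code from the question:
const { format } = require("@fast-csv/format");

const csvFromDatabase = (data, res) => {
    // create a fresh stream for every request; a stream that was ended by
    // a previous download can't be written to again
    const csvStream = format({ delimiter: ";", headers: true });
    csvStream.pipe(res);

    data.forEach((row) => {
        const { _id, __v, ...newRow } = row._doc;
        csvStream.write({ ...newRow });
    });
    csvStream.end();
};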

Modifying data from API and save in postgreSQL

I want to fetch some data from a public API and then modify it. For example, I will add a Point(long, lat) as a geography in PostGIS. However, I need to fetch the data from the public API before I get there. I have tried this so far, but the logic doesn't seem to make sense.
Inserting data into the database works fine, and I have set it up correctly. However, the problems happen when I try to do it inside the JSON callback.
require('dotenv').config()
const express = require('express')
const fetch = require('node-fetch');
const app = express();
const db = require("./db");

app.use(express.json());

async function fetchDummyJSON() {
    fetch('https://jsonplaceholder.typicode.com/todos/1')
        .then(res => res.json())
        .then((json) => {
            await db.query("INSERT INTO logictest(userId,id,title,completed) values($1,$2,$3,$4)", [json.userId + 1, json.id, json.title, json.completed])
        });
}
fetchDummyJSON()

app.get('/', (req, res) => {
    res.send('Hello World!')
});

const port = process.env.PORT || 3001
app.listen(port, () => {
    console.log(`Example app listening on port ${port}`)
})
I keep getting SyntaxError: await is only valid in async functions and the top-level bodies of modules; however, fetchDummyJSON() is an async function as far as I can tell. Is there a way to make sense of this or make it work? I am going for a PERN stack.
This is a function as well:
.then((json) => {
    await db.query("INSERT INTO logictest(userId,id,title,completed) values($1,$2,$3,$4)", [json.userId + 1, json.id, json.title, json.completed])
});
try
.then(async (json) => {
    await db.query("INSERT INTO logictest(userId,id,title,completed) values($1,$2,$3,$4)", [json.userId + 1, json.id, json.title, json.completed])
});
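Alternatively, here is a sketch of the whole function rewritten with async/await throughout (same endpoint, table, and columns as in the question; db and fetch come from the question's setup):
async function fetchDummyJSON() {
    const res = await fetch('https://jsonplaceholder.typicode.com/todos/1')
    const json = await res.json()
    await db.query(
        "INSERT INTO logictest(userId,id,title,completed) values($1,$2,$3,$4)",
        [json.userId + 1, json.id, json.title, json.completed]
    )
}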

Fetch data from API in Node.js and send to Reactjs

I am trying to fetch data from an API in Node.js and then send the data to React. I can't fetch the data directly in React because I need to provide the API key, so I am using Node.js for that. I am using axios to fetch the data. The problem is that since axios sends the data only after it has completely fetched all of it, it takes more than 3-4 seconds to display anything, which is not very good. I want to know how I can display data at intervals, or, as soon as axios has fetched some of the data, display that part while the rest keeps loading. My code on the backend is:
const express = require('express');
const axios = require('axios');
const bodyParser = require('body-parser');
const cors = require('cors');

const app = express();
app.use(cors());
app.use(express.json());
app.use(bodyParser.urlencoded({ extended: true }));

app.get('/getdata', function (req, res) {
    const fetchData = async () => {
        const data = await axios(`${APIurl}?${APIKey}`)
        res.send(data.data);
    }
    fetchData();
})

app.listen(5000);
The code for the Reactjs part is like this.
import React, { useState, useEffect } from 'react'
import axios from 'axios';

const Projects = () => {
    const [data, updateData] = useState([]);

    useEffect(() => {
        axios.get('http://localhost:5000/getdata')
            .then(res => updateData(res.data))
            .catch(error => console.log("Error"))
    }, [])

    return (
        <>
            {data.map(project => (
                <div key={project["id"]}>
                    <div>{project["name"]}</div>
                </div>
            ))}
        </>
    )
}

export default Projects
So how can I send data from the backend so that it is displayed piece by piece at intervals?
Seems like a good use case for WebSocket or maybe WebSocketStream API.
The implementation details will depend greatly on whether you want to implement any of these yourself or if you want to use some library/service (for instance, socket.io) and which of these 2 APIs suits best to your project.
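As a rough sketch only (using the ws package; the API placeholders and the idea of splitting the upstream call into several smaller requests are assumptions, not part of the original question), the server could push each chunk to the client as soon as it arrives:
const { WebSocketServer } = require('ws');
const axios = require('axios');

const APIurl = 'https://api.example.com'; // placeholder for the real API base URL
const APIKey = 'apiKey=YOUR_KEY';         // placeholder for the real key query string

const wss = new WebSocketServer({ port: 5001 });

wss.on('connection', async (socket) => {
    // hypothetical: fetch the data in several smaller requests instead of one big one
    const endpoints = [`${APIurl}/page/1?${APIKey}`, `${APIurl}/page/2?${APIKey}`];

    for (const endpoint of endpoints) {
        const { data } = await axios.get(endpoint);
        socket.send(JSON.stringify(data)); // forward each chunk as soon as it arrives
    }
    socket.close();
});
On the React side you would open new WebSocket('ws://localhost:5001') and append each incoming message to state, so the list renders progressively instead of waiting for the full response.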
