Read data from a URL in JavaScript

I'm trying to read data stored in Google Drive (or on other cloud storage) using JavaScript.
let btn = document.getElementById('action-btn');

btn.addEventListener('click', () => {
    // let baseurl = document.getElementById('inputurl').value;
    // let guidDcoument = baseurl.split('#guid=')[1];
    // const query = encodeURIComponent('Select *')
    // fetch('', {
    //     mode: "no-cors",
    //     method: "get",
    // }).then(Response => console.log(Response))
    fetch('https://docs.google.com/spreadsheets/d/1kwfv6L2lBrPw8OjHGyhO7YHOXFNwHYyPI_noM5TUMLw/edit?pli=1#gid=1097732605', {
        mode: "no-cors"
    })
        .then((response) => { console.log(response); }) // log the actual response object
        .catch(console.error) // pass the function itself, don't call it
})
What I need is: given a URL, read the data from the file and then show it on my website.
I suspect I'll hit the same problem with any other cloud storage provider: I would need to be signed in to the account to read the data, which would be a CSV document.

First of all, what you're trying to achieve is called 'web scraping', and you can't use fetch for that from the browser. Instead, you should use Puppeteer (on the server side), which is the most popular library for web scraping.
Run npm init -y to initialize an npm project, install Puppeteer with npm i puppeteer, and also install Express and CORS with npm i express cors in order to create a server that scrapes the data and sends it back to your client. So, instead of trying to scrape the information directly from the client, you do it from the server with Puppeteer.
Try the following .js server code:
const express = require('express')
const cors = require('cors')
const puppeteer = require('puppeteer')

const app = express()
app.use(express.json())
app.use(cors())

;(async () => {
    const browser = await puppeteer.launch({ headless: false })
    const page = await browser.newPage()

    // handle GET requests from the client
    app.get('/', async (req, res) => {
        await page.goto('url-to-the-website')
        // extract whatever you need from `page` here (page.evaluate, page.$eval, ...)
        const data = {}
        return res.json(data)
    })
})()

app.listen(3000)
And learn more about puppeteer: https://pptr.dev/.
Your client code should then call this server to request the scraped data, like this:
(async () => {
    const res = await fetch('http://localhost:3000')
    const json = await res.json() // res.json() already parses the body; no JSON.parse needed
    console.log(json)
})()
Note: we wrap the code in an anonymous async function (an async IIFE) so that we can use await inside it. More information: https://javascript.info/async-await.
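As a small illustration of that pattern (a sketch only, assuming the scraping server above is running on localhost:3000), the same call with basic error handling looks like this:
(async () => {
    try {
        const res = await fetch('http://localhost:3000')
        const json = await res.json()
        console.log(json)
    } catch (err) {
        // network errors and JSON parsing errors both land here
        console.error(err)
    }
})()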

Related

Scrape multiple domains with axios, cheerio and handlebars on node js

I am trying to make a web scraper that outputs certain data from Node.js into the JavaScript or HTML file I'm working on. It's important that the data of multiple sub-pages (which I have no code access to) can be scraped and displayed in the same HTML or JS file. The problem is that I can't get the results from the axios callback out into global scope; if I could, my problem would be solved.
So far I have been using axios to get the data I need and cheerio to process it. I created a const named "articles" into which I push every title I need from the website I'm scraping.
const axios = require('axios')
const cheerio = require('cheerio')
const express = require('express')
const hbs = require('hbs')

const url = 'https://www.google.com/'
const articles = []

axios(url)
    .then(response => {
        const html = response.data
        const $ = cheerio.load(html)
        $('.sprite', html).parent().children('a').each(function () {
            const text = $(this).attr('title')
            articles.push({
                text
            })
        })
        console.log(articles)
        const finalArray = articles.map(a => a.text);
        console.log(finalArray)
    }).catch(err => console.log(err))
That works well so far: if I output finalArray inside the callback I get the array I want. But once I'm outside of the axios callback the array is empty. The only way it worked for me was to put the following code inside the axios callback, but in that case I won't be able to scrape multiple websites.
console.log(finalArray) // outputs empty array

// with this function I want to get the array displayed in my home.hbs file
app.get('/', function (req, res) {
    res.render('views/home', {
        array: finalArray
    })
})
Basically, all I need is to get finalArray into global scope so I can use it in the app.get handler to render the website with the scraped data.
There are two cases here. Either you want to re-run your scraping code on each request, or you want to run the scraping code once when the app starts and re-use the cached result.
New scrape per request:
const axios = require("axios");
const cheerio = require("cheerio");
const express = require("express");

const scrape = () =>
  axios
    .get("https://www.example.com")
    .then(({data}) => cheerio.load(data)("h1").text());

express()
  .get("/", (req, res) => {
    scrape().then(text => res.json({text}));
  })
  .listen(3000);
Up-front, one-off request:
const scrapingResultP = axios
  .get("https://www.example.com")
  .then(({data}) => cheerio.load(data)("h1").text());

express()
  .get("/", (req, res) => {
    scrapingResultP.then(text => res.json({text}));
  })
  .listen(3000);
Result:
$ curl localhost:3000
{"text":"Example Domain"}
It's also possible to do a one-off request without a callback or promise, using a variable that's in scope of both the request handlers and the scraping response handler. This introduces a race condition: a request that arrives before the scrape resolves will see an undefined result. Realistically, the server should be up well before the first request arrives, though, so it's common to see this:
let result;

axios
  .get("https://www.example.com")
  .then(({data}) => (result = cheerio.load(data)("h1").text()));

express()
  .get("/", (req, res) => {
    res.json({text: result});
  })
  .listen(3000);
Eliminating the race by chaining your Express routes and listener from the axios response handler:
axios.get("https://www.example.com").then(({data}) => {
  const text = cheerio.load(data)("h1").text();

  express()
    .get("/", (req, res) => {
      res.json({text});
    })
    .listen(3000);
});
If you have multiple requests you need to complete before you start the server, try Promise.all. Top-level await or an async IIFE can work too.
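For example, a minimal sketch of the Promise.all approach (the URLs and the h1 selector are just placeholders carried over from the snippets above):
const axios = require("axios");
const cheerio = require("cheerio");
const express = require("express");

const urls = ["https://www.example.com", "https://www.example.org"];

Promise.all(
  urls.map(u =>
    axios.get(u).then(({data}) => cheerio.load(data)("h1").text())
  )
).then(texts => {
  // Both scrapes are complete before the server starts listening.
  express()
    .get("/", (req, res) => {
      res.json({texts});
    })
    .listen(3000);
});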
Error handling has been left as an exercise.
The problem has been resolved. I used this code instead of a plain axios.get(url) call:
axios.all(urls.map((endpoint) => axios.get(endpoint))).then(
  axios.spread(({data: user}, {data: repos}) => {
    // ...
  })
)
With user and repos I am now able to access the data from both URLs and can run whatever code I need for either of them inside that one function.

Web scraping an HTML page, but I need to repeat it over lots of links?

I wrote the following code to parse part of the HTML for one URL, i.e. the page const url = 'https://www.example.com/1'.
Now I want to parse the next page, 'https://www.example.com/2', and so on, so I want to implement this in a for-loop manner.
What is the easiest way to iterate here, changing the URL automatically (to cover pages 1, 2, 3, ...) and running this code repeatedly to parse the other pages? How can I use a for loop here?
const PORT = 8000
const axios = require('axios')
const cheerio = require('cheerio')
const express = require('express')
const app = express()
const cors = require('cors')
app.use(cors())

const url = 'https://www.example.com/1'

app.get('/', function (req, res) {
    res.json('This is my parser')
})

app.get('/results', (req, res) => {
    axios(url)
        .then(response => {
            const html = response.data
            const $ = cheerio.load(html)
            const articles = []
            $('.fc-item__title', html).each(function () {
                const title = $(this).text()
                const url = $(this).find('a').attr('href')
                articles.push({
                    title,
                    url
                })
            })
            res.json(articles)
        }).catch(err => console.log(err))
})

app.listen(PORT, () => console.log(`server running on PORT ${PORT}`))
Some considerations first: adding CORS to your app so that you can GET the data is useless. You enable CORS when your app is going to receive requests from browsers on other origins, i.e. when you want other people to be able to use your API; it does nothing when your app is the one calling someone else's site. Also, CORS problems only happen in the browser; since Node runs on the server, it will never get a CORS error.
The first problem with your code is that https://www.example.com/1, even though it loads in the browser, returns a 404 Not Found error to axios, because that page doesn't really exist; only https://www.example.com works.
I added an example using the comic site https://xkcd.com/, which accepts page numbers in the URL.
I added each axios request to an array of promises, then used Promise.all to wait for all of them.
The code below gets the image link for each page:
const PORT = 8000;
const axios = require("axios");
const cheerio = require("cheerio");
const express = require("express");
const app = express();

const url = "https://xkcd.com/";

app.get("/", function (req, res) {
  res.json("This is my parser");
});

let pagesToScrap = 50;

app.get("/results", (req, res) => {
  const promisesArray = [];
  for (let pageNumber = 1; pageNumber <= pagesToScrap; pageNumber++) {
    let promise = new Promise((resolve, reject) => {
      axios(url + pageNumber)
        .then((response) => {
          const $ = cheerio.load(response.data);
          let result = $("#transcript").prev().html();
          resolve(result);
        })
        .catch((err) => reject(err));
    });
    promisesArray.push(promise);
  }
  Promise.all(promisesArray)
    .then((result) => res.json(result))
    .catch((err) => {
      res.json(err);
    });
});

app.listen(PORT, () => console.log(`server running on PORT ${PORT}`));
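As a side note, axios already returns a promise, so the new Promise wrapper isn't strictly needed; a sketch of the same loop from the code above without it:
const promisesArray = [];
for (let pageNumber = 1; pageNumber <= pagesToScrap; pageNumber++) {
  // Push the axios promise chain directly; its resolved value is the extracted HTML.
  promisesArray.push(
    axios(url + pageNumber).then((response) => {
      const $ = cheerio.load(response.data);
      return $("#transcript").prev().html();
    })
  );
}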

NodeJS API response fails after fs.writeFile

I've created an API route that should write a file, then call a logging API after the file has been written, and only respond to the user once that logging call has returned.
I've simplified the code like this:
const express = require('express');
const router = express.Router();
const fetch = require("node-fetch");
const util = require('util');
const fs = require("fs-extra")

router.get('/works/', (req, res) => {
    logData(res)
})

router.get('/fails/', (req, res) => {
    let t = Date.now();
    const writeFile = util.promisify(fs.writeFile)
    writeFile(`./data/${t}.json`, 'test').then(function () {
        logData(res)
    })
})

function logData(res) {
    return new Promise(resolve => {
        fetch('https://webhook.site/44dad1a5-47f6-467b-9088-346e7222d7be')
            .then(response => response.text())
            .then(x => res.send(x));
    });
}

module.exports = router
The /works/ route works fine, but the /fails/ route fails with Error: read ECONNRESET.
OP clarified in the comments that he uses nodemon to run this code.
The problem is that nodemon watches .json files too and restarts the server, so the request that writes a JSON file fails with Error: read ECONNRESET.
To prevent nodemon from restarting the server when you change .json files, see the nodemon documentation on ignoring files.
For example, you can add nodemon.json configuration file to ignore ./data directory (make sure to restart nodemon after this file is added):
{
  "ignore": ["./data"]
}
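Alternatively, nodemon can be told to skip the directory directly on the command line with its --ignore flag (the server.js entry point here is just an assumption; use your actual start file):
nodemon --ignore ./data server.js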

Using Puppeteer to screenshot an EJS template with NodeJS and Express

I have a NodeJS/Express app in which I would like to open a new browser window and render a local EJS view into it. I am trying to do this with Puppeteer.
const puppeteer = require('puppeteer');

router.post('/new_window', async (req, res) => {
  try {
    const browser = await puppeteer.launch({headless: false});
    const page = await browser.newPage();
    const pageContent = ejs.render('../views/mypage.ejs', {})
    await page.setContent(pageContent)
    //await page.screenshot({path: 'example.png'});
    // await browser.close();
  } catch (err) {
    res.status(500)
    console.log(err)
    res.send(err.message)
  }
})
In the browser, instead of the page layout, I get:
../views/mypage.ejs
Instead of:
await page.goto(...); // This code is acting like your browser's address bar
Try
const pageContent = ejs.render('../views/mypage.ejs', {data to populate your .ejs page}) // This is pseudo-code; check the EJS docs on how to do this
await page.setContent(pageContent)
The code above will let you create your page on your server.
With page.setContent(..) you can load any string of HTML.
OP made an edit that correctly uses page.setContent rather than page.goto; however, there's still an issue. ejs.render() runs EJS on a template provided as a string, so here it's treating the file path itself as the template. If you read the file into a string first (possibly when your app starts, if the template never changes), ejs.render() will work.
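A rough sketch of that approach (the template path is the one from the question; the data object is a placeholder):
const fs = require("fs");
const ejs = require("ejs");

// Read the template once, e.g. at startup.
const template = fs.readFileSync("../views/mypage.ejs", "utf8");

// Later, inside the async route handler, render and hand the HTML to Puppeteer:
const pageContent = ejs.render(template, {/* data for the template */});
await page.setContent(pageContent);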
The other approach is to use the EJS method that accepts a path, ejs.renderFile(). Here's a minimal example showing the usage:
const ejs = require("ejs"); // 3.1.8
const express = require("express"); // ^4.18.1
const puppeteer = require("puppeteer"); // ^19.1.0

express()
  .get("/greet", (req, res) => {
    let browser;
    (async () => {
      browser = await puppeteer.launch();
      const [page] = await browser.pages();
      const html = await ejs.renderFile("greet.ejs", {name: "world"});
      await page.setContent(html);
      const buf = await page.screenshot();
      res.contentType("image/png");
      res.send(buf);
    })()
      .catch(err => {
        console.error(err);
        res.sendStatus(500);
      })
      .finally(() => browser?.close());
  })
  .listen(3000);
Where greet.ejs contains:
<!DOCTYPE html>
<html>
<body>
<h1>Hello, <%- name %></h1>
</body>
</html>
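To try it out (assuming the server file and greet.ejs are in the same working directory), a request like the following saves the screenshot locally:
curl http://localhost:3000/greet --output greet.png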
To make a PDF with Express, EJS and Puppeteer, see Express/Puppeteer: generate PDF from EJS template and send as response.
To reuse the browser across routes, see this answer for a possible approach.

Why is the fetch function saying that I have to use absolute URLs, even though I have set a proxy?

At the moment I am coding a Shopify application. I want to fetch all the products from my store in server.js, but every time it outputs a message saying that only absolute URLs are supported. A registered webhook should get all the products inside my shop.
Error: only absolute urls are supported
Here is my JavaScript (server.js):
const { default: proxy } = require('@shopify/koa-shopify-graphql-proxy');
const { ApiVersion } = require('@shopify/koa-shopify-graphql-proxy');

app.prepare().then(() => {
  const server = new Koa();
  const router = new Router();
  server.use(session(server));
  server.keys = [/** Shopify Keys */];
  server.use(
    createShopifyAuth({
      /**
       * Webhook
       */
    }),
  );

  const webhook = receiveWebhook({ secret: SHOPIFY_API_SECRET_KEY });

  router.post('/webhooks/products/create', webhook, async (ctx) => {
    await fetch('/graphql', {
      credentials: 'include',
      body: allProducts
    })
      .then((data) => {
        console.log(data)
      })
      .catch((err) => {
        console.log(err)
      })
    console.log('received Webhook: ', ctx.state.webhook);
  })

  server.use(router.allowedMethods());
  server.use(router.routes());

  console.log(proxy({ version: ApiVersion.Unstable }))
  server.use(proxy({ version: ApiVersion.Unstable }))

  server.listen(port, () => {
    console.log(`> Ready on localhost:${port}`)
  })
})
I was following the example from the npm package @shopify/koa-shopify-graphql-proxy.
How can I send an HTTP request through the proxy I am using?
The issue is exactly what the error says: fetch requires absolute URLs.
Whether you have a proxy or not is irrelevant to the fetch API; it doesn't know anything about that.
Just give it an absolute URL.
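For example, a sketch only (the protocol and host are assumptions; node-fetch on the server has no base URL to resolve '/graphql' against, so the full address has to be spelled out):
// Build an absolute URL; ctx.host is the incoming request's host in Koa.
await fetch(`https://${ctx.host}/graphql`, {
  method: 'POST',
  credentials: 'include',
  headers: { 'Content-Type': 'application/json' },
  body: allProducts,
})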
