Web scraping using JavaScript

I want to make a site (with HTML, CSS, and JavaScript) that will scrape data from other sites. I want to use JavaScript to accomplish that. What is the best way to do it? I would like to avoid using Node.js or some other framework.

If you are getting a CORS error, you can route the request through a CORS proxy such as CORS Anywhere.
For DOM parsing, use DOMParser.
Example:
fetch(`https://cors-anywhere.herokuapp.com/${url}`)
  .then(response => response.text())
  .then(html => {
    const parser = new DOMParser()
    const dom = parser.parseFromString(html, 'text/html')
    // now you can select elements just like on the normal document
    dom.querySelector('div')
  })

Related

Web parser in JavaScript like BeautifulSoup in Python

I came from Python, where with Beautiful Soup you can parse an entire HTML tree without making GET requests to external web pages. I'm looking for the same in JavaScript, but I've only found jsdom and jssoup (which seems unused), and if I'm correct, they only allow you to make requests.
I want a JavaScript library that allows me to parse an entire HTML tree without getting CORS policy errors, that is, without making a request, just parsing it.
How can I do this?
In a browser context, you can use DOMParser:
const html = "<h1>title</h1>";
const parser = new DOMParser();
const parsed = parser.parseFromString(html, "text/html");
console.log(parsed.firstChild.innerText); // "title"
and in node you can use node-html-parser:
import { parse } from 'node-html-parser';
const html = "<h1>title</h1>";
const parsed = parse(html);
console.log(parsed.firstChild.innerText); // "title"
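If you need to select a specific node instead of walking firstChild, node-html-parser also supports CSS selectors. A minimal sketch, reusing the same html string as above:

import { parse } from 'node-html-parser';

const html = "<h1>title</h1>";
const parsed = parse(html);
// querySelector works much like in the browser; .text returns the element's text content
console.log(parsed.querySelector("h1").text); // "title"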

How to get all links from a website with puppeteer

Well, I would like a way to use Puppeteer and a for loop to get all the links on a site and add them to an array. In this case, the links I want are not just links that are in HTML tags; they are links that appear directly in the source code, links to JavaScript files, etc. I want something like this:
array = [];
for (L in links) {
  array.push(L);
  // The code should take all the links and add them to the array
}
But how can I get all references to JavaScript and style files, and all URLs that are in the source code of a website?
I have only found posts and questions that show how to get the links from <a> tags, not all the links from the source code.
Suppose you want to get all the tags on this page, for example:
view-source:https://www.nike.com/
How can I get all the script tags and print them to the console? I put view-source:https://nike.com because that is where you can see the script tags. I don't know if you can do it without displaying the source code, but displaying it and grabbing the script tags was the idea I had; I just don't know how to do it.
It is possible to get all links from a URL using only Node.js, without Puppeteer.
There are two main steps:
Get the source code for the URL.
Parse the source code for links.
Simple implementation in node.js:
// get-links.js

///
/// Step 1: Request the URL's HTML source.
///
const axios = require('axios');
const promise = axios.get('https://www.nike.com');

// Extract the HTML source from the response, then process it:
promise.then(function (response) {
  const htmlSource = response.data;
  getLinksFromHtml(htmlSource);
});

///
/// Step 2: Find links in the HTML source.
///
// This function takes HTML (as a string) and outputs all the links within it.
function getLinksFromHtml(htmlString) {
  // Regular expression that matches the syntax of a link (https://stackoverflow.com/a/3809435/117030):
  const LINK_REGEX = /https?:\/\/(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&//=]*)/gi;

  // Use the regular expression from above to find all the links:
  const matches = htmlString.match(LINK_REGEX);

  // Output to console:
  console.log(matches);

  // Alternatively, return the array of links for further processing:
  return matches;
}
Sample usage:
$ node get-links.js
[
'http://www.w3.org/2000/svg',
...
'https://s3.nikecdn.com/unite/scripts/unite.min.js',
'https://www.nike.com/android-icon-192x192.png',
...
'https://connect.facebook.net/',
... 658 more items
]
Notes:
I used the axios library for simplicity and to avoid "access denied" errors from nike.com. It is possible to use any other method to get the HTML source, for example:
Native Node.js http/https libraries (a sketch is shown after this list)
Puppeteer (Get complete web page source html with puppeteer - but some part always missing)
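As a rough sketch of the first alternative, using only the built-in https module (as noted above, some sites may reject a bare request unless you also set request headers) and reusing the getLinksFromHtml function defined earlier:

const https = require('https');

https.get('https://www.nike.com', (response) => {
  let htmlSource = '';
  // the body arrives in chunks; collect them into one string
  response.on('data', (chunk) => { htmlSource += chunk; });
  // once the response is complete, reuse the link-extraction function from above
  response.on('end', () => getLinksFromHtml(htmlSource));
}).on('error', (err) => console.error(err));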
Although the other answers are applicable in many situations, they will not work for client-side rendered sites. For instance, if you just make an Axios request to Reddit, all you'll get is a couple of divs with some metadata. Because Puppeteer actually loads the page and runs all of its JavaScript in a real browser, the website's choice of document rendering becomes irrelevant for extracting page data.
Puppeteer has an evaluate method on the page object, which allows you to run JavaScript directly on the page. Using that, you can easily extract all links as follows:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  const pageUrls = await page.evaluate(() => {
    const urlArray = Array.from(document.links).map((link) => link.href);
    const uniqueUrlArray = [...new Set(urlArray)];
    return uniqueUrlArray;
  });
  console.log(pageUrls);
  await browser.close();
})();
Yes, you can get all the script tags and their links without opening view-source.
You need to add the jsdom library as a dependency in your project and then pass the HTML response to its constructor, as below.
Here is the code:
const axios = require('axios');
const jsdom = require("jsdom");

(async () => {
  // make a simple HTTP request using axios (or node-fetch, as you wish)
  const nikePageResponse = await axios.get('https://www.nike.com');

  // now parse this response into an HTML document using the jsdom library
  const dom = new jsdom.JSDOM(nikePageResponse.data);
  const nikePage = dom.window.document;

  // now get all the script tags by querying this page
  const scriptLinks = [];
  nikePage.querySelectorAll('script[src]').forEach(script => scriptLinks.push(script.src.trim()));
  console.debug('%o', scriptLinks);
})();
Here I have used a CSS selector for <script> tags that have a src attribute.
You could write the same code using Puppeteer, but it would take extra time to open the browser and everything before getting the page source.
You can use this to find the links and then do whatever you want with them, using Puppeteer or anything else.
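For completeness, a rough Puppeteer equivalent of the jsdom snippet above (same script[src] selector, run inside the page with page.evaluate) might look like this:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.nike.com');

  // run the same script[src] query inside the real browser page
  const scriptLinks = await page.evaluate(() =>
    Array.from(document.querySelectorAll('script[src]')).map((script) => script.src)
  );

  console.log(scriptLinks);
  await browser.close();
})();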

How to get another HTML file as if getting it by document, without using jQuery

How to get another HTML file as if getting it by document.
Basically, the same 'document' instance type used for document.getElementsByClassName(), but instead of getting the document the JavaScript code is in, it gets another .html file on the same domain, for example "blog.html".
Here's what I want it to hypothetically look like:
var blogdocument = getDocument("blog.html");
var blogposts = blogdocument.getElementsByClassName("blogpost");
If it's on the same domain, you can make a network request to get the text of the response back, then send it through DOMParser to construct a document from the text:
fetch('./blog.html')
  .then(res => res.text())
  .then((result) => {
    const doc = new DOMParser().parseFromString(result, 'text/html');
    const posts = doc.querySelectorAll('.blogpost');
    // ...
  })
  // .catch(handleErrors);
Just wanted to add a suggestion to the answer above. For
const doc = new DOMParser();
you may not want to use const, because you should set
doc = null;
to actively release memory after you are done with it. DOMParser is very expensive, and I have encountered serious performance issues because I didn't do this (see: DOM Parser Chrome extension memory leak).
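A minimal sketch of that suggestion (the HTML string and class name are just placeholders):

let parser = new DOMParser();
let doc = parser.parseFromString('<div class="blogpost">hello</div>', 'text/html');
const posts = doc.querySelectorAll('.blogpost');
// ... work with posts ...

// drop the references once finished so the parsed document can be garbage collected
doc = null;
parser = null;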
An alternative to a fetch or AJAX request is to load an iframe with its source set to the URL, read from its contentWindow, and then remove it from the DOM. Note that this only works for same-origin pages.
var desiredPage = "blog.html"; // must be same-origin, or the browser will block contentWindow access
var ifrm = document.createElement("iframe");
ifrm.setAttribute("src", desiredPage);
document.body.appendChild(ifrm);
setTimeout(function () {
  console.log(ifrm.contentWindow.document.getElementsByTagName("somestuff"));
  ifrm.remove();
}, 100);
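A slightly more robust variant of the same idea (a sketch; it still only works for same-origin pages) waits for the iframe's load event instead of a fixed timeout:

var ifrm = document.createElement("iframe");
ifrm.src = "blog.html"; // same-origin page, as above
ifrm.style.display = "none";
ifrm.addEventListener("load", function () {
  var posts = ifrm.contentWindow.document.getElementsByClassName("blogpost");
  console.log(posts);
  ifrm.remove();
});
document.body.appendChild(ifrm);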

Do we have to generate Angular Web Workers in a separate file?

The Angular documentation says to do this:
ng generate web-worker location
And that works great. I'm just curious whether we have to generate the worker in a separate file, or can we just create one in a service:
const worker = new Worker();
worker.addEventListener('message', ({ data }) => {
  const response = `worker response to ${data}`;
  postMessage(response);
});
Thoughts?
It's not due to Angular, it is due to Worker itself: yes, you are forced to use a separate file.
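For reference, a minimal sketch of the two-file pattern this implies (the file name ./my.worker.js is just an example; ng generate web-worker produces a roughly equivalent worker file for you):

// my.worker.js – the worker code has to live in its own file
addEventListener('message', ({ data }) => {
  const response = `worker response to ${data}`;
  postMessage(response);
});

// in your service or component: point the Worker constructor at that file
// (the new URL(..., import.meta.url) form is what the Angular CLI bundler resolves)
const worker = new Worker(new URL('./my.worker', import.meta.url));
worker.onmessage = ({ data }) => console.log(data);
worker.postMessage('hello');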
Hope this helps.

Access a GitHub HTML file using JavaScript

I have this HTML file on GitHub, which I want to access using JavaScript. I tried the following, but it didn't work (I'm using p5.js, so the setup function is basically the onload function):
var htmlfile = "[URL THAT POINTS TO HTML FILE]";

function setup() {
  console.log(htmlfile.getElementById('id'));
}
Is there any way to do this? Preferably, only plain JavaScript and p5 should be used.
As said in the comments, sending a request to the raw GitHub page will probably be the best way to get the HTML you want. Here is an example using fetch:
document.addEventListener('DOMContentLoaded', getHTML);

function getHTML() {
  fetch('https://gist.githubusercontent.com/chrisvfritz/bc010e6ed25b802da7eb/raw/18eaa48addae7e3021f6bcea03b7a6557e3f0132/index.html')
    .then((res) => {
      return res.text();
    })
    .then((data) => {
      document.write(data);
      // get the p tag from the remote html file now that we have the html in our document
      var firstParagraph = document.getElementsByTagName("p")[0];
      console.log(firstParagraph.textContent);
    });
}
