Web parser in JavaScript like BeautifulSoup in Python

I come from Python, where with Beautiful Soup you can parse an entire HTML tree without sending GET requests to external web pages. I'm looking for the same thing in JavaScript, but I've only found jsdom and jssoup (which seems unmaintained), and if I understand correctly, they only let you make requests.
I want a JS library that lets me parse an entire HTML tree without hitting CORS policy errors, that is, without making any request, just parsing.
How can I do this?

In a browser context, you can use DOMParser:
const html = "<h1>title</h1>";
const parser = new DOMParser();
const parsed = parser.parseFromString(html, "text/html");
// parseFromString returns a full Document, so query it like one:
console.log(parsed.querySelector("h1").textContent); // "title"
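If you want the BeautifulSoup-style "find all" behaviour, the parsed document supports the usual selector API; a small sketch with a made-up fragment:
const doc = new DOMParser().parseFromString(
  '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>',
  'text/html'
);
// Rough equivalent of BeautifulSoup's find_all("a"):
const hrefs = [...doc.querySelectorAll('a')].map(a => a.getAttribute('href'));
console.log(hrefs); // ["/a", "/b"]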
and in node you can use node-html-parser:
import { parse } from 'node-html-parser';
const html = "<h1>title</h1>";
const parsed = parse(html);
// node-html-parser exposes text content via the .text getter:
console.log(parsed.firstChild.text); // "title"
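node-html-parser also supports CSS selectors on the parsed tree, so the same "find all" pattern works in Node without a browser; a minimal sketch:
import { parse } from 'node-html-parser';

const root = parse('<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>');
// querySelectorAll returns a plain array of matched elements:
const hrefs = root.querySelectorAll('a').map(a => a.getAttribute('href'));
console.log(hrefs); // [ '/a', '/b' ]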

Related

How to get all links from a website with puppeteer

Well, I would like a way to use Puppeteer and a for loop to get all the links on a site and add them to an array. In this case, the links I want are not just links in HTML tags; they are links that appear directly in the source code, JavaScript file links, etc. I want something like this:
const array = [];
for (const link of links) {
  // The code should take all the links and add them to the array
  array.push(link);
}
But how can I get all references to JavaScript and style files, and all URLs that appear in the source code of a website?
I've only found posts and questions that show how to get links from HTML tags, not all the links in the source code.
Suppose you want to get all the tags on this page, for example:
view-source:https://www.nike.com/
How can I get all the script tags and print them to the console? I wrote view-source:https://nike.com because that way you can see the script tags. I don't know if you can do it without displaying the source code, but displaying the source and grabbing the script tags was the idea I had; I just don't know how to do it.
It is possible to get all links from a URL using only Node.js, without Puppeteer.
There are two main steps:
1. Get the source code for the URL.
2. Parse the source code for links.
A simple implementation in Node.js:
// get-links.js

///
/// Step 1: Request the URL's HTML source.
///
const axios = require('axios');

const promise = axios.get('https://www.nike.com');

// Extract the HTML source from the response, then process it:
promise.then(function (response) {
  const htmlSource = response.data;
  getLinksFromHtml(htmlSource);
});

///
/// Step 2: Find links in the HTML source.
///

// This function takes HTML (as a string) and outputs all the links within it.
function getLinksFromHtml(htmlString) {
  // Regular expression that matches the syntax of a link (https://stackoverflow.com/a/3809435/117030):
  const LINK_REGEX = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/gi;
  // Use the regular expression from above to find all the links:
  const matches = htmlString.match(LINK_REGEX);
  // Output to console:
  console.log(matches);
  // Alternatively, return the array of links for further processing:
  return matches;
}
Sample usage:
$ node get-links.js
[
'http://www.w3.org/2000/svg',
...
'https://s3.nikecdn.com/unite/scripts/unite.min.js',
'https://www.nike.com/android-icon-192x192.png',
...
'https://connect.facebook.net/',
... 658 more items
]
Notes:
I used the axios library for simplicity and to avoid "access denied" errors from nike.com. It is possible to use any other method to get the HTML source, such as:
Native Node.js http/https modules (see the sketch below)
Puppeteer (Get complete web page source html with puppeteer - but some part always missing)
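For instance, a minimal sketch using only the built-in https module, reusing getLinksFromHtml from above (as noted, nike.com may refuse requests that lack browser-like headers, so treat this as illustrative):
const https = require('https');

https.get('https://www.nike.com', (res) => {
  let htmlSource = '';
  // The body arrives in chunks; accumulate them into one string.
  res.on('data', (chunk) => { htmlSource += chunk; });
  res.on('end', () => getLinksFromHtml(htmlSource));
});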
Although the other answers are applicable in many situations, they will not work for client-side rendered sites. For instance, if you just make an axios request to Reddit, all you'll get is a couple of divs with some metadata. Because Puppeteer actually loads the page and runs all its JavaScript in a real browser, the website's choice of rendering strategy becomes irrelevant for extracting page data.
Puppeteer has an evaluate method on the page object which allows you to run JavaScript directly on the page. Using that, you can easily extract all links as follows:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  const pageUrls = await page.evaluate(() => {
    const urlArray = Array.from(document.links).map((link) => link.href);
    const uniqueUrlArray = [...new Set(urlArray)];
    return uniqueUrlArray;
  });
  console.log(pageUrls);
  await browser.close();
})();
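As a design note, document.links collects every <a> and <area> element that has an href attribute, and the Set strips duplicate URLs before the array is returned to Node.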
Yes, you can get all the script tags and their links without opening view-source.
You need to add a dependency on the jsdom library to your project and then pass the HTML response to a jsdom instance, as below.
Here is the code:
const axios = require('axios');
const jsdom = require("jsdom");

(async () => {
  // Make a simple HTTP request using axios (or node-fetch, as you wish):
  const nikePageResponse = await axios.get('https://www.nike.com');
  // Now parse this response into an HTML document using the jsdom library:
  const dom = new jsdom.JSDOM(nikePageResponse.data);
  const nikePage = dom.window.document;
  // Now get all the script tags by querying this page:
  const scriptLinks = [];
  nikePage.querySelectorAll('script[src]').forEach(script => scriptLinks.push(script.src.trim()));
  console.debug('%o', scriptLinks);
})();
Here I used a CSS selector for <script> tags that have a src attribute.
You could write the same code with Puppeteer, but it would spend time launching the browser, loading the page, and only then getting its page source.
You can use this approach to find the links and then do whatever you want with them, with Puppeteer or anything else.
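For comparison, a sketch of the same script-tag extraction with Puppeteer (the slower route mentioned above):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.nike.com');
  // Collect the src of every <script src=...> in the rendered page:
  const scriptLinks = await page.evaluate(() =>
    Array.from(document.querySelectorAll('script[src]'), script => script.src)
  );
  console.log(scriptLinks);
  await browser.close();
})();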

Web scraping using javascript

I want to make a site (with HTML, CSS, and JavaScript) that will scrape data from other sites, and I want to use JavaScript to accomplish that. What is the best way to do it? I would like to avoid using Node.js or any other framework.
If you are getting a CORS error, just use CORS Anywhere.
For DOM parsing, use DOMParser.
Example:
// url is the page you want to fetch
fetch(`https://cors-anywhere.herokuapp.com/${url}`)
  .then(response => response.text())
  .then(html => {
    const parser = new DOMParser();
    const dom = parser.parseFromString(html, 'text/html');
    // now you can select elements like on a normal document
    dom.querySelector('div');
  });
Do you have any other problems?

Using XPath in node.js

I am building a little document parser in node.js. To test it, I have a raw HTML file that would normally be downloaded from the real website when the application runs.
I want to extract the first code example from each Console.WriteLine section that matches my constraint: it has to be written in C#. To do that, I have this sample XPath:
//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::div/following-sibling::div/pre[position()>1]/code[contains(@class,'lang-csharp')]
If I test the XPath online, I get the expected results, which are in this Gist.
In my node.js application, I am using xmldom and xpath to try to parse that exact same information out:
var xpath = require('xpath');
var dom = require('xmldom').DOMParser;

var exampleLookup = `//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::div/following-sibling::div/pre[position()>1]/code[contains(@class,'lang-csharp')]`;
var doc = new dom().parseFromString(rawHtmlString, 'text/html');
var sampleNodes = xpath.select(exampleLookup, doc);
This does not return anything, however.
What might be going on here?
This is most likely caused by the default namespace (xmlns="http://www.w3.org/1999/xhtml") in your HTML (XHTML).
Looking at the xpath docs, you should be able to bind the namespace to a prefix using useNamespaces and use the prefix in your xpath (untested)...
var exampleLookup = `//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::x:div/following-sibling::x:div/x:pre[position()>1]/x:code[contains(@class,'lang-csharp')]`;
var doc = new dom().parseFromString(rawHtmlString, 'text/html');
var select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
var sampleNodes = select(exampleLookup, doc);
Instead of binding the namespace to a prefix, you could also use local-name() in your XPath, but I wouldn't recommend it. This is also covered in the docs.
Example...
//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::*[local-name()='div']/following-sibling::*[local-name()='div']/*[local-name()='pre'][position()>1]/*[local-name()='code'][contains(@class,'lang-csharp')]
There is a library, xpath-html, that can help you use XPath to query HTML with minimal effort and few lines of code.
const fs = require("fs");
const html = fs.readFileSync(`${__dirname}/shopback.html`, "utf8");
const xpath = require("xpath-html");
const node = xpath.fromPageSource(html).findElement("//*[contains(text(), 'with love')]");
console.log(`The matched tag name is "${node.getTagName()}"`);
console.log(`Your full text is "${node.getText()}"`);

How to parse HTML data from JSON and render it in UIWebView (using Swift 3.0)

I'd like to use JSON2HTML to parse the HTML data from JSON and render it in a UIWebView (using Swift 3.0). Please let me know how to achieve this. Thanks in advance!
Here's what I've tried:
let jsfile1 = try! String(contentsOfFile: Bundle.main.path(forResource: "json2html", ofType: "js")!)

func loadJS()
{
    var getData = {}
    var context = JSContext()
    var valSwiftyJson: JSON = [:]
    var test = context?.evaluateScript(jsfile1)
    let testFunction = test?.objectForKeyedSubscript("json2html")
    let urlString = // Have removed the URL string due to restrictions
    Alamofire.request(urlString, encoding: JSONEncoding.default).responseJSON
    { response in
        if let alamoJson = response.result.value
        {
            let swiftyJson = JSON(data: response.data!)
            valSwiftyJson = swiftyJson["FormInfo"]["Form"]
            print(valSwiftyJson)
        }
    }
    let result = testFunction?.call(withArguments: [getData, valSwiftyJson])
    webView.loadHTMLString((result?.toString())!, baseURL: nil)
}
Finally, I managed to solve the issue by creating an index.html file (stored locally) that references the JSON2HTML library. I then add the JSON (with HTML inside) content to it dynamically each time I need to convert JSON to HTML. At last, I load the final index.html in the UIWebView (it worked like a charm).
Are you talking about this library, JSON2HTML? If so, I don't think there is a library for translating JSON elements to HTML in Swift.
Do you plan to download the JSON elements from a back end? Then, since there is a Node.js wrapper for JSON2HTML, I would recommend doing the translation from JSON to HTML on that server. You would just download the compiled HTML data, and rendering it in the UIWebView would be as easy as this line of code (in Swift 3):
// html is the HTML data downloaded from your back-end
webView.loadHTMLString(html, baseURL: nil)
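A rough sketch of that server-side idea, assuming an Express app and the node-json2html wrapper (the template shape and the transform call below are illustrative assumptions; check the wrapper's docs for the exact API):
const express = require('express');
const json2html = require('node-json2html'); // assumed Node wrapper for JSON2HTML

const app = express();
app.use(express.json());

app.post('/render', (req, res) => {
  // Illustrative JSON2HTML template: render each item's "content" field in a div.
  const template = { '<>': 'div', 'html': '${content}' };
  // transform(data, template) is assumed from node-json2html's documented usage.
  res.send(json2html.transform(req.body, template));
});

app.listen(3000);
The iOS app would then POST its JSON to /render and load the returned HTML string into the UIWebView as shown above.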

Converting multiple files into HTML (from Markdown)?

I'm currently working on a small project in which I want to convert a couple (or more) Markdown files into HTML and then append them to the main document. I want all of this to take place client-side. I have chosen a few libraries: Showdown (Markdown to HTML converter), jQuery (overall DOM manipulation), and Underscore (for simple templating if necessary). I'm stuck where I can't seem to convert a file into HTML (into a string that contains HTML).
Converting Markdown into HTML is simple enough:
var converter = new Showdown.converter();
converter.makeHtml('#hello markdown!');
I'm not sure how to fetch (download) a file into my code as a string.
How do I fetch a file from a URL (the URL points to a Markdown file), pass it through Showdown, and get back an HTML string? I'm only using JavaScript, by the way.
You can get an external file and read it into a string with Ajax. The jQuery way is cleaner, but a vanilla JS version might look something like this:
var mdFile = new XMLHttpRequest();
mdFile.open("GET", "http://mypath/myFile.md", true);
mdFile.onreadystatechange = function () {
  // Make sure the document exists and is ready to parse.
  if (mdFile.readyState === 4 && mdFile.status === 200) {
    var mdText = mdFile.responseText;
    var converter = new showdown.Converter();
    var htmlText = converter.makeHtml(mdText);
    // Do whatever you want with the HTML text
  }
};
jQuery method:
$.ajax({
  url: "info.md",
  context: document.body,
  success: function (mdText) {
    // mdText is the text returned by the ajax call
    var converter = new showdown.Converter();
    var htmlText = converter.makeHtml(mdText);
    $(".outputDiv").append(htmlText); // append this to a div with class outputDiv
  }
});
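A modern fetch-based equivalent, under the same assumption that info.md is served from your own origin:
fetch('info.md')
  .then(response => response.text())
  .then(mdText => {
    const converter = new showdown.Converter();
    // Convert the Markdown and append the HTML to the output div:
    document.querySelector('.outputDiv').innerHTML += converter.makeHtml(mdText);
  });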
Note: This assumes the files you want to parse are on your own server. If the files are on the client (i.e. user files), you'll need to take a different approach.
Update
The above methods will work if the files you want are on the same server as your page. If they are NOT, then you will have to look into CORS if you control the remote server, or a server-side solution if you do not. This question provides some relevant background on cross-domain requests.
Once you have the HTML string, you can append it to whatever DOM element you wish by simply calling:
var myElement = document.getElementById('myElement');
myElement.innerHTML += markdownHTML;
...where markdownHTML is the HTML string returned by makeHtml.
