I am building a small document parser in Node.js. To test it, I have a raw HTML file, which is normally downloaded from the real website when the application runs.
I want to extract the first code example from each section of the Console.WriteLine page that matches my constraint: it has to be written in C#. To do that, I have this sample XPath:
//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::div/following-sibling::div/pre[position()>1]/code[contains(@class,'lang-csharp')]
If I test the XPath online, I get the expected results, which are in this Gist.
In my node.js application, I am using xmldom and xpath to try and parse that exact same information out:
var exampleLookup = `//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::div/following-sibling::div/pre[position()>1]/code[contains(@class,'lang-csharp')]`;
var doc = new dom().parseFromString(rawHtmlString, 'text/html');
var sampleNodes = xpath.select(exampleLookup,doc);
This does not return anything, however.
What might be going on here?
This is most likely caused by the default namespace (xmlns="http://www.w3.org/1999/xhtml") in your HTML (XHTML).
Looking at the xpath docs, you should be able to bind the namespace to a prefix using useNamespaces and use the prefix in your xpath (untested)...
var exampleLookup = `//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::x:div/following-sibling::x:div/x:pre[position()>1]/x:code[contains(@class,'lang-csharp')]`;
var doc = new dom().parseFromString(rawHtmlString, 'text/html');
// Bind the XHTML namespace to the prefix "x", then query with the function that useNamespaces returns
var select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
var sampleNodes = select(exampleLookup, doc);
Instead of binding the namespace to a prefix, you could also use local-name() in your XPath, but I wouldn't recommend it. This is also covered in the docs.
Example...
//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::*[local-name()='div']/following-sibling::*[local-name()='div']/*[local-name()='pre'][position()>1]/*[local-name()='code'][contains(@class,'lang-csharp')]
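Writing the local-name() steps by hand is error-prone, so here is a minimal sketch of building the same expression from parts. The helper name localNameStep and its parameters are made up for illustration; they are not part of the xpath package.

```javascript
// Hypothetical helper: build one XPath step that matches an element by its
// local name only, so the document's default namespace does not matter.
function localNameStep(name, axis = '', predicate = '') {
  const prefix = axis ? `${axis}::` : '';
  return `${prefix}*[local-name()='${name}']${predicate}`;
}

const anchorId = 'System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_';

// Join the steps with "/" to get the full location path.
const exampleLookup = [
  `//*[@id='${anchorId}']`,
  localNameStep('div', 'parent'),
  localNameStep('div', 'following-sibling'),
  localNameStep('pre', '', '[position()>1]'),
  localNameStep('code', '', "[contains(@class,'lang-csharp')]"),
].join('/');

console.log(exampleLookup);
```

The resulting string can be passed to xpath.select as is, without useNamespaces, because no step names an element directly.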
There is a library, xpath-html, that can help you use XPath to query HTML with minimal effort and few lines of code.
const fs = require("fs");
const html = fs.readFileSync(`${__dirname}/shopback.html`, "utf8");
const xpath = require("xpath-html");
const node = xpath.fromPageSource(html).findElement("//*[contains(text(), 'with love')]");
console.log(`The matched tag name is "${node.getTagName()}"`);
console.log(`Your full text is "${node.getText()}"`);
I come from Python, where with Beautiful Soup you can parse an entire HTML tree without making GET requests to external web pages. I'm looking for the same in JavaScript, but I've only found jsdom and JSSoup (which seems unused), and if I'm correct, they only allow you to make requests.
I want a JavaScript library that lets me parse an entire HTML tree without getting CORS policy errors, that is, without making a request, just parsing.
How can I do this?
In a browser context, you can use DOMParser:
const html = "<h1>title</h1>";
const parser = new DOMParser();
const parsed = parser.parseFromString(html, "text/html");
console.log(parsed.firstChild.innerText); // "title"
and in node you can use node-html-parser:
import { parse } from 'node-html-parser';
const html = "<h1>title</h1>";
const parsed = parse(html);
console.log(parsed.firstChild.innerText); // "title"
I would like to use Puppeteer and a for loop to get all the links on a site and add them to an array. The links I want are not the ones in HTML tags; they are links that appear directly in the source code, such as JavaScript file links. I want something like this:
const array = [];
for (const link of links) {
  // The code should take all the links and add them to the array
  array.push(link);
}
But how can I get all references to JavaScript and style files, and all URLs that are in the source code of a website?
I have only found a post and a question that show how to get the links from tags, not all the links from the source code.
Supposing you want to get all the tags on this page for example:
view-source:https://www.nike.com/
How can I get all script tags and print them to the console? I used view-source:https://nike.com because you can see the script tags there. I don't know whether this can be done without displaying the source code; displaying it and reading the script tags was the only idea I had, but I do not know how to do it.
It is possible to get all links from a URL using only node.js, without puppeteer:
There are two main steps:
Get the source code for the URL.
Parse the source code for links.
Simple implementation in node.js:
// get-links.js
///
/// Step 1: Request the URL's html source.
///
const axios = require('axios');
const promise = axios.get('https://www.nike.com');
// Extract html source from response, then process it:
promise.then(function(response) {
  const htmlSource = response.data;
  getLinksFromHtml(htmlSource);
});
///
/// Step 2: Find links in HTML source.
///
// This function takes HTML (as a string) and outputs all the links within.
function getLinksFromHtml(htmlString) {
  // Regular expression that matches syntax for a link (https://stackoverflow.com/a/3809435/117030):
  const LINK_REGEX = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/gi;
  // Use the regular expression from above to find all the links:
  const matches = htmlString.match(LINK_REGEX);
  // Output to console:
  console.log(matches);
  // Alternatively, return the array of links for further processing:
  return matches;
}
Sample usage:
$ node get-links.js
[
'http://www.w3.org/2000/svg',
...
'https://s3.nikecdn.com/unite/scripts/unite.min.js',
'https://www.nike.com/android-icon-192x192.png',
...
'https://connect.facebook.net/',
... 658 more items
]
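Since the question asks specifically about JavaScript and style files, the matched links can be narrowed down afterwards. This is a minimal sketch, not part of the answer's script: the helper name filterAssetLinks and the default extension list are assumptions chosen for illustration.

```javascript
// Hypothetical helper: keep only unique links whose path ends in one of the
// given extensions (query strings are ignored because only pathname is checked).
function filterAssetLinks(links, extensions = ['.js', '.css']) {
  // String.prototype.match returns null when nothing matches, so guard for it.
  const unique = [...new Set(links || [])];
  return unique.filter((link) => {
    try {
      const { pathname } = new URL(link);
      return extensions.some((ext) => pathname.toLowerCase().endsWith(ext));
    } catch (err) {
      return false; // skip strings that the URL parser rejects
    }
  });
}

// A few entries taken from the sample output, with one duplicate added.
const sample = [
  'http://www.w3.org/2000/svg',
  'https://s3.nikecdn.com/unite/scripts/unite.min.js',
  'https://www.nike.com/android-icon-192x192.png',
  'https://s3.nikecdn.com/unite/scripts/unite.min.js',
];

console.log(filterAssetLinks(sample));
// → [ 'https://s3.nikecdn.com/unite/scripts/unite.min.js' ]
```

The array returned by getLinksFromHtml could be passed to this helper directly.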
Notes:
I used the axios library for simplicity and to avoid "access denied" errors from nike.com. It is possible to use any other method to get the HTML source, like:
Native node.js http/https libraries
Puppeteer (Get complete web page source html with puppeteer - but some part always missing)
Although the other answers are applicable in many situations, they will not work for client-side rendered sites. For instance, if you just do an Axios request to Reddit, all you'll get is a couple of divs with some metadata. Because Puppeteer loads the page and runs all of its JavaScript in a real browser, the website's choice of document rendering becomes irrelevant for extracting page data.
Puppeteer has an evaluate method on the page object which allows you to run JavaScript directly on the page. Using that, you can easily extract all links as follows:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  const pageUrls = await page.evaluate(() => {
    const urlArray = Array.from(document.links).map((link) => link.href);
    const uniqueUrlArray = [...new Set(urlArray)];
    return uniqueUrlArray;
  });
  console.log(pageUrls);
  await browser.close();
})();
Yes, you can get all the script tags and their links without opening view-source.
Add the jsdom library as a dependency of your project, then pass the HTML response to a JSDOM instance as shown below.
Here is the code:
const axios = require('axios');
const jsdom = require("jsdom");
// await is only valid inside an async function in a CommonJS script
(async () => {
  // make a simple HTTP request using axios or node-fetch, as you wish
  const nikePageResponse = await axios.get('https://www.nike.com');
  // now parse this response into an HTML document using the jsdom library
  const dom = new jsdom.JSDOM(nikePageResponse.data);
  const nikePage = dom.window.document;
  // now get all the script tags by querying this page
  const scriptLinks = [];
  nikePage.querySelectorAll('script[src]').forEach(script => scriptLinks.push(script.src.trim()));
  console.debug('%o', scriptLinks);
})();
Here I have used a CSS selector for <script> tags that have a src attribute.
You can write the same code using Puppeteer, but it takes time to open the browser and load everything before you get the page source.
You can use this to find the links and then do whatever you want with them, using Puppeteer or anything else.
I'm trying to use NodeJS to modify an external HTML file (which is located in the same directory). In my index.js file I write:
fs.readFile('index.html', (err,html)=>{
if(err){
throw err;
}
html.body.innerHTML += '<div id = "asdf"></div>';
});
index.html is a valid document, but it doesn't seem to be read properly, as I get this error:
"TypeError: Cannot read property 'innerHTML' of undefined".
I guess that html has no body property.
How can I do changes in HTML using JavaScript?
Here is an example using node-html-parser
HTML file
<html>
<body>
<div id="fist">yolo</div>
</body>
</html>
And the nodejs
const fs = require('fs');
const parse = require('node-html-parser').parse;
fs.readFile('index.html', 'utf8', (err, html) => {
  if (err) {
    throw err;
  }
  const root = parse(html);
  const body = root.querySelector('body');
  //body.set_content('<div id = "asdf"></div>');
  // appendChild expects a node, so insert the markup string with insertAdjacentHTML instead
  body.insertAdjacentHTML('beforeend', '<div id = "asdf"></div>');
  console.log(root.toString()); // This you can write back to file!
});
There might be better solutions than node-html-parser, judging by the number of downloads. For example, htmlparser2 has many more downloads, but it also looks more complex :)
In order to manipulate an html file the way you'd be able to in a browser, you'll first need to parse it.
Perhaps node-html-parser can be of use? (Or if a few milliseconds of parsing are not a concern and you want some more functionality, the JSDOM package is very popular too.)
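innerHTML is a property of DOM elements, so it is only available after the HTML has been parsed into a DOM. Here you have the file contents rather than a DOM, so you can either use a DOM parser to create the structure, or just use a regex or string replacement to isolate the part you want to replace and append the text. Note that replace returns a new string rather than changing html in place.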
html.replace("</body>",'<div id = "asdf"></div></body>');
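As a minimal sketch of the string-based approach, the helper below inserts a fragment before the last closing body tag and returns the new string. The name appendToBody is made up for illustration, and it assumes the file was read as text (for example with the 'utf8' encoding argument), not as a Buffer.

```javascript
// Hypothetical helper: insert `fragment` just before the last </body> tag.
function appendToBody(html, fragment) {
  const closingBody = /<\/body\s*>/gi;
  let index = -1;
  let match;
  while ((match = closingBody.exec(html)) !== null) {
    index = match.index; // remember the position of the last closing tag seen
  }
  if (index === -1) {
    return html + fragment; // no </body> found: append at the very end
  }
  return html.slice(0, index) + fragment + html.slice(index);
}

const page = '<html><body><div id="first">yolo</div></body></html>';
console.log(appendToBody(page, '<div id="asdf"></div>'));
// → <html><body><div id="first">yolo</div><div id="asdf"></div></body></html>
```

The returned string is what you would write back to index.html, since strings are immutable and the original is left unchanged.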
Sorry about the vague title, but I'm a bit lost, so it's hard to be specific. I've started playing around with Firefox extensions using the Add-on SDK. What I'm trying to do is watch a page for changes, a Twitch.tv chat window in this case, and save those changes to a file.
I've gotten this to work: every time something changes on the page, it gets saved. But "unusual" characters, for example Korean text, don't get saved properly. I think this has to do with the encoding of the file or string. I tried saving the same characters by copy-pasting them into Notepad; it asked me to save in Unicode, and when I did, everything worked fine. So I figured I'd change the encoding of the log file to Unicode as well before writing to it. That didn't exactly work: now all the characters appeared to be in some kind of foreign language.
The code I'm using to write to the file is this:
var {Cc, Ci, Cu} = require("chrome");
var {FileUtils} = Cu.import("resource://gre/modules/FileUtils.jsm");
var file = FileUtils.getFile("Desk", ["mylogfile.txt"]);
var stream = FileUtils.openFileOutputStream(file, FileUtils.MODE_WRONLY | FileUtils.MODE_CREATE | FileUtils.MODE_APPEND);
stream.write(data, data.length);
stream.close();
I looked at the description of FileUtils.jsm over at MDN and as far as I can tell there's no way to tell it which encoding I want to use?
If you don't know a fix could you give me some good search terms because I seem to be coming up short on that front. Since I know basically nothing on the subject I'm flailing around in the dark a bit at the moment.
edit:
This is what I ended up with (for now) to get this thing working:
var {Cc, Ci, Cu} = require("chrome");
var {FileUtils} = Cu.import("resource://gre/modules/FileUtils.jsm");
var file = Cc['@mozilla.org/file/local;1']
    .createInstance(Ci.nsILocalFile);
file.initWithPath('C:\\temp\\temp.txt');
if(!file.exists()){
    file.create(file.NORMAL_FILE_TYPE, 0666);
}
var charset = 'UTF-8';
var fileStream = Cc['@mozilla.org/network/file-output-stream;1']
    .createInstance(Ci.nsIFileOutputStream);
fileStream.init(file, FileUtils.MODE_WRONLY | FileUtils.MODE_CREATE | FileUtils.MODE_APPEND, 0x200, false);
var converterStream = Cc['@mozilla.org/intl/converter-output-stream;1']
    .createInstance(Ci.nsIConverterOutputStream);
converterStream.init(fileStream, charset, data.length,
    Ci.nsIConverterInputStream.DEFAULT_REPLACEMENT_CHARACTER);
converterStream.writeString(data);
converterStream.close();
fileStream.close();
Dumping just the raw bytes (well, raw jschars actually) won't work. You need to first convert the data into some sensible encoding.
See e.g. the File I/O Snippets. Here are the crucial bits of creating a converter output stream wrapper:
var converter = Components.classes["@mozilla.org/intl/converter-output-stream;1"].
    createInstance(Components.interfaces.nsIConverterOutputStream);
converter.init(foStream, "UTF-8", 0, 0);
converter.writeString(data);
converter.close(); // this closes foStream
Another way is to use OS.File + TextEncoder:
let encoder = new TextEncoder(); // This encoder can be reused for several writes
let array = encoder.encode("This is some text"); // Convert the text to an array
let promise = OS.File.writeAtomic("file.txt", array, // Write the array atomically to "file.txt", using as temporary
{tmpPath: "file.txt.tmp"}); // buffer "file.txt.tmp".
It might even be possible to combine both. OS.File has the benefit that it will write data and access files off the main thread (so it won't block the UI while the file is being written).
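To see why the string has to be encoded first, here is a small sketch that compares a string's length with the number of UTF-8 bytes it encodes to. It assumes an environment where TextEncoder and TextDecoder are globals (a modern browser or a recent Node.js), not the Add-on SDK itself, and the sample text is made up.

```javascript
const text = '안녕하세요'; // sample Korean text, five characters

// encode() always produces UTF-8 and returns a Uint8Array of bytes.
const bytes = new TextEncoder().encode(text);

console.log(text.length);  // → 5  (UTF-16 code units in the JavaScript string)
console.log(bytes.length); // → 15 (each of these characters takes 3 bytes in UTF-8)

// Decoding the bytes as UTF-8 gives the original string back.
const roundTrip = new TextDecoder('utf-8').decode(bytes);
console.log(roundTrip === text); // → true
```

This also shows why passing data.length as a byte count to a raw stream write truncates or garbles non-ASCII text: the string length and the encoded byte length differ.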
How can I append data to a file using JavaScript?
I tried to use this code, but I got an error:
var fso = new ActiveXObject("Scripting.FileSystemOject");
var filepath = fso.GetFile("member.txt");
var fileObject = fso.OpenTextFile(filepath, 8);
file.WriteLine(id + "|" + pass);
fileObject.close();
The error is on var fso = new ActiveXObject("Scripting.FileSystemOject"); and reads: Error: Automation server can't create object
Is there any other way to append to the file using JavaScript, or a way to fix this? Thanks :)
EDIT:
I have done what's written on this, and it is still not working :/
I just noticed these in your code:
var fileObject = fso.OpenTextFile(filepath, 8,true);
You'll need the third argument set to true so that the file is created if it does not exist yet.
var filepath = fso.GetFile("member.txt");// This won't work.
var filepath = "your_filePath"; // Use this instead
var fileObject = fso.OpenTextFile(filepath, 8, true);
OpenTextFile() needs a path as a string like "D:/test/file.txt". GetFile() returns an object, which you can see as a string (D:\test\file.txt), but it's not a string. Also use absolute paths; relative paths don't seem to work in my experience.
EDIT
Add the code below to the <head>-part of your html-file, then save locally as a hta (with file extension hta, not htm or html).
<hta:application
applicationName="MyApp"
id="myapp"
singleInstance="yes"
/>
Then run the hta-file. If you are still getting an ActiveX error, it's not supported by your OS. If this works, you haven't set all the security settings correctly.
EDIT II
In this case it's not very useful to get the path through ActiveX; you'll need to write it literally anyway. And I'm not supposed to do your homework, but this does the trick...
var filepath = new String(fso.GetFile("member.txt")).replace(/\\/g,'/');
And don't forget what I've said above about using absolute paths...
The 8 in the OpenTextFile function specifies that you want to append to the file. Your problem comes from the security restrictions of your browser. To make it work you'll have to lower the security level, which is not really recommended.
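To make the magic number readable, the iomode values can be given names. This is a minimal sketch: the IOMode object and the appendLine helper are made up for illustration, and fso is assumed to be a Scripting.FileSystemObject instance created wherever ActiveX is allowed to run.

```javascript
// iomode values accepted by FileSystemObject.OpenTextFile
var IOMode = { ForReading: 1, ForWriting: 2, ForAppending: 8 };

// Hypothetical helper: append one line to the file at `filepath`.
function appendLine(fso, filepath, line) {
  // Third argument true: create the file if it does not exist yet.
  var fileObject = fso.OpenTextFile(filepath, IOMode.ForAppending, true);
  try {
    fileObject.WriteLine(line);
  } finally {
    fileObject.Close(); // always release the file handle
  }
}

// Usage where ActiveX is available (the path here is only an example):
// appendLine(new ActiveXObject("Scripting.FileSystemObject"), "D:/test/member.txt", id + "|" + pass);
```

Passing fso in as a parameter keeps the helper independent of how the object was created, so the ActiveXObject call stays in one place.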
The error is thrown because there are security restrictions which do not allow ActiveX to run. Change your security settings to allow ActiveX if you're using Internet Explorer (which I think you are).
This might be useful http://windows.microsoft.com/en-US/windows/help/genuine/ie-activex
Cheers
EDIT: I have done what's written on this, and it is still not working :/
* Try restarting your browser
As pointed out in this comment
Javascript: how to append data to a file
the cause of the error Error: Automation server can't create object is the typo in the progid passed to ActiveXObject: Oject instead of Object:
var fso = new ActiveXObject("Scripting.FileSystemOject");
there is a missing b!