Getting all the text content from a HTML string in NodeJS

I need to get only the text content from an HTML string, with a space or a line break separating the text content of different elements.
For example, the HTML string might be:
<ul>
<li>First</li>
<li>Second</li>
</ul>
What I want:
First Second
or
First
Second
I've tried wrapping the entire string inside a div and then reading its textContent using third-party libraries. However, that leaves no spaces or line breaks between the text content of different elements, which I specifically require (i.e. I get FirstSecond, which is not what I want).
The only solution I can think of right now is to build a DOM tree, recursively visit the nodes that contain text, and append each node's text to a string separated by spaces.
Is there a cleaner, neater, and simpler solution than this?

Convert HTML to Plain Text:
In your terminal, install the html-to-text npm package:
npm install html-to-text
Then in JavaScript:
const { convert } = require('html-to-text'); // Import the library
var htmlString = `
<ul>
<li>First</li>
<li>Second</li>
</ul>
`;
var text = convert(htmlString, { wordwrap: 130 })
// Out:
// First
// Second
Hope this helps!

You can try to get rid of the HTML tags using a regex. For your example, try the following:
let str = `<ul>
<li>First</li>
<li>Second</li>
</ul>`
console.log(str)
let regex = '<\/?!?(li|ul)[^>]*>'
var re = new RegExp(regex, 'g');
str = str.replace(re, '');
console.log(str)
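As a variation on the regex idea above, here is a minimal sketch (the helper name stripTags is made up for illustration). It replaces every tag, not just li and ul, with a space and then collapses the whitespace, which gives the space-separated output the question asks for. A regex is not a real HTML parser, so this assumes simple, well-formed markup with no > inside attribute values, and no comments or script/style content.

```javascript
// Hypothetical helper: turn every tag into a space, then collapse whitespace.
function stripTags(html) {
  return html
    .replace(/<[^>]*>/g, ' ') // each tag becomes a single space
    .replace(/\s+/g, ' ')     // collapse runs of whitespace, including newlines
    .trim();
}

const html = `<ul>
<li>First</li>
<li>Second</li>
</ul>`;

console.log(stripTags(html)); // First Second
```

Swapping the second replacement for one that emits '\n' would give the line-break variant instead.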

You can try this example, which uses the jsdom module:
https://www.npmjs.com/package/jsdom
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
console.log(dom.window.document.querySelector("p").textContent);
This approach worked for me, so it may help you too :)

In a browser you could read Node.textContent from the DOM. However, Node.js has no native access to the DOM, so textContent is not available there and you need external packages. You could install request and cheerio using npm. cheerio, suggested by Jon Church, is perhaps the easiest web scraping tool to use (there are also more complex ones, such as jsdom).
With cheerio and request in hand, you could write:
const request = require("request");
const cheerio = require("cheerio");
const fs = require("fs");
//taken from https://stackoverflow.com/a/19709846/10713877
function is_absolute(url)
{
var r = new RegExp('^(?:[a-z]+:)?//', 'i');
return r.test(url);
}
function is_local(url)
{
var r = new RegExp('^(?:file:)?//', 'i');
return (r.test(url) || !is_absolute(url));
}
function send_request(URL)
{
if(is_local(URL))
{
// Declare url_tmp locally so it does not leak as an implicit global
var url_tmp;
if(URL.slice(0,7)==="file://")
url_tmp = URL.slice(7,URL.length);
else
url_tmp = URL;
//taken from https://stackoverflow.com/a/20665078/10713877
const $ = cheerio.load(fs.readFileSync(url_tmp));
//Do something
console.log($.text())
}
else
{
var options = {
url: URL,
headers: {
'User-Agent': 'Your-User-Agent'
}
};
request(options, function(error, response, html) {
//no error
if(!error && response.statusCode == 200)
{
console.log("Success");
const $ = cheerio.load(html);
return Promise.resolve().then(()=> {
//Do something
console.log($.text())
});
}
else
{
console.log(`Failure: ${error}`);
}
});
}
}
Let me explain the code. You pass a URL to the send_request function. It checks whether the URL string is a path to a local file (an absolute or relative path, or one starting with file://). If it is a local file, it loads the file with cheerio directly; otherwise, it sends a request to the website using the request module and then loads the response with cheerio. Regular expressions are used in is_absolute and is_local. You get the text using the text() method provided by cheerio. Under the //Do something comments, you can do whatever you want with the text.
There are websites that tell you your user agent; copy and paste it in place of 'Your-User-Agent'.
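To see how the two checks classify different inputs, here is a small standalone sketch. It uses the same patterns as is_absolute and is_local, written as regex literals; the camelCase names and the sample URLs are my own choices for illustration.

```javascript
// Same patterns as is_absolute / is_local, written as regex literals.
function isAbsolute(url) {
  return /^(?:[a-z]+:)?\/\//i.test(url); // optional scheme followed by //
}

function isLocal(url) {
  return /^(?:file:)?\/\//i.test(url) || !isAbsolute(url);
}

// Illustrative inputs only.
const samples = [
  '/absolute/path/index.html',
  'relative/path/index.html',
  'file:///absolute/path/index.html',
  'https://stackoverflow.com/',
  '//example.com/page.html',
];

for (const s of samples) {
  console.log(s, '->', isLocal(s) ? 'local file' : 'remote');
}
```

Note one quirk: because the file: prefix is optional in the second pattern, a protocol-relative URL such as //example.com/page.html is also classified as local.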
The following calls will work:
//your local file
send_request("/absolute/path/to/your/local/index.html");
send_request("relative/path/to/your/local/index.html");
send_request("file:///absolute/path/to/your/local/index.html");
//website
send_request("https://stackoverflow.com/");
EDIT: I am on a linux system.

You can try using the npm library htmlparser2. It makes this very simple:
const htmlparser2 = require('htmlparser2');
const htmlString = ''; //your html string goes here
const parser = new htmlparser2.Parser({
ontext(text) {
if (text && text.trim().length > 0) {
//do as you need, you can concatenate or collect as string array
}
}
});
parser.write(htmlString);
parser.end();
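The empty ontext body above could be filled in along these lines. This is only a sketch of the collecting step. createTextCollector and getText are names I made up, and the loop at the end merely simulates the text chunks a parser might report for the example list; in real use you would pass collector.ontext as the ontext handler of the Parser shown above.

```javascript
// Hypothetical collector: gathers non-empty text chunks and joins them later.
function createTextCollector() {
  const parts = [];
  return {
    // Same shape as the ontext handler above: called once per text chunk.
    ontext(text) {
      const trimmed = text.trim();
      if (trimmed.length > 0) {
        parts.push(trimmed);
      }
    },
    // Join the collected chunks with a space (default) or e.g. '\n'.
    getText(separator = ' ') {
      return parts.join(separator);
    },
  };
}

// Simulated chunks, roughly what a parser would report for the <ul> example.
const collector = createTextCollector();
for (const chunk of ['\n', 'First', '\n', 'Second', '\n']) {
  collector.ontext(chunk);
}

console.log(collector.getText());     // First Second
console.log(collector.getText('\n')); // First and Second on separate lines
```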

Related

Reading a web page with node.js and urllib

I'm learning programming and found myself in a tough spot; the code from the tutorial is not working and I can't understand why.
It's a shell script that's supposed to retrieve a wikipedia page, strip it of the references, and return just the paragraphs text.
It uses the urllib library. In the code below, the only difference from the tutorial's is the use of fs to make a text file with the page content. The rest is copied and pasted.
#!/usr/local/bin/node
// Returns the paragraphs from a Wikipedia link, stripped of reference numbers.
let urllib = require("urllib");
let url = process.argv[2];
let fs = require("fs");
console.log(url);
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
urllib.request(url, { followRedirect: true }, function(error, data, response) {
let body = data.toString();
// Simulate a Document Object Model.
let { document } = (new JSDOM(body)).window;
// Grab all the paragraphs and references.
let paragraphs = document.querySelectorAll("p");
let references = document.querySelectorAll(".reference");
// Remove any references.
references.forEach(function(reference) {
reference.remove();
});
// Print out all of the paragraphs.
paragraphs.forEach(function(paragraph) {
console.log(paragraph.textContent);
fs.appendFileSync("article.txt", `${paragraph}\n`);
});
});
My first guess was that urllib was not working for some reason. This is because, even though I installed it as per the official documentation, typing which urllib at the command line doesn't return a path.
But then, node doesn't report an error about require("urllib") when I run the file.
The actual output is the following:
$ ./wikp https://es.wikipedia.org/wiki/JavaScript
https://es.wikipedia.org/wiki/JavaScript
$
Can anybody help please?

I think the tutorial you followed might have been a little out of date.
This works for me:
let urllib = require("urllib");
let url = process.argv[2];
let fs = require("fs");
console.log(url);
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
urllib.request(url, { followRedirect: true }).then(({data, res}) => {
let body = data.toString();
// Simulate a Document Object Model.
let { document } = (new JSDOM(body)).window;
// Grab all the paragraphs and references.
let paragraphs = document.querySelectorAll("p");
let references = document.querySelectorAll(".reference");
// Remove any references.
references.forEach(function(reference) {
reference.remove();
});
// Print out all of the paragraphs.
paragraphs.forEach(function(paragraph) {
console.log(paragraph.textContent);
fs.appendFileSync("article.txt", `${paragraph.textContent}\n`);
});
});
The package you are using (urllib) uses promises. That might have been different in the past, when the tutorial was released.
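To illustrate why the callback version can finish silently with no output, here is a sketch under the assumption that the API is promise-only. fakeRequest is a made-up stand-in, not urllib itself: like any JavaScript function, it simply ignores extra arguments, so a callback passed to it never runs.

```javascript
// Hypothetical stand-in for a promise-only request function (not urllib).
function fakeRequest(url) {
  return Promise.resolve({
    data: Buffer.from(`<p>page at ${url}</p>`),
    res: { status: 200 },
  });
}

// Callback style: the extra argument is ignored, so nothing is ever printed.
let callbackRan = false;
fakeRequest('https://example.org', function () {
  callbackRan = true;
});

// Promise style: the result arrives through .then().
const pending = fakeRequest('https://example.org').then(({ data }) => {
  console.log(data.toString());
  return data.toString();
});
```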

Editing an XML document

I am new to JavaScript and need the ability to create, edit and export an XML document on the server side. I have seen different options on the Internet, but they do not suit me.
I seem to have found one workable option: converting my XML file to JSON and back, then exporting it through another plugin. But maybe there is an easier way?
Thanks!

I recently came across a similar problem. The solution turned out to be very simple: use XML-Writer.
In your project folder, first install it via the console:
npm install xml-writer
Next, import it in a new file and try the following to see what is going on:
var XMLWriter = require ('xml-writer');
var xw = new XMLWriter;
xw.startDocument ();
xw.startElement ('root');
xw.writeAttribute ('foo', 'value');
xw.text ('Some content');
xw.endDocument ();
console.log (xw.toString ());
You can find more information here; the bottom of the page shows the code for each kind of item. In this way you can create, edit and export XML files. Good luck, and if something is not clear, write!
Additional
You will also need the fs module:
const fs = require("fs")
const xmlParser = require("xml2json")
const formatXml = require("xml-formatter")
Completed code:
const fs = require("fs")
const xmlParser = require("xml2json")
const formatXml = require("xml-formatter")
var XMLWriter = require('xml-writer');
var xw = new XMLWriter;
xw.startDocument();
xw.startElement('root');
xw.startElement('man');
xw.writeElement('name', 'Sergio');
xw.writeElement('adult', 'no');
xw.endElement();
xw.startElement('item');
xw.writeElement('name', 'phone');
xw.writeElement('price', '305.77');
xw.endElement();
xw.endDocument();
// Serialize the document built by the writer, then pretty-print and save it
const finalXml = xw.toString()
fs.writeFile("./datax.xml", formatXml(finalXml, { collapseContent: true }), function (err) {
if (err) {
console.log("Error")
} else {
console.log("Xml file successfully updated.")
}
})

How do I write a LZ compressed string to text file using JXA?

I am trying to write a JXA script in Apple Script Editor, that compresses a string using the LZ algorithm and writes it to a text (JSON) file:
var story = "Once upon a time in Silicon Valley..."
var storyC = LZString.compress(story)
var data_to_write = "{\x22test\x22\x20:\x20\x22"+storyC+"\x22}"
app.displayAlert(data_to_write)
var desktopString = app.pathTo("desktop").toString()
var file = `${desktopString}/test.json`
writeTextToFile(data_to_write, file, true)
Everything works, except that the LZ compressed string is just transformed to a set of "?" by the time it reaches the output file, test.json.
It should look like:
{"test" : "㲃냆੠Њޱᐈ攀렒삶퓲ٔ쀛䳂䨀푖㢈Ӱນꀀ"}
Instead it looks like:
{"test" : "????????????????????"}
I have a feeling the conversion is happening in the app.write command used by the writeTextToFile() function (which I pulled from an example in Apple's Mac Automation Scripting Guide):
var app = Application.currentApplication()
app.includeStandardAdditions = true
function writeTextToFile(text, file, overwriteExistingContent) {
try {
// Convert the file to a string
var fileString = file.toString()
// Open the file for writing
var openedFile = app.openForAccess(Path(fileString), { writePermission: true })
// Clear the file if content should be overwritten
if (overwriteExistingContent) {
app.setEof(openedFile, { to: 0 })
}
// Write the new content to the file
app.write(text, { to: openedFile, startingAt: app.getEof(openedFile) })
// Close the file
app.closeAccess(openedFile)
// Return a boolean indicating that writing was successful
return true
}
catch(error) {
try {
// Close the file
app.closeAccess(file)
}
catch(error) {
// Report the error if closing failed
console.log(`Couldn't close file: ${error}`)
}
// Return a boolean indicating that writing failed
return false
}
}
Is there a substitute command for app.write that maintains the LZ compressed string / a better way to accomplish what I am trying to do?
In addition, I am using the readFile() function (also from the Scripting Guide) to load the LZ string back into the script:
function readFile(file) {
// Convert the file to a string
var fileString = file.toString()
// Read the file and return its contents
return app.read(Path(fileString))
}
But rather than returning:
{"test" : "㲃냆੠Њޱᐈ攀렒삶퓲ٔ쀛䳂䨀푖㢈Ӱນꀀ"}
It is returning:
"{\"test\" : \"„≤ÉÎÉÜ‡©†–äÓÄéﬁ±·êàÊîÄÎ†íÏÇ∂Ìì≤ŸîÏÄõ‰≥Ç‰®ÄÌëñ„¢à”∞‡∫ôÍÄÄ\"}"
Does anybody know a fix for this too?
I know that it is possible to use Cocoa in JXA scripts, so maybe the solution lies therein?
I am just getting to grips with JavaScript so I'll admit trying to grasp Objective-C or Swift is way beyond me right now.
I look forward to any solutions and/or pointers that you might be able to provide me. Thanks in advance!

After some further Googling, I came across these two posts:
How can I write UTF-8 files using JavaScript for Mac Automation?
read file as class utf8
I have thus altered my script accordingly.
writeTextToFile() now looks like:
function writeTextToFile(text, file) {
// source: https://stackoverflow.com/a/44293869/11616368
var nsStr = $.NSString.alloc.initWithUTF8String(text)
var nsPath = $(file).stringByStandardizingPath
var successBool = nsStr.writeToFileAtomicallyEncodingError(nsPath, false, $.NSUTF8StringEncoding, null)
if (!successBool) {
throw new Error("function writeFile ERROR:\nWrite to File FAILED for:\n" + file)
}
return successBool
};
While readFile() looks like:
ObjC.import('Foundation')
const readFile = function (path, encoding) {
// source: https://github.com/JXA-Cookbook/JXA-Cookbook/issues/25#issuecomment-271204038
const pathString = path.toString()
!encoding && (encoding = $.NSUTF8StringEncoding)
const fm = $.NSFileManager.defaultManager
const data = fm.contentsAtPath(pathString)
const str = $.NSString.alloc.initWithDataEncoding(data, encoding)
return ObjC.unwrap(str)
};
Both use Objective-C to overcome app.write and app.read's inability to handle UTF-8.
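For anyone curious what the garbling is, here is a small Node.js sketch (not JXA) of the same kind of round trip. The text is written as UTF-8 bytes and then read back with a single-byte legacy encoding. The output in the question appears to come from a Mac Roman read; Node has no built-in Mac Roman decoder, so latin1 is used here purely as a stand-in. The sample string is shortened from the question.

```javascript
// Shortened sample in the spirit of the question's JSON.
const original = '{"test" : "㲃냆"}';

// Write side: encode as UTF-8 (each of these characters takes three bytes).
const utf8Bytes = Buffer.from(original, 'utf8');

// Read side, done wrong: decode the bytes with a single-byte encoding.
const misread = utf8Bytes.toString('latin1');

// Read side, done right: decode the very same bytes as UTF-8.
const restored = utf8Bytes.toString('utf8');

console.log(misread.length, 'characters after the wrong decode');
console.log(restored === original); // true
```

The bytes on disk are fine in that scenario; only the decoding step differs, which is why telling the reader to use UTF-8 explicitly fixes it.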

Replace regular expression in text file with file contents using node.js

This question would apply to any text file, but as I want to use it for HTML replacements I will use HTML files as examples. I have looked at things like gulp inject and replace on npm, but neither seemed to do quite what I needed.
I would like to have some placeholder text that references another file. When run through this replacement function, the placeholder text is replaced by the contents of that file.
main.html
<script><replace src="./other.js" /></script>
other.js
console.log("Hello, world!");
After the transformation the output file should be:
<script>console.log("Hello, world!")</script>
I have got to the following but don't know how to make it work with file streams in node.
var REGEX = /<replace src="(.+)" \/>/;
function replace(file){
var match = file.match(REGEX);
// match is null when the file contains no placeholder
if (match) {
var placeholder = match[0];
// toUpperCase is just an example and instead should lookup a file for contents
return file.replace(placeholder, match[1].toUpperCase());
}
return file;
}

If your files are reasonable in size, you can avoid using streams and go with a custom replacer for string.replace():
var fs = require('fs');
function replace(path) {
var REGEX = /<replace src="(.+)" \/>/g;
// load the html file
var fileContent = fs.readFileSync(path, 'utf8');
// replacePath is your match[1]
fileContent = fileContent.replace(REGEX, function replacer(match, replacePath) {
// load and return the replacement file
return fs.readFileSync(replacePath, 'utf8');
});
// this will overwrite the original html file, change the path for test
fs.writeFileSync(path, fileContent);
}
replace('./main.html');
Using streams
var es = require('event-stream');
function replaceWithStreams(path) {
var REGEX = /<replace src="(.+)" \/>/g;
fs.createReadStream(path, 'utf8')
.pipe(es.split()) // split the input file into lines
.pipe(es.map(function (line, next) {
line = line.replace(REGEX, function replacer(match, replacePath) {
// better to keep a readFileSync here for clarity
return fs.readFileSync(replacePath, 'utf8');
});
next(null, line);
})).pipe(fs.createWriteStream(path)); // change path if needed
}
replaceWithStreams('./main.html');
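One detail the snippets above leave open is what the src path is relative to: fs.readFileSync(replacePath) resolves it against the current working directory. Below is a hedged sketch of a variant that resolves it against the directory of the HTML file instead. The function name inlineReplacements, the non-greedy (.+?) group, the trim() call, and the temporary demo files are my own assumptions, not part of the answer above.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical variant: resolve each src relative to the HTML file's folder.
function inlineReplacements(htmlPath) {
  const baseDir = path.dirname(htmlPath);
  const html = fs.readFileSync(htmlPath, 'utf8');
  // (.+?) is non-greedy so two placeholders on one line are matched separately.
  return html.replace(/<replace src="(.+?)" \/>/g, (match, src) =>
    // trim() drops the trailing newline most editors add (a choice made here).
    fs.readFileSync(path.resolve(baseDir, src), 'utf8').trim()
  );
}

// Demo with throwaway files in a temporary directory.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'replace-demo-'));
fs.writeFileSync(path.join(demoDir, 'other.js'), 'console.log("Hello, world!");\n');
fs.writeFileSync(
  path.join(demoDir, 'main.html'),
  '<script><replace src="./other.js" /></script>'
);

const result = inlineReplacements(path.join(demoDir, 'main.html'));
console.log(result); // <script>console.log("Hello, world!");</script>
```

Returning the string instead of overwriting main.html keeps the source file reusable; write the result to a separate output path if you need a file.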

You can try this: https://gist.github.com/thebergamo/5ee9f589757ee904f882
Basically, you read your index.html, find the replace tag, read the referenced file, and substitute its content for the tag.
The snippet is promisified to help you handle exceptions =D

How to get the file-path of the currently executing javascript code

I'm trying to do something like a C #include "filename.c", or PHP include(dirname(__FILE__)."filename.php") but in javascript. I know I can do this if I can get the URL a js file was loaded from (e.g. the URL given in the src attribute of the tag). Is there any way for the javascript to know that?
Alternatively, is there any good way to load JavaScript dynamically from the same domain (without knowing the domain specifically)? For example, let's say we have two identical servers (QA and production) that have different URL domains. Is there a way to do something like include("myLib.js"); where myLib.js will load from the domain of the file loading it?
Sorry if that's worded a little confusingly.

Within the script:
var scripts = document.getElementsByTagName("script"),
src = scripts[scripts.length-1].src;
This works because the browser loads and executes scripts in order, so while your script is executing, the document it was included in is sure to have your script element as the last one on the page. This code of course must be 'global' to the script, so save src somewhere where you can use it later. Avoid leaking global variables by wrapping it in:
(function() { ... })();

All browsers except Internet Explorer (any version) support document.currentScript, which always works, no matter how the file was included (async, bookmarklet, etc.).
If you want to know the full URL of the JS file you're in right now:
var script = document.currentScript;
var fullUrl = script.src;
Tadaa.

I just made this little trick :
window.getRunningScript = () => {
return () => {
return new Error().stack.match(/([^ \n])*([a-z]*:\/\/\/?)*?[a-z0-9\/\\]*\.js/ig)[0]
}
}
console.log('%c Currently running script:', 'color: blue', getRunningScript()())
✅ Works on: Chrome, Firefox, Edge, Opera
Enjoy !

The accepted answer here does not work if you have inline scripts in your document. To avoid this you can use the following to only target <script> tags with a [src] attribute.
/**
* Current Script Path
*
* Get the dir path to the currently executing script file
* which is always the last one in the scripts array with
* an [src] attr
*/
var currentScriptPath = function () {
var scripts = document.querySelectorAll( 'script[src]' );
var currentScript = scripts[ scripts.length - 1 ].src;
var currentScriptChunks = currentScript.split( '/' );
var currentScriptFile = currentScriptChunks[ currentScriptChunks.length - 1 ];
return currentScript.replace( currentScriptFile, '' );
}
This effectively captures the last external .js file, solving some issues I encountered with inline JS templates.

Refining upon the answers found here I came up with the following:
getCurrentScript.js
var getCurrentScript = function() {
if (document.currentScript) {
return document.currentScript.src;
} else {
var scripts = document.getElementsByTagName('script');
return scripts[scripts.length - 1].src;
}
}
module.exports = getCurrentScript;
getCurrentScriptPath.js
var getCurrentScript = require('./getCurrentScript');
var getCurrentScriptPath = function () {
var script = getCurrentScript();
var path = script.substring(0, script.lastIndexOf('/'));
return path;
};
module.exports = getCurrentScriptPath;
BTW: I'm using the CommonJS module format and bundling with webpack.

I've more recently found a much cleaner approach to this, which can be executed at any time, rather than being forced to do it synchronously when the script loads.
Use stackinfo to get a stacktrace at a current location, and grab the info.file name off the top of the stack.
var info = stackinfo()
console.log('This is the url of the script '+info[0].file)

I've coded a simple function that gets the absolute location of the current JavaScript file by using a try/catch.
// Get script file location
// doesn't work for older browsers
var getScriptLocation = function() {
var fileName = "fileName";
var stack = "stack";
var stackTrace = "stacktrace";
var loc = null;
var matcher = function(stack, matchedLoc) { return loc = matchedLoc; };
try {
// Invalid code
0();
} catch (ex) {
if(fileName in ex) { // Firefox
loc = ex[fileName];
} else if(stackTrace in ex) { // Opera
ex[stackTrace].replace(/called from line \d+, column \d+ in (.*):/gm, matcher);
} else if(stack in ex) { // WebKit, Blink and IE10
ex[stack].replace(/at.*?\(?(\S+):\d+:\d+\)?$/g, matcher);
}
return loc;
}
};
You can see it here.

Refining upon the answers found here (the "little trick" answer and the "getCurrentScript and getCurrentScriptPath" answer), I came up with the following:
//Thanks to https://stackoverflow.com/a/27369985/5175935
var getCurrentScript = function() {
if (document.currentScript && (document.currentScript.src !== ''))
return document.currentScript.src;
var scripts = document.getElementsByTagName('script'),
str = scripts[scripts.length - 1].src;
if (str !== '')
return str ;
//Thanks to https://stackoverflow.com/a/42594856/5175935
return new Error().stack.match(/(https?:[^:]*)/)[0];
};
//Thanks to https://stackoverflow.com/a/27369985/5175935
var getCurrentScriptPath = function() {
var script = getCurrentScript(),
path = script.substring(0, script.lastIndexOf('/'));
return path;
};
console.log({path: getCurrentScriptPath()})

Regardless of whether it's a script, an HTML file (for a frame, for example), a CSS file, an image, or anything else, if you don't specify a server/domain, that of the HTML document is used by default. So you could write, for example:
<script type=text/javascript src='/dir/jsfile.js'></script>
or
<script type=text/javascript src='../../scripts/jsfile.js'></script>
If you don't provide the server/domain, the path is resolved against the main document's URL: a path starting with / is taken from the site root, and any other path is taken relative to the page's own path.

I may be misunderstanding your question but it seems you should just be able to use a relative path as long as the production and development servers use the same path structure.
<script src="js/myLib.js"></script>

I've thrown together some spaghetti code that gets the .js file currently being run (e.g. if you run a script with "node .", you can use this to get the directory of the running script).
It returns it as "file://path/to/directoryWhere/fileExists":
var thisFilesDirectoryPath = stackinfo()[0].traceline.substring("readFile (".length, stackinfo()[0].traceline.length - ")".length-"readFile (".length);
This requires an npm package (I'm sure it's on other platforms as well):
npm i stackinfo
import stackinfo from 'stackinfo'; or var {stackinfo} = require("stackinfo");

function getCurrentScriptName() {
const url = new URL(document.currentScript.src);
const {length:len, [len-1]:last} = url.pathname.split('/');
return last.slice(0,-3);
}
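The function above assumes the file name ends in exactly three characters (".js") and reads document.currentScript directly. As a hedged variation, the URL parsing can be pulled into a helper that takes the src string as a parameter, which also makes it usable outside a browser. scriptNameFromUrl is a made-up name, and stripping only the last extension is my own choice.

```javascript
// Hypothetical helper: derive a script's base name from its URL.
function scriptNameFromUrl(src) {
  const { pathname } = new URL(src);      // drops the query string and hash
  const file = pathname.split('/').pop(); // last path segment
  return file.replace(/\.[^.]+$/, '');    // strip only the final extension
}

console.log(scriptNameFromUrl('https://example.com/static/js/app.js'));          // app
console.log(scriptNameFromUrl('https://example.com/static/js/app.min.mjs?v=3')); // app.min
```

In a browser you would call it as scriptNameFromUrl(document.currentScript.src) while the script is executing synchronously.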
