Remove unicode characters from Cheerio.js content


I’m using cheeriojs to scrape content off a webpage, with the following HTML.
<p>
Although the PM's office could neither confirm nor deny this, the spokesperson, John Doe said the meeting took place on Sunday.
<br>
<br>
“The outcome will be made public in due course,” John said in an SMS yesterday.
<br>
<br>
</p>
I’m able to reach the content of interest, by class and id tags, as follows:
$('.top-stories .line.more').each(function(i, el){
//Do something…
let content = $(this).next().html();
});
Once I’ve captured the content of interest, I “clean” it up using regular expressions, as below:
let cleanedContent = content.split(/<br>/).join(' \n ');
Inserting a newline where an empty tag (<br>) is matched. So far all is good, until I look at the cleaned content below:
Although the PM&apos;s office could neither confirm nor deny this, the spokesperson, Saima Shaanika said the meeting took place on Friday.
“The outcome will be made public in due course,”
It appears that punctuation marks, and perhaps some other characters, are stored according to their unicode codes. I may be wrong on this, and would welcome some correction to this line of thought.
Assuming that they are stored as unicode codes, is there a module that I could pass the “cleanedContent” variable through, to convert the codes to human-readable punctuation marks/characters?
Should this not be possible, is there a better way of using cheeriojs that would avoid this? I'm totally open to the notion that I'm not using cheeriojs correctly, and would love some direction as to new approaches I could try instead.
One way I can think of is writing a module containing several unicode codes and their corresponding characters, then looking for matches and replacing each matched code with the corresponding human-readable character. I have some intuitive feeling that someone's already done this or something similar. I'd rather not try to reinvent the wheel.
Thanks in advance.

Cheerio uses htmlparser2 internally.
Because of this, you can use htmlparser2's decodeEntities option when loading the HTML string, which allows you to configure how HTML entities should be treated.
Example:
$ = cheerio.load('<ul id="fruits">...</ul>', {
decodeEntities: false
});
Relevant docs:
Cheerio
htmlparser2
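If changing the load options is not possible, entities such as &apos; can also be decoded after the fact. What follows is only a minimal sketch and not part of Cheerio or htmlparser2: decodeBasicEntities and the NAMED table are hypothetical names, and only a handful of named entities are covered. For full coverage a maintained entity-decoding library would be the safer choice.

```javascript
// Hypothetical lookup table: only a few common named entities are listed.
const NAMED = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00A0' };

// Hypothetical helper: decodes numeric references (&#8220; / &#x201C;)
// and the named entities listed in NAMED; anything else is left untouched.
function decodeBasicEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1].toLowerCase() === 'x';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match;
  });
}

console.log(decodeBasicEntities('Although the PM&apos;s office could neither confirm nor deny this'));
// prints: Although the PM's office could neither confirm nor deny this
```

The same function could be applied to cleanedContent after the <br> replacement step.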

Related

React/Javascript display API data shows hard coded links instead of hyperlink?

So I'm messing around with this API and for the description it has the links hard coded like Bunch of words, so it just shows exactly that on my browser. How would I display the description to look normal in my app?
Here is the API https://api.coingecko.com/api/v3/coins/bitcoin
and this is just a simple way I got the description to display
const Tokens = ({coin}) => {
return (
<p>{coin.description.en}</p>
)
}
This would end up showing all the a tags on the browser instead of converting them into a clickable link
Peercoin, Primecoin, and so on.\r\n\r\nThe cryptocurrency then took off with the innovation of the turing-complete smart contract by Ethereum which led to the development of other amazing projects such as EOS, Tron, and even crypto-collectibles such as CryptoKitties.",
Is there a way to display the description so that it looks like a normal paragraph with hyperlinks instead of literally showing the hard coded tags?
Also, if I only wanted to show like the first two sentences, how would I cut out the rest of the paragraph?

Like Jacob said, you can use the dangerouslySetInnerHTML prop by utilizing interpolated string literals.
const apiResponse = `Peercoin, Primecoin, and so on.\r\n\r\nThe cryptocurrency then took off with the innovation of the turing-complete smart contract by Ethereum which led to the development of other amazing projects such as EOS, Tron, and even crypto-collectibles such as CryptoKitties.`
...
<p dangerouslySetInnerHTML={{__html: apiResponse}}></p>
However, \r and \n won't be understood by the HTML, which collapses them into ordinary whitespace. You can replace each line break with a <br /> tag that HTML will render, like so:
const cleanedAPIResponse = apiResponse.replace(/\r\n|\r|\n/g, "<br />");
...
<p dangerouslySetInnerHTML={{__html: cleanedAPIResponse}}></p>
Note: a regular expression with the g flag is needed here, because replace with a plain string pattern only replaces the first match. FYI, \r is known as a 'carriage return' and \n as a 'line feed'.
If you wanted only the first two sentences, an idea could be to search for the second instance of '.' in the API response. Then you can truncate the rest of the string literal, and append the appropriate closing tags based on the appearance of the tags going from left to right and which do not have matching closing tags in the string already.
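The truncation idea above could be sketched roughly as follows. This is an illustration under stated assumptions, not a robust HTML parser: firstSentences and VOID_TAGS are hypothetical names, a sentence is assumed to end at a '.' followed by whitespace, a tag, or the end of the string, and the markup is assumed to be well nested.

```javascript
// Tags that never take a closing tag (a deliberately short, assumed list).
const VOID_TAGS = new Set(['br', 'hr', 'img']);

// Hypothetical helper: keep the first `count` sentences of an HTML fragment
// and append closing tags for any elements still open at the cut point.
function firstSentences(html, count) {
  const openTags = []; // names of tags opened but not yet closed
  let sentences = 0;
  let i = 0;
  while (i < html.length) {
    if (html[i] === '<') {
      const end = html.indexOf('>', i);
      if (end === -1) break; // malformed tag: stop scanning
      const tag = html.slice(i + 1, end).trim();
      const name = tag.split(/\s+/)[0].toLowerCase();
      if (tag.startsWith('/')) {
        openTags.pop();
      } else if (!tag.endsWith('/') && !VOID_TAGS.has(name)) {
        openTags.push(name);
      }
      i = end + 1;
      continue;
    }
    // Periods inside tags (for example in URLs) are never seen here,
    // because whole tags are skipped by the branch above.
    const atEnd = i + 1 === html.length;
    if (html[i] === '.' && (atEnd || /[\s<]/.test(html[i + 1]))) {
      sentences += 1;
      if (sentences === count) {
        const closers = openTags.reverse().map((t) => `</${t}>`).join('');
        return html.slice(0, i + 1) + closers;
      }
    }
    i += 1;
  }
  return html; // fewer than `count` sentences: return everything
}

console.log(firstSentences('One <a href="x.com/a.b">link</a>. Two <b>bold. Three.', 2));
// prints: One <a href="x.com/a.b">link</a>. Two <b>bold.</b>
```

Abbreviations such as "e.g." would be counted as sentence ends by this rule, so real descriptions may need a stricter test.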

Parse JS code for comments

I have a small NodeJS program that I use to extract code comments from files I point it to. It mostly works, but I'm having some issues dealing with it misinterpreting certain JS strings (glob patterns) as code comments.
I'm using the regex [^:](\/\/.+)|(\/\*[\W\w\n\r]+?\*\/) to parse the following test file:
function DoStuff() {
/* This contains the value of foo.
Foo is used to display "foo"
via http://stackoverflow.com
*/
this.foo = "http://google.com";
this.protocolAgnosticUrl = "//cdnjs.cloudflare.com/ajax/libs/jquery/3.2.1/core.js";
//Show a message about foo
alert(this.foo);
/// This is a triple-slash comment!
const globPatterns = [
'path/to/**/*.tests.js',
'!my-file.js',
"!**/folder/*",
'another/path/**/*.tests.js'
];
}
Here's a live demo to help visualize what is and is not properly captured by the regex: https://regex101.com/r/EwYpQl/1
I need to be able to only locate the actual code comments here, and not the comment-like syntax that can sometimes appear within strings.

I have to agree with the comments that for most cases it is better to use a parser, even when a RegExp can do the job for a specific and well defined use case.
The problem is not that you can't make it work for that very specific use case, even though there are probably plenty of edge cases that you don't care about (and don't have to) but that may break that solution. The actual problem is that if you start building around your sub-optimal solution and your requirements evolve over time, you will start to patch those edge cases as they appear. Someday, you may find yourself with an extensive codebase full of patches that doesn't scale anymore, and the only solution will probably be to start from scratch.
Anyway, you have been warned by a few of us, and it is still possible that your use case is really that simple and will not change in the future. I would still consider moving from RegExp to a parser at some point, but maybe you can use this in the meantime:
(^ +\/\/(.*))|(["'`]+.*["'`]+.*\/\/(.*))|(["'`]+.*["'`]+.*\/\*([\W\w\n\r]+?)\*\/)|(^ +\/\*([\W\w\n\r]+?)\*\/)
Just in case, I have added a few other cases, such as comments that come straight after some valid code.
Edit to prove the first point and what is being said in the comments:
I have just answered this with the previous RegExp that was solving just the issue that you pointed out in your question (your RegExp was misinterpreting strings containing glob patterns as code comments).
So, I fixed that and I even made it able to match comments that start on the same line as a valid (non-commented) statement. Just a moment after posting that, I noticed that this last feature will only work if that statement contains a string.
This is the updated version, but please, keep in mind that this is exactly what we are warning you about...:
(^[^"'`\n]+\/\/(.*))|(["'`]+.*["'`]+.*\/\/(.*))|(["'`]+.*["'`]+.*\/\*([\W\w\n\r]+?)\*\/)|(^[^"'`\n]+\/\*([\W\w\n\r]+?)\*\/)
How does it work?
There are 4 main groups that compose the whole RegExp, the first two for single-line comments and the next two for multi-line comments:
(^[^"'`\n]+\/\/(.*))
(["'`]+.*["'`]+.*\/\/(.*))
(["'`]+.*["'`]+.*\/\*([\W\w\n\r]+?)\*\/)
(^[^"'`\n]+\/\*([\W\w\n\r]+?)\*\/)
You will see there are some repeated patterns:
^[^"'`\n]+: From the start of a line, match anything that doesn't include any kind of quote or line break.
` is for ES2015 template literals.
Line breaks are excluded as well to prevent matching empty lines.
Note the + will prevent matching comments that are not padded with at least one space. You can try replacing it with *, but then it will match strings containing glob patterns again.
["'`]+.*["'`]+.*: This matches anything that is between quotes, including anything that looks like a comment but is part of a string. Whatever you match afterwards will be outside that string, so using another group you can match comments.
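As a step towards the parser approach recommended above, here is a minimal sketch of a character scanner that skips string contents before looking for comments. It is a swapped-in technique, not the regular expression from this answer: extractComments is a hypothetical name, and the sketch assumes the source contains no regex literals and no ${...} expressions with nested quotes inside template literals.

```javascript
// Hypothetical helper: collects // and /* */ comments while skipping
// everything that sits inside '...', "..." or `...` strings.
function extractComments(source) {
  const comments = [];
  let i = 0;
  while (i < source.length) {
    const ch = source[i];
    const next = source[i + 1];
    if (ch === '"' || ch === "'" || ch === '`') {
      // Skip to the matching closing quote, honouring backslash escapes.
      i += 1;
      while (i < source.length && source[i] !== ch) {
        i += source[i] === '\\' ? 2 : 1;
      }
      i += 1;
    } else if (ch === '/' && next === '/') {
      // Single-line comment: runs to the end of the line.
      let end = source.indexOf('\n', i);
      if (end === -1) end = source.length;
      comments.push(source.slice(i, end));
      i = end;
    } else if (ch === '/' && next === '*') {
      // Multi-line comment: runs to the closing */ (or to the end if unclosed).
      let end = source.indexOf('*/', i + 2);
      end = end === -1 ? source.length : end + 2;
      comments.push(source.slice(i, end));
      i = end;
    } else {
      i += 1;
    }
  }
  return comments;
}
```

Run against the test file from the question, this should return the block comment, the //Show a message line and the triple-slash line, while the URLs and glob patterns inside strings are skipped.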

How to detect and remove unwanted lines from a string?

I am working on a project in which I have to extract text data from a PDF.
I am able to extract text from the PDF, but the extracted text sometimes contains lines which I would like to strip from it.
Here's an example of unwanted lines -
ISBN 0-7225-3293-8. = CONTENTS = Part One Part Two Epilogue
Page 1 / 94
And here's an example of good lines (which I'd like to keep) -
Dusk was falling as the boy arrived with his herd at an abandoned church.
I wanted to sleep a little longer, he thought. He had had the same dream that night as a week ago
Different PDFs can give out different unwanted lines.
How can I detect them?

Option 1 - Give the computer a rule: If you are able to narrow down what content it is that you would like to keep, the obvious criteria that sticks out to me is the exclusion of special characters, then you can filter your results based on this.
So let's say you agree that all "good lines" will be without special characters ('/', '-', and '=') for example. If a line DOES contain one of these characters, you know you can remove it from the content you are keeping. This could be done in a for loop containing an if condition that looks something like this:
// extractedText is an assumed name for the text pulled from the PDF
var lineArray = extractedText.split("\n"); // each line becomes an element of the array
for (var cnt = 0; cnt < lineArray.length; cnt++)
{
var line = lineArray[cnt];
// blank out any line that contains one of the special characters
if (line.includes("/") || line.includes("-") || line.includes("="))
lineArray[cnt] = "";
}
At the end of this code you could simply get all the text within the array and it would no longer contain the unwanted lines. If there are unwanted lines however, that are virtually indistinguishable by characters, length, positioning etc. the previous approach begins to break down on some of the trickier lines.
This is because there is no rule you can give the computer to distinguish between the good and the bad without giving it a brain such as yours that recognizes parts of speech and sentence structure. In which case you might consider option 2, which is just that.
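For what it's worth, the rule from Option 1 can also be written as a small filter. This is only a sketch under the same assumption (good lines never contain '/', '-' or '='); removeUnwantedLines and SPECIAL_CHARACTERS are hypothetical names, and unlike the loop above it drops the unwanted lines instead of blanking them.

```javascript
// The character list mirrors the rule described in Option 1.
const SPECIAL_CHARACTERS = /[\/\-=]/;

// Hypothetical helper: keep only the lines without any special character.
function removeUnwantedLines(text) {
  return text
    .split(/\r?\n/)
    .filter((line) => !SPECIAL_CHARACTERS.test(line))
    .join('\n');
}

const extracted = [
  'ISBN 0-7225-3293-8. = CONTENTS = Part One Part Two Epilogue',
  'Page 1 / 94',
  'Dusk was falling as the boy arrived with his herd at an abandoned church.',
].join('\n');

console.log(removeUnwantedLines(extracted));
// prints: Dusk was falling as the boy arrived with his herd at an abandoned church.
```

A good line that happens to contain a hyphen would be dropped as well, which is the kind of case where this rule starts to break down.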
Option 2 - Give the computer a brain: Given that the text you want to remove will more or less be incoherent documentation based on what you have shown us, an open source (or purchased) natural language processor may be what you are looking for.
I found a good beginner's intro at http://myreaders.info/10_Natural_Language_Processing.pdf with some information that might be of use to you. From the source,
"Linguistics is the science of language. Its study includes:
sounds (phonology),
word formation (morphology),
sentence structure (syntax),
meaning (semantics), and understanding (pragmatics) etc.
Syntactic Analysis : Here the analysis is of words in a sentence to know the grammatical structure of the sentence. The words are transformed into structures that show how the words relate to each others. Some word sequences may be rejected if they violate the rules of the language for how words may be combined. Example: An English syntactic analyzer would reject the sentence say : 'Boy the go the to store.' "
Using some sort of NLP, you can discover whether a given section of text contains a sentence or some incoherent rambling. This test could then be used as a filter in your program for what you would like to keep or remove.
Side note- As it appears your sample text is not just sentences but literature, sometimes characters will speak in sentence fragments as part of their nature given by the author. In this case, you could add a separate condition that if the text is contained within two quotations and has no special characters, you want to keep the text regardless.
In the end NLP may be more work than you require or that you want to do, in which case Option 1 is likely going to be your best bet. On the other hand, it may be just the thing you are looking for. Whatever the case or if you decide you need some combination of the two, best of luck! I hope this answer helps.

Is there a javascript module to format user text input into proper "english" format?

I am looking for a text formatter that will take user input from a textbox and add in newlines and stuff.
The current way I'm doing it is to push the text into a function and replace specific regex matches with their html counterparts:
function textformatter(text) {
var urlRegex =/[\n\r]/g;
return text.replace(urlRegex, function(text) {
return '<br/>';
});
}
I'd have to add a new regex match for each thing I want to replace i.e. italics, bolds, etc. Thought I'd try to find a module out there because this seems like a common thing to do and someone might've written something much better than what I'd write.
Tried to google but I guess my search words are incorrect?
Thanks, in advance.

I'm not exactly clear on what you want the intended output to be ("add in newlines and stuff"), but have you looked at Markdown?
It's a lightweight markup language that lets you write mostly in plain text and get formatted HTML as output. All the markup is supposed to be readable as plain text (for example, bold is done with asterisks). In fact, StackOverflow comments/answers/questions are written in a Markdown dialect! And you'll definitely be able to find a Markdown converter for whatever language you want. Here's a js markdown parser.
In any case, it's hard to answer without knowing more about what you need. What is text in the function you're calling? Is it plain text from a textarea? Besides replacing new lines with <br>, what formatting do you want to be done?

As mentioned, Markdown would be the right thing to use for text, but if you simply want to escape unsafe characters for use in a URL, the JavaScript method encodeURIComponent could be used. Note that it percent-encodes characters; it does not produce HTML entities.
http://www.w3schools.com/jsref/jsref_encodeuricomponent.asp
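To make the two suggestions above a bit more concrete, here is a minimal sketch that escapes HTML special characters first and then applies a few Markdown-like replacements. It is not a Markdown implementation, and escapeHtml and formatText are hypothetical names; a real Markdown library handles far more cases (nesting, lists, links).

```javascript
// Hypothetical helper: escape characters that are unsafe inside HTML text.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: escape first, then add a small set of formatting rules.
function formatText(text) {
  return escapeHtml(text)
    .replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>') // **bold**
    .replace(/\*(.+?)\*/g, '<em>$1</em>')             // *italic*
    .replace(/\r\n|\r|\n/g, '<br/>');                 // line breaks
}

console.log(formatText('a <b> & **bold** *it*\nnext'));
// prints: a &lt;b&gt; &amp; <strong>bold</strong> <em>it</em><br/>next
```

Escaping before formatting matters: doing it the other way round would also escape the tags the formatter just inserted.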

Is there a way to automatically control orphaned words in an HTML document?

I was wondering if there's a way to automatically control orphaned words in an HTML file, possibly by using CSS and/or Javascript (or something else, if anyone has an alternative suggestion).
By 'orphaned words', I mean singular words that appear on a new line at the end of a paragraph. For example:
"This paragraph ends with an undesirable orphaned
word."
Instead, it would be preferable to have the paragraph break as follows:
"This paragraph no longer ends with an undesirable
orphaned word."
While I know that I could manually correct this by placing an HTML non-breaking space (&nbsp;) between the final two words, I'm wondering if there's a way to automate the process, since manual adjustments like this can quickly become tedious for large blocks of text across multiple files.
Incidentally, the CSS2.1 properties orphans (and widows) only apply to entire lines of text, and even then only for the printing of HTML pages (not to mention the fact that these properties are largely unsupported by most major browsers).
Many professional page layout applications, such as Adobe InDesign, can automate the removal of orphans by automatically adding non-breaking spaces where orphans occur; is there any sort of equivalent solution for HTML?

You can avoid orphaned words by replacing the space between the last two words in a sentence with a non-breaking space (&nbsp;).
There are plugins out there that do this, for example jqWidon't or this jquery snippet.
There are also plugins for popular frameworks (such as typogrify for django and widon't for wordpress) that essentially do the same thing.

I know you wanted a javascript solution, but in case someone found this page while looking for a solution for emails (where Javascript isn't an option), I decided to post my solution.
Use CSS white-space: nowrap. So what I do is surround the last two or three words (or wherever I want the "break" to be) in a span, add an inline CSS (remember, I deal with email, make a class as needed):
<td>
I don't <span style="white-space: nowrap;">want orphaned words.</span>
</td>
In a fluid/responsive layout, if you do it right, the last few words will break to a second line until there is room for those words to appear on one line.
Read more about the white-space property on this link: http://www.w3schools.com/cssref/pr_text_white-space.asp
EDIT: 12/19/2015 - Since this isn't supported in Outlook, I've been adding a non-breaking space between the last two words in a sentence. It's less code, and supported everywhere.
EDIT: 2/20/2018 - I've discovered that the Outlook App (iOS and Android) doesn't support the &nbsp; entity, so I've had to combine both solutions, e.g.:
<td>
I don't <span style="white-space:nowrap;">want orphaned&nbsp;words.</span>
</td>

In short, no. This is something that has driven print designers crazy for years, but HTML does not provide this level of control.
If you absolutely positively want this, and understand the speed implications, you can try the suggestion here:
detecting line-breaks with jQuery?
That is the best solution I can imagine, but that does not make it a good solution.

I see there are 3rd party plugins suggested, but it's simpler to do it yourself. If all you want to do is replace the last space character with a non-breaking space, it's almost trivial:
const unorphanize = (str) => {
let iLast = str.lastIndexOf(' ');
let stArr = str.split('');
stArr[iLast] = '&nbsp;';
return stArr.join('')
}
I suppose this may miss some unique cases, but it's worked for all my use cases. The caveat is that you can't just plug the output in where text would go; you have to set innerHTML = unorphanize(text) or otherwise parse it.

If you want to handle it yourself, without jQuery, you can write a javascript snippet to replace the text, if you're willing to make a couple assumptions:
A sentence always ends with a period.
You always want to replace the whitespace before the last word with &nbsp;
Assuming you have this html (which is styled to break right before "end" in my browser...monkey with the width if needed):
<div id="articleText" style="width:360px;color:black; background-color:Yellow;">
This is some text with one word on its own line at the end.
<p />
This is some text with one word on its own line at the end.
</div>
You can create this javascript and put it at the end of your page:
<script type="text/javascript">
reformatArticleText();
function reformatArticleText()
{
var div = document.getElementById("articleText");
div.innerHTML = div.innerHTML.replace(/\s(\S*)\./g, "&nbsp;$1.");
}
</script>
The regex simply finds all instances (using the g flag) of a whitespace character (\s) followed by any number of non-whitespace characters (\S) followed by a period. It creates a back-reference to the non-whitespace characters that you can use in the replacement text.
You can use a similar regex to include other end punctuation marks.
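One way such a "similar regex" might look, as a rough sketch: the character class [.!?] covers three end marks, and the function works on a plain string so it can be tried outside the browser. bindLastWords is a hypothetical name, and abbreviations like "e.g." would be treated as sentence ends too.

```javascript
// Hypothetical helper: put a non-breaking space before the last word of
// every sentence that ends in '.', '!' or '?'.
function bindLastWords(text) {
  // \s+ is the gap before the last word; (\S+[.!?]) is that word plus its end mark.
  // The lookahead requires whitespace, a tag, or the end of the string after the mark.
  return text.replace(/\s+(\S+[.!?])(?=\s|<|$)/g, '&nbsp;$1');
}

console.log(bindLastWords('Is this the end? Yes it is.'));
// prints: Is this the&nbsp;end? Yes it&nbsp;is.
```

As with the snippet above, the result contains an entity, so it has to be assigned through innerHTML rather than inserted as plain text.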

If third-party JavaScript is an option, one can use typogr.js, a JavaScript "typogrify" implementation. This particular filter is called, unsurprisingly, Widont.
<script src="https://cdnjs.cloudflare.com/ajax/libs/typogr/0.6.7/typogr.min.js"></script>
<script>
document.body.innerHTML = typogr.widont(document.body.innerHTML);
</script>
</body>
