I want to get the right video snippet title that doesn't include special characters.
I am using the API:
https://www.googleapis.com/youtube/v3/search,
with the part snippet.
Currently, I am getting the snippet.title below:
I&#39;M GONNA CARRY HER!!! Fortnite With Karina!
I expected this title instead:
I'm gonna carry her!!! Fortnite With Karina!
First, please acknowledge that what you've got from the API are not (quote from you) special characters.
To be technically precise, those sequences of characters are HTML character references, also known as HTML entities.
The behavior you've encountered is a well-known issue of the API, for which there's no other solution that I know of, except that you yourself have to substitute those HTML entities for the actual characters that they stand for.
Now, I recommend against an ad hoc solution; that is, I recommend that you use a well-written, well-tested, well-known library whose non-trivial solution comes from carefully implemented code conforming to the current HTML standard.
In my opinion, Mathias Bynens' library is evidently a tool that meets each of the criteria I mentioned above:
he
he (for “HTML entities”) is a robust HTML entity encoder/decoder written in JavaScript. It supports all standardized named character references as per HTML, handles ambiguous ampersands and other edge cases just like a browser would, has an extensive test suite, and — contrary to many other JavaScript solutions — he handles astral Unicode symbols just fine. An online demo is available.
I'm using escape-goat as it operates as either a standalone function or as a tagged template literal, depending on your use case:
const {htmlUnescape} = require('escape-goat');
htmlUnescape("I&#39;M GONNA CARRY HER!!! Fortnite With Karina!");
//=> "I'M GONNA CARRY HER!!! Fortnite With Karina!"
htmlUnescape`Title: ${"I&#39;M GONNA CARRY HER!!! Fortnite With Karina!"}`;
//=> "Title: I'M GONNA CARRY HER!!! Fortnite With Karina!"
When dealing with html encode/decode, always be wary of potential XSS exploitation.
If you want to use raw JS and not import a library, I saw something in my travels that works for the simple use case you presented. It basically strips out the separators (&# and ;) to get at the integer that represents a UTF-16 code unit. String.fromCharCode takes that integer and returns the matching character. It only handles decimal references such as &#39;, not hexadecimal or named ones.
const unescape = (str) => {
  // Replace each decimal reference (&#NN;) with the character for that code unit.
  return str.replace(/&#(\d+);/g, (match, dec) => String.fromCharCode(dec))
}
As Matt Hosch mentioned in his answer, you'd want to sanitize any data you receive to prevent an XSS.
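For anyone who wants to see how far the raw-JS approach stretches, here is a minimal sketch that also covers hexadecimal references and a handful of named entities. The names decodeEntities and NAMED are hypothetical, the named table is deliberately tiny, and this is not a substitute for a spec-conforming decoder such as he.

```javascript
// Hypothetical helper: decodes numeric character references plus a tiny
// table of named entities. This is a sketch, not a full HTML5 decoder.
const NAMED = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'" };

function decodeEntities(str) {
  return str.replace(
    /&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|([a-zA-Z]+));/g,
    (match, dec, hex, name) => {
      if (name !== undefined) {
        // Names missing from the table are left untouched.
        return Object.prototype.hasOwnProperty.call(NAMED, name) ? NAMED[name] : match;
      }
      const codePoint = dec !== undefined ? parseInt(dec, 10) : parseInt(hex, 16);
      // String.fromCodePoint throws above 0x10FFFF, so guard the range.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
  );
}

console.log(decodeEntities("I&#39;M GONNA CARRY HER!!! Fortnite With Karina!"));
// I'M GONNA CARRY HER!!! Fortnite With Karina!
```

Unlike String.fromCharCode, String.fromCodePoint handles code points above U+FFFF, and doing everything in a single replace pass avoids decoding the same text twice (so &amp;#39; becomes &#39;, not an apostrophe).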
Related
I'm refactoring a rather large RegExp into a function that returns a RegExp. As a backward-compatibility test, I compared the .source of the returned RegExp with the .source of the old RegExp:
getRegExp(/* in the case requiring backward compatibility there's no arguments */)
.source == oldRegExp.source
However, I've noticed that the old RegExp contains various excessive backslashes like [\.\w] instead of [.\w]. I'd like to refactor such bits, but there are a number of them and it would be nice to have a similar check (that backward compatibility is not broken). The problem is, /[\.\w]/.source != /[.\w]/.source. And identifying which backslashes may be removed automatically is not trivial (\. and . are not the same outside [...] and may not be in some other cases).
Are you aware of a reasonably simple way to do so? It seems this can only be done by actually parsing the .source (compare the example above with /\[\.\w]\/ and /\[.\w]\/), but maybe I'm missing some trick that uses the browser's built-in properties/methods. The point is, '\"' == '"' is true, so strings defined with these different syntaxes are stored as "normalized" values ("); I wonder if such a "normalized" pattern is available for a RegExp.
Sadly, comparing two regular expressions to see if they're the same is exactly the same as comparing any other two pieces of code, i.e., hard.
The only real way I know of to do this is to create a suite of tests, each one targeting a specific aspect of the regular expression and verifying that it works properly. This is not an easy process: regular expressions are subtle and complex, with a lot of potential for unrealized side effects. I recently had to fix some defects in a regex-based address parser and it took about a thousand unit tests before I was satisfied with my coverage... but then as soon as I started to change the regex MY TESTS CAUGHT STUFF CONSTANTLY!!
Unit testing sucks and it's just tiring and not fun, but for almost any piece of logic it has real value, and when using powerful tools like regex, I would say it's absolutely crucial.
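To make the test-suite idea concrete, here is a minimal sketch of a behavioural comparison between an old and a new RegExp. The helper findDifferences and the sample list are hypothetical; agreeing on a set of samples can only expose differences, it never proves that two patterns are equivalent.

```javascript
// Hypothetical helper: returns the sample inputs on which two regexes disagree.
function findDifferences(oldRe, newRe, samples) {
  const snapshot = (re, input) => {
    // Work on a fresh copy so a g/y flag's lastIndex never leaks between runs.
    const m = new RegExp(re.source, re.flags).exec(input);
    return m === null ? null : JSON.stringify({ index: m.index, groups: Array.from(m) });
  };
  return samples.filter((input) => snapshot(oldRe, input) !== snapshot(newRe, input));
}

const samples = ["a.b", "a_b", ".", "x", "", "[.]", "\\."];

// Inside a character class the backslash is redundant: no differences.
console.log(findDifferences(/[\.\w]+/, /[.\w]+/, samples).length); // 0

// Outside a class it is not: the two patterns disagree on "a_b".
console.log(findDifferences(/\.\w/, /.\w/, samples));
```

The more the sample list reflects real inputs and edge cases, the more such a check is worth; it complements rather than replaces targeted unit tests.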
Is it safe to write JavaScript source code (to be executed in the browser) which includes UTF-8 character literals?
For example, I would like to use an ellipsis literal in a string as such:
var foo = "Oops… Something went wrong";
Do "modern" browsers support this? Is there a published browser support matrix somewhere?
JavaScript is by specification a Unicode language, so Unicode characters in strings should be safe. You can use Unicode escapes (\u8E24) as an alternative. Make sure your script files are served with proper content type headers, including the character encoding.
Note that characters outside the Basic Multilingual Plane (those stored as a surrogate pair of two UTF-16 code units) are problematic, and that JavaScript regular expressions are terrible with such characters. (Well maybe not "terrible", but primitive at best.)
You can also use Unicode letters, Unicode combining marks, and Unicode connector punctuation characters in identifiers, in case you want to impress your friends. Thus
var wavy﹏line = "wow";
is perfectly good JavaScript (but good luck with your bug report if you find a browser where it doesn't work).
Read all about it in the spec, or use it to fall asleep at night :)
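A quick sketch of the points above, assuming the file is saved and served as UTF-8 and the engine supports ES2015 (for the \u{...} escape and the u flag). The variable names are only for illustration.

```javascript
// The literal character and its escape produce the same string.
const literal = "Oops… Something went wrong";
const escaped = "Oops\u2026 Something went wrong";
console.log(literal === escaped); // true

// A character outside the Basic Multilingual Plane occupies two UTF-16 code units.
const astral = "\u{1F600}";
console.log(astral.length);       // 2
console.log([...astral].length);  // 1 (the string iterator walks code points)

// Without the u flag, "." matches a single code unit, not the whole character.
console.log(/^.$/.test(astral));  // false
console.log(/^.$/u.test(astral)); // true
```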
I'm looking for an efficient way to take a JavaScript string and return all of the scripts which occur in that string.
Full UTF-16, including the "astral" plane / non-BMP characters which require surrogate pairs, must be correctly handled. This is possibly the main problem, since JavaScript strings expose UTF-16 code units rather than code points.
It only has to deal with codepoints so no fancy awareness of complex scripts or grapheme clusters is necessary. (This will be obvious to some of you anyway.)
Example:
stringToIso15924("παν語");
would return something like:
[ "Grek", "Hani" ]
I'm using node.js and some Unicode libraries such as XRegExp and unorm already so I don't mind adding other libraries that might already handle or ease such a feature.
I'm not aware of a JavaScript library that can look up character properties such as script codes, so this is probably the second part of the problem.
The third part of the problem is just to avoid inefficiencies.
I answered a similar question, well, at least a related one. In this pastebin you will find a (looooong) function that returns the script name for a character. It should be easy to modify it to accommodate a string.
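As an alternative to a hand-written lookup function, engines that implement ES2018 Unicode property escapes (recent Node.js versions do) can test the Script property directly. The sketch below assumes such an engine; the SCRIPTS table is hypothetical and covers only a few scripts, so characters from other scripts are silently ignored.

```javascript
// Hypothetical table mapping ISO 15924 codes to Script property tests.
// Extend it with whichever scripts you need.
const SCRIPTS = [
  ["Latn", /\p{Script=Latin}/u],
  ["Grek", /\p{Script=Greek}/u],
  ["Cyrl", /\p{Script=Cyrillic}/u],
  ["Hani", /\p{Script=Han}/u],
  ["Arab", /\p{Script=Arabic}/u],
];

function stringToIso15924(str) {
  const found = new Set();
  // for...of iterates by code point, so surrogate pairs stay together.
  for (const ch of str) {
    for (const [code, re] of SCRIPTS) {
      if (re.test(ch)) {
        found.add(code);
        break;
      }
    }
  }
  return [...found];
}

console.log(stringToIso15924("παν語")); // Grek, Hani
```

Testing one code point at a time is simple rather than fast; for long strings, one combined regex per script run over the whole string would likely do less work.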
Is there a way to use patterns like "\p{L}" in JavaScript natively?
(I suppose that is a Perl-compatible syntax.)
I'm interested first in Firefox support, and possibly WebKit.
No, \p{..} is not supported natively by any of the big browsers. However, it does work in JavaScript if you use the XRegExp library and its Unicode plugins.
Unfortunately, no. You can only specify a set of characters in the usual syntax, writing characters and ranges in brackets, but this becomes awkward since e.g. letters are scattered all around the Unicode space, with other characters between them.
There’s an inefficient workaround: fetch the UnicodeData.txt file from the Unicode site, put its content inside your JavaScript code as data, and parse it. And then you could have the data e.g. in an array of objects containing the Unicode properties, such as gc (General Category), which tells you whether the character is a letter or not. But even then, you would just have the data handy for simple testing, not as something you can use as a constituent of a regexp.
In theory, you could use the data to construct a regexp... but it would be rather large.
No, JavaScript has a slightly different syntax. To match Unicode characters you have to use escapes like \uXXXX. However, in practice, if your page and files are in UTF-8, putting non-ASCII characters in a class such as [абвг] works too.
http://www.javascriptkit.com/jsref/regexp.shtml
The library found here:
http://inimino.org/~inimino/blog/javascript_cset
seems to work for me and is fairly small and independent of other libraries.
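The answers above describe the situation when they were written. ES2018 later added Unicode property escapes to the language, so in an engine that implements them \p{L} works natively, but only together with the u flag. A minimal sketch, assuming such an engine:

```javascript
// \p{L} matches any code point whose General Category is Letter.
const letters = /^\p{L}+$/u;

console.log(letters.test("абвг"));   // true
console.log(letters.test("語"));     // true
console.log(letters.test("abc123")); // false (digits are not letters)

// Without the u flag, \p is just an escaped "p", so the pattern
// matches the literal text p{L} instead of a letter class.
console.log(/\p{L}/.test("p{L}"));   // true
```

For older engines, a library such as XRegExp remains the fallback.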
I'm trying to get a JavaScript regex that matches x opening braces, then x closing braces, while allowing them to be nested in between each other.
For example, it would match:
"{ a { q } }"
but not
"{ a { q } { }"
or
"{ } } { } {"
That being said, I have no idea how to do it with regexes, or if it's even possible.
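The short answer to this is no. Regular expressions describe regular languages, which are a strict subset of the context-free languages, and arbitrarily deep balanced nesting is context-free but not regular, so it cannot be done with a true regex. You can, however, look for specific (non-arbitrary) nesting patterns.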
http://blogs.msdn.com/b/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx
The recursion problem here is, at its heart, the same reason you can't correctly parse HTML with regex. Like XML, the construct you've described is a context-free grammar; note its close similarity with the first example from the Wikipedia article.
I've heard there are engines out there that extend regex to offer support for arbitrarily nested elements, but this would make them something other than true regex. Anyway, I don't know of any such libraries for JavaScript. I think what you want is some kind of string-manipulation-based parser.
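A string-manipulation-based check of this kind can be very small. The sketch below is one possible approach (the name isBalanced is hypothetical): it keeps a depth counter, which is exactly the piece of memory a true regular expression lacks.

```javascript
// Returns true when every "{" has a matching later "}" and vice versa.
function isBalanced(str) {
  let depth = 0;
  for (const ch of str) {
    if (ch === "{") {
      depth += 1;
    } else if (ch === "}") {
      depth -= 1;
      // A closing brace with nothing open can never be repaired later.
      if (depth < 0) return false;
    }
  }
  // Anything still open at the end is unbalanced.
  return depth === 0;
}

console.log(isBalanced("{ a { q } }"));   // true
console.log(isBalanced("{ a { q } { }")); // false
console.log(isBalanced("{ } } { } {"));   // false
```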
AFAIK, you can’t really do this with regular expressions only.
However, Javascript’s String.replace method does have a nice feature that could allow you some level of recursion. If you pass a function as the second parameter, that function will be called for each match encountered. You could then perform the same replace on that match, passing along the same function, which would be called for each match inside that match, etc.
I’m too tired right now to write up an example that fits what you’re asking for — or even if it’s actually possible, so I’ll leave it at this possible hint, and further working out as an exercise to the reader.
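One way to work the hint out, offered as a sketch rather than as what the author necessarily had in mind: instead of recursing through the replace callback, repeat a plain replace that strips the innermost brace pairs until nothing changes. The name isBalancedByStripping is hypothetical.

```javascript
// Repeatedly remove innermost "{...}" pairs (those with no braces inside).
// If any brace survives, the input was not balanced.
function isBalancedByStripping(str) {
  let current = str;
  let previous;
  do {
    previous = current;
    current = current.replace(/\{[^{}]*\}/g, "");
  } while (current !== previous);
  return !/[{}]/.test(current);
}

console.log(isBalancedByStripping("{ a { q } }"));   // true
console.log(isBalancedByStripping("{ a { q } { }")); // false
console.log(isBalancedByStripping("{ } } { } {"));   // false
```

Each pass peels off one nesting level, so the loop runs roughly once per level of depth; the regex itself never has to count.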
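That is not possible to do with a real regular expression; the "counting problem" that you're describing is an example of something that you just can't do. (Full-blown PCRE can manage it, but only through recursive patterns such as (?R), which go beyond true regular expressions and are not available in JavaScript.)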
An old textbook I had in school said, "regular expressions can't count." That's not true of modern "supercharged" regular expression implementations with the "{n,m}" quantifiers, but note that the values in curly braces there are constants.
To do that, you need a more complicated automaton. Context-free grammars can represent languages like the one you describe, as can parsing expression grammars.
Yes, it's probably possible with Regexes. No, it isn't possible in Javascript Regexes. Yes, it's probably possible in .NET Regexes for example (Balancing Groups http://msdn.microsoft.com/en-us/library/bs2twtah(v=vs.71).aspx ). No, I don't know how to do them. They give me migraine (and I'm not kidding here). They are quite extreme voodoo.