Converting multiple emojis into a single character [duplicate] - javascript

I'm not sure what this is called so I'm having trouble searching for it. How can I decode a string with unicode from http\u00253A\u00252F\u00252Fexample.com to http://example.com with JavaScript? I tried unescape, decodeURI, and decodeURIComponent so I guess the only thing left is string replace.
EDIT: The string is not typed, but rather a substring from another piece of code. So to solve the problem you have to start with something like this:
var s = 'http\\u00253A\\u00252F\\u00252Fexample.com';
I hope that shows why unescape() doesn't work.

Edit (2017-10-12):
@MechaLynx and @Kevin-Weber note that unescape() is deprecated outside browser environments and does not exist in TypeScript. decodeURIComponent works as a replacement for this input. For broader compatibility, use the following instead:
decodeURIComponent(JSON.parse('"http\\u00253A\\u00252F\\u00252Fexample.com"'));
> 'http://example.com'
Original answer:
unescape(JSON.parse('"http\\u00253A\\u00252F\\u00252Fexample.com"'));
> 'http://example.com'
You can offload all the work to JSON.parse

UPDATE: Please note that this solution is meant for older browsers or non-browser platforms, and is kept alive for instructional purposes. Please refer to @radicand's answer for a more up-to-date answer.
This string was escaped twice: first it was percent-encoded, then each % was written as the Unicode escape \u0025. To convert it back to normal:
var x = "http\\u00253A\\u00252F\\u00252Fexample.com";
var r = /\\u([\d\w]{4})/gi;
x = x.replace(r, function (match, grp) {
    return String.fromCharCode(parseInt(grp, 16));
});
console.log(x); // http%3A%2F%2Fexample.com
x = unescape(x);
console.log(x); // http://example.com
To explain: I use a regular expression to look for \u0025. However, since I need only a part of this string for my replace operation, I use parentheses to isolate the part I'm going to reuse, 0025. This isolated part is called a group.
The gi part at the end of the expression denotes it should match all instances in the string, not just the first one, and that the matching should be case insensitive. This might look unnecessary given the example, but it adds versatility.
Now, to convert from one string to the next, I need to execute some steps on each group of each match, and I can't do that by simply transforming the string. Helpfully, the String.replace operation can accept a function, which will be executed for each match. The return of that function will replace the match itself in the string.
I use the second parameter this function accepts, which is the group I need, and transform it into the character with that hexadecimal code, then use the built-in unescape function to decode the string to its proper form.
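The two steps described above can be sketched as follows. This is a minimal variation, not the answer's exact code: the helper name decodeUnicodeEscapes is made up for illustration, the character class is narrowed to hexadecimal digits, and decodeURIComponent is swapped in for the deprecated unescape on the assumption that the percent-encoded part is valid UTF-8.

```javascript
// Hypothetical helper (name chosen for illustration): replaces each \uXXXX
// escape with the UTF-16 code unit it names. Only hex digits are accepted.
function decodeUnicodeEscapes(input) {
  return input.replace(/\\u([0-9a-f]{4})/gi, function (match, hex) {
    // `hex` is the captured group, e.g. "0025"
    return String.fromCharCode(parseInt(hex, 16));
  });
}

// Same input as in the question: the backslashes are literal characters.
const raw = 'http\\u00253A\\u00252F\\u00252Fexample.com';

// Step 1: turn \u0025 back into "%"
const percentEncoded = decodeUnicodeEscapes(raw);

// Step 2: decode the percent-encoding (assumes valid UTF-8 sequences)
const url = decodeURIComponent(percentEncoded);

console.log(percentEncoded);
console.log(url);
```

Escapes that form a surrogate pair are handled one code unit at a time here, which still yields the right string once both halves are placed next to each other.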

Note that the use of unescape() is deprecated and doesn't work with the TypeScript compiler, for example.
Based on radicand's answer and the comments section below, here's an updated solution:
var string = "http\\u00253A\\u00252F\\u00252Fexample.com";
decodeURIComponent(JSON.parse('"' + string.replace(/\"/g, '\\"') + '"'));
http://example.com

Using JSON.parse for this comes with significant drawbacks that you must be aware of:
You must wrap the string in double quotes
Many characters are not supported and must be escaped themselves. For example, passing any of the following to JSON.parse (after wrapping them in double quotes) will error even though these are all valid: \\n, \n, \\0, a"a
It does not support hexadecimal escapes: \\x45
It does not support Unicode code point sequences: \\u{045}
There are other caveats as well. Essentially, using JSON.parse for this purpose is a hack and doesn't always work the way you might expect. You should stick with using the JSON library to handle JSON, not for string operations.
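A few of these caveats can be checked directly. The sketch below wraps JSON.parse in a hypothetical helper, tryJsonDecode (not part of any library), that reports whether decoding succeeded, and runs it on some of the inputs listed above.

```javascript
// Hypothetical helper: wraps the raw text in double quotes and lets
// JSON.parse decode it, reporting failure instead of throwing.
function tryJsonDecode(rawText) {
  try {
    return { ok: true, value: JSON.parse('"' + rawText + '"') };
  } catch (err) {
    return { ok: false, error: err.name };
  }
}

const results = {
  // A four-digit \u escape is valid JSON and decodes to "E"
  unicodeEscape: tryJsonDecode('\\u0045'),
  // \x45 is a valid JavaScript escape but not a valid JSON escape
  hexEscape: tryJsonDecode('\\x45'),
  // \u{45} (code point syntax) is not valid JSON either
  codePointEscape: tryJsonDecode('\\u{45}'),
  // An unescaped double quote ends the JSON string early
  bareQuote: tryJsonDecode('a"a'),
  // A real newline character is not allowed inside a JSON string
  realNewline: tryJsonDecode('\n'),
};

console.log(results);
```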
I recently ran into this issue myself and wanted a robust decoder, so I ended up writing one myself. It's complete and thoroughly tested and is available here: https://github.com/iansan5653/unraw. It mimics the JavaScript standard as closely as possible.
Explanation:
The source is about 250 lines so I won't include it all here, but essentially it uses the following Regex to find all escape sequences and then parses them using parseInt(string, 16) to decode the base-16 numbers and then String.fromCodePoint(number) to get the corresponding character:
/\\(?:(\\)|x([\s\S]{0,2})|u(\{[^}]*\}?)|u([\s\S]{4})\\u([^{][\s\S]{0,3})|u([\s\S]{0,4})|([0-3]?[0-7]{1,2})|([\s\S])|$)/g
Commented (NOTE: This regex matches all escape sequences, including invalid ones. If the string would throw an error in JS, it throws an error in my library; e.g., '\x!!' will error):
/
\\ # All escape sequences start with a backslash
(?: # Starts a group of 'or' statements
(\\) # If a second backslash is encountered, stop there (it's an escaped slash)
| # or
x([\s\S]{0,2}) # Match valid hexadecimal sequences
| # or
u(\{[^}]*\}?) # Match valid code point sequences
| # or
u([\s\S]{4})\\u([^{][\s\S]{0,3}) # Match surrogate code points which get parsed together
| # or
u([\s\S]{0,4}) # Match non-surrogate Unicode sequences
| # or
([0-3]?[0-7]{1,2}) # Match deprecated octal sequences
| # or
([\s\S]) # Match anything else ('.' doesn't match newlines)
| # or
$ # Match the end of the string
) # End the group of 'or' statements
/g # Match as many instances as there are
Example
Using that library:
import unraw from "unraw";
let step1 = unraw('http\\u00253A\\u00252F\\u00252Fexample.com');
// yields "http%3A%2F%2Fexample.com"
// Then you can use decodeURIComponent to further decode it:
let step2 = decodeURIComponent(step1);
// yields http://example.com

I don't have enough rep to put this under comments to the existing answers:
unescape is only deprecated for working with URIs (or any encoded UTF-8), which is probably what most people need it for. encodeURIComponent converts a JS string to percent-escaped UTF-8, and decodeURIComponent only works on percent-escaped UTF-8 bytes. It throws an error for something like decodeURIComponent('%a9'); // error, because a lone extended-ASCII byte isn't valid UTF-8 (even though A9 is still a valid Unicode value), whereas unescape('%a9'); // © So you need to know your data when using decodeURIComponent.
decodeURIComponent won't work on "%C2" or any lone byte over 0x7f, because in UTF-8 such a byte is only valid as part of a multi-byte sequence. However, decodeURIComponent("%C2%A9") // gives you © unescape wouldn't work properly on that (it returns Â©) AND it wouldn't throw an error, so unescape can lead to buggy code if you don't know your data.
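The difference can be sketched like this. The helper safeDecode is hypothetical and only exists to turn the thrown URIError into a value; the unescape call is guarded because the function is a legacy feature that may be missing in some environments.

```javascript
// Hypothetical helper: returns the decoded text, or the error name on failure.
function safeDecode(text) {
  try {
    return decodeURIComponent(text);
  } catch (err) {
    return err.name;
  }
}

// A lone byte above 0x7F is not valid UTF-8, so decoding fails.
const loneByte = safeDecode('%a9');

// The two-byte UTF-8 sequence C2 A9 decodes to the copyright sign.
const utf8Pair = safeDecode('%C2%A9');

// unescape treats every %XX as its own character, so the same input
// silently becomes two characters instead of one.
const legacyPair = typeof unescape === 'function' ? unescape('%C2%A9') : null;

console.log(loneByte, utf8Pair, legacyPair);
```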

This is not an answer to this exact question, but for those who are hitting this page via a search result and who are trying to (like I was) construct a single Unicode character given a sequence of escaped codepoints, note that you can pass multiple arguments to String.fromCodePoint() like so:
String.fromCodePoint(parseInt("1F469", 16), parseInt("200D", 16), parseInt("1F4BC", 16)) // 👩‍💼
You can of course parse your string to extract the hex codepoint strings and then do something like:
let codePoints = hexCodePointStrings.map(s => parseInt(s, 16));
let str = String.fromCodePoint(...codePoints);
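As one possible way to do the extraction step, the sketch below assumes the input lists code points in a "U+XXXX" style separated by spaces. That input format and the variable name escaped are assumptions for illustration; adjust the regular expression to whatever notation your data uses.

```javascript
// Assumed input format: hex code points written as U+XXXX, space separated.
const escaped = 'U+1F469 U+200D U+1F4BC';

// Pull out every run of 4 to 6 hex digits ("U" and "+" are not hex digits).
const hexCodePointStrings = escaped.match(/[0-9a-f]{4,6}/gi) || [];

// Convert each hex string to a number, then build one string from all of them.
const codePoints = hexCodePointStrings.map(s => parseInt(s, 16));
const str = String.fromCodePoint(...codePoints);

console.log(hexCodePointStrings);
console.log(str);
```

The result is a single string containing three code points joined by a zero-width joiner; whether it renders as one glyph depends on the font.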

In my case, I was trying to unescape an HTML file that looked something like
"\u003Cdiv id=\u0022app\u0022\u003E\r\n \u003Cdiv data-v-269b6c0d\u003E\r\n \u003Cdiv data-v-269b6c0d class=\u0022menu\u0022\u003E\r\n \u003Cdiv data-v-269b6c0d class=\u0022faux_column\u0022\u003E\r\n \u003Cdiv data-v-269b6c0d class=\u0022row\u0022\u003E\r\n \u003Cdiv data-v-269b6c0d class=\u0022col-md-12\u0022\u003E\r\n"
to
<div id="app">
<div data-v-269b6c0d>
<div data-v-269b6c0d class="menu">
<div data-v-269b6c0d class="faux_column">
<div data-v-269b6c0d class="row">
<div data-v-269b6c0d class="col-md-12">
The following works in my case:
const jsEscape = (str: string) => {
return str.replace(new RegExp("'", 'g'),"\\'");
}
export const decodeUnicodeEntities = (data: any) => {
return unescape(jsEscape(data));
}
// Use it
const data = ".....";
const unescaped = decodeUnicodeEntities(data); // Unescaped html

Related

How to remove this character only if it is not followed by 'u'?

I have this little method in some JavaScript classes that communicates with server through php, using JSON:
formUpload.prototype.cleanResult = function(dirtResult){
var res = dirtResult.slice(dirtResult.indexOf("{"), dirtResult.lastIndexOf("}")+1);
res = res.replace(/\\/g, "");
return res;
}
First, I extract only the JSON (sometimes the text result also contains warnings). Then I unescape it (remove the backslashes), because sometimes the JSON is complex and some backslashes cause errors when parsing in JavaScript. I am trying to stop those backslashes from being generated, but I'm not sure that is even possible. Meanwhile, the regular expression removes them when the response is received.
The problem is that a special character is encoded as a Unicode escape (\uXXXX), and if its backslash is removed the character is no longer recognized; the result is uXXXX in the text.
So I need a regular expression that removes a backslash only when it isn't followed by a "u", but that's beyond my knowledge so far...
Also, a pointer to a good tutorial would be great!
EDIT: here is a simple response:
{"erro":"dir","msg":"N\u00e3o existe o diret\u00f3rio e n\u00e3o foi poss\u00edvel cri\u00e1-lo.","descr":"dir:/Library/WebServer/Documents/www/sintran/fotos"}
I don't have an example of when the undesired backslashes appear, but it is related to single quotes, multidimensional arrays, etc.
You want to use a negative lookahead. From regular-expressions.info:
Negative lookahead is indispensable if you want to match something not followed by something else.
Sample code
formUpload.prototype.cleanResult = function(dirtResult){
return dirtResult.replace(/\\(?!u)/g, '');
}
Regular expression
\\(?!u)
Demo
http://regex101.com/r/qR2hA0
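To see the lookahead in action, here is a small sketch. The function name stripStrayBackslashes and the sample payload are made up for illustration (the payload imitates a response with escaped single quotes). Note that this pattern also removes the backslash from valid JSON escapes such as \" or \n, so it is only safe if the data never relies on them.

```javascript
// Hypothetical helper: removes every backslash that is NOT followed by "u",
// so \uXXXX escapes survive while stray backslashes are dropped.
function stripStrayBackslashes(text) {
  return text.replace(/\\(?!u)/g, '');
}

// String.raw keeps the backslashes exactly as typed.
// \' is not a valid JSON escape, so parsing this text as-is would fail.
const dirty = String.raw`{"msg":"N\u00e3o \'ok\'"}`;

const cleaned = stripStrayBackslashes(dirty);
const parsed = JSON.parse(cleaned);

console.log(cleaned);
console.log(parsed.msg);
```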

Using RegEx for javascript string parameter encoding

Summary
Can you use a regular expression to match multiple characters, but replace each individual character with its own specific replacement?
For instance, replace \ with \\ and replace " with \x22 and replace ' with \x27.
It is my understanding that this is simply not possible, as you can use the captured sub-matches within the expression, but not with any level of logic that would allow you to conditionally output text if a sub-match took place.
The following VB.NET code is obviously totally incorrect, but gives you an idea of my thinking... (i.e. if there was a replacement command that allowed you to say "if sub-match 1 happened, then output \\ instead")
RegEx.Replace(text, "(\)?("")?(')?", "{if($1,'\\')}{if($2,'\x22')}{if($3,'\x27')}")
(This would be for use with .NET RegEx class, but would be useful for use with javascript RegExp class)
Background
More for interest than actual need, but I've been playing with encoding text for use within javascript parameters. (Well, the need is certainly there, but the interest is efficiency.)
I've been using the standard String.Replace, and doing some tests for performance with the following two functions...
Public Function GetJSSafeString(ByVal text As String) As String
Return text.Replace("\", "\\").Replace("""", "\x22").Replace("'", "\x27")
End Function
Public Function GetJSSafeString2(ByVal text As String) As String
If text.Contains("\") Then
text = text.Replace("\", "\\")
End If
If text.Contains("""") Then
text = text.Replace("""", "\x22")
End If
If text.Contains("'") Then
text = text.Replace("'", "\x27")
End If
Return text
End Function
Using two strings, both around 200 characters in length - the first does not contain any characters to be converted - the second contains one of each character to be converted (\"'). I ran each of the two strings through the two functions 100000 times each.
The four results are coming out (in total-milliseconds) roughly as...
GetJSSafeString, no converted characters: 182.0364
GetJSSafeString, converted characters: 316.0632
GetJSSafeString2, no converted characters: 60.012
GetJSSafeString2, converted characters: 354.0708
So obviously GetJSSafeString2 is best if there are no replacements, and worst if there are characters to convert (but not much worse, so it looks like the better choice).
But it got me thinking... could this be done with a single regular expression?
And if so, would it be faster than either of the two above functions?
The solution in JavaScript:
var text="this is a test \\ with \"things\" to ' replace";
var h={'\\':'\\\\', '"':"\\x22", "'":"\\x27"}; //we define here the replacements
text=text.replace(/("|\\|')/g,function(match){return h[match]});
alert(text); //prints: this is a test \\ with \x22things\x22 to \x27 replace
Note: this document on replace is worth reading
Big thanks to @psxls for his answer, which will be useful for future javascript implementation.
His answer made me look at the overloads for the .NET RegEx.Replace function (which to be honest, I should have done in the first place, my bad)... and there is a MatchEvaluator delegate.
So I have implemented the following code as a test (to complement the code already in my question)...
Public Function GetJSSafeString3(ByVal text As String) As String
Return Regex.Replace(text, "(\\|""|')", New MatchEvaluator(AddressOf GetJSSafeString3Eval))
End Function
Public Function GetJSSafeString3Eval(ByVal textMatch As Match) As String
Select Case textMatch.Value
Case "\"
Return "\\"
Case """"
Return "\x22"
Case "'"
Return "\x27"
End Select
Return ""
End Function
And the results are as I expected... this is far, far less efficient than either of the functions in my original question. (The following are in milliseconds)
GetJSSafeString, no converted characters: 182
GetJSSafeString, converted characters: 316
GetJSSafeString2, no converted characters: 60
GetJSSafeString2, converted characters: 354
GetJSSafeString3, no converted characters: 477
GetJSSafeString3, converted characters: 856
As the majority of the strings that I will be converting will not contain any of the characters mentioned, I am implementing the GetJSSafeString2 function, as that is by far the most efficient for the majority of situations.

RegExp for remove first and last char and turn ending double slashes into single

I have the following Javascript code to obtain the inner string from an RegExp:
Function.prototype.method = function (name,func){
this.prototype[name] = func;
return this;
};
RegExp.method('toRawString', function(){
return this.toString().replace(/^.(.*).$/,"$1");
});
The purpose of this is to avoid doubling backslashes inside string literals. For example, if you have a Windows file path "C:\My Documents\My Folder\MyFile.file", you can use it like the following:
alert(/C:\My Documents\My Folder\MyFile.file/.toRawString());
However, it does not work for "C:\My Documents\My Folder\", since that causes a syntax error. The only way to avoid it is to keep the doubled backslash at the end of the string. Thus it will be written
alert(/C:\My Documents\My Folder\\/.toRawString());
The fact is that any odd number of backslashes at the end of the string will be an error, so all trailing backslashes must be doubled. It would not be hard to write a small multi-line implementation, but is there a single RegExp solution?
NOTE
When using toRawString, the RegExp object is usually NOT going to be used for any purpose other than that method. I just want to use the RegExp syntax to avoid doubled backslashes in source code. Unfortunately, the trailing doubled backslashes cannot easily be avoided. I think another workaround is to force a space at the end, but that is another question.
UPDATE
I finally solved the "another question" and posted the code here.
OK, I get what you're trying to do! It's hacky : )
Try something like:
return this.toString().slice(1, -1).replace(/\\+$/, '\\')
Hope that helps.
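A standalone sketch of that idea, written as a plain function rather than a prototype method. It reads the pattern through the regex's source property instead of toString().slice(1, -1), which is a substitution on my part so that flags do not get in the way; the trailing-backslash replacement is the same as above.

```javascript
// Hypothetical standalone version of toRawString.
function toRawString(regex) {
  // `source` is the pattern text without the surrounding slashes or flags.
  // Any run of trailing backslashes is collapsed to a single backslash.
  return regex.source.replace(/\\+$/, '\\');
}

// The trailing backslash has to be doubled inside the regex literal...
const folder = toRawString(/C:\My Documents\My Folder\\/);

// ...but a path without a trailing backslash needs no doubling at all.
const file = toRawString(/C:\My Documents\My Folder\MyFile.file/);

console.log(folder);
console.log(file);
```

This relies on \M being accepted as a harmless identity escape in a non-Unicode regex literal; letters that start a real escape (such as \d or \b) still appear verbatim in source, but \c or the u flag would be a problem.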
If you want to include the double quotes in the string just wrap it with single quotes.
s = '"C:\\My Documents\\My Folder\\MyFile.file"'
console.log(s) // Output => "C:\My Documents\My Folder\MyFile.file"
This produces a syntax error:
/C:\My Documents\/
But that regular expression could be written correctly like this:
/C:\\My Documents\\/
Or like this:
new RegExp("C:\\\\My Documents\\\\")
I think your function is just fine and is returning a correct result. Regular expressions just can't end with an unpaired backslash. It's not that you're double escaping - you're just escaping the escape character.
This would produce an error too:
new RegExp("C:\\My Documents\\")
A regular expression like this, for instance, can't be written without a pair of backslashes:
/C:\\What/
Without the second backslash, \W would be interpreted as a special character escape sequence. So escaping the escape character isn't only necessary at the end. It's required anywhere it might be interpreted as the beginning of an escape sequence. For that reason, it might be a good rule of thumb to always use two backslashes to indicate a backslash literal in a regular expression.

JS/Jquery, Match not finding the PNG = match('/gif|jpg|jpeg|png/')

I have the following code which I use to match fancybox possible elements:
$('a.grouped_elements').each(function(){
var elem = $(this);
// Convert everything to lower case to match smart
if(elem.attr('href').toLowerCase().match('/gif|jpg|jpeg|png/') != null) {
elem.fancybox();
}
});
It works great with JPGs but it isn't matching PNGs for some reason. Anyone see a bug with the code?
Thanks
A couple of things.
match expects a RegExp object, not a string. A string argument is converted to a regular expression, so the slashes inside the quotes are treated as literal characters to match rather than as delimiters.
"gif".match('/gif|png|jpg/'); // null
Without the quotes
"gif".match(/gif|png|jpg/); // ["gif"]
Also, you would want to check these at the end of a filename, instead of anywhere in the string.
"isthisagif.nope".match(/(gif|png|jpg|jpeg)/); // ["gif", "gif"]
Only searching at the end of string with $ suffix
"isthisagif.nope".match(/(gif|png|jpg|jpeg)$/); // null
No need to make href lowercase, just do a case insensitive search /i.
Look for a dot before the image extension as an additional check.
And some tests. I don't know how you got any results back with using a string argument to .match. What browser are you on?
I guess the fact that it'll match anywhere in the string (it would match "http://www.giftshop.com/" for instance) could be considered a bug. I'd use
/\.(gif|jpe?g|png)$/i
You are passing a string to the match() function rather than a regular expression. In JavaScript, strings are delimited with quotes (single or double), and regular expression literals are delimited with forward slashes. If you use both, you have a string, not a regex.
This worked perfectly for me: /.+\.(gif|png|jpe?g)$/i
.+ -> any string
\. -> followed by a point.
(gif|png|jpe?g) -> and then followed by any of these extensions. jpeg may or may not have the letter e.
$ -> and then the end of the string is expected
/i -> case insensitive mode: matches both sflkj.JPG and lkjfsl.jpg
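Putting the answers together, the sketch below compares the quoted pattern from the question with a regex literal anchored at the end of the string. The helper name isImageLink and the sample URLs are made up for illustration. When match receives a string it builds a RegExp from it, so '/gif|jpg|jpeg|png/' means "/gif" or "jpg" or "jpeg" or "png/", which would explain why JPGs matched and PNGs did not.

```javascript
// Extension check anchored at the end of the string, case insensitive.
const imageExtension = /\.(gif|jpe?g|png)$/i;

// Hypothetical helper name; returns true when the link ends in an image extension.
function isImageLink(href) {
  return imageExtension.test(href);
}

// The quoted pattern from the question: the slashes become literal characters.
const quotedJpg = 'photo.jpg'.match('/gif|jpg|jpeg|png/'); // matches "jpg"
const quotedPng = 'photo.png'.match('/gif|jpg|jpeg|png/'); // no "png/" present

const checks = {
  upperCasePng: isImageLink('photo.PNG'),
  jpeg: isImageLink('images/cat.jpeg'),
  giftShop: isImageLink('http://www.giftshop.com/'),
};

console.log(quotedJpg, quotedPng, checks);
```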
