Why does this work:
<p id="emoji">&#x1f604;</p>
And this doesn't:
document.getElementById("emoji").innerHTML = String.fromCharCode(parseInt('1f604', 16));
A 'char' in JS terms is actually a UTF-16 code unit, not a full Unicode character. (This sad state of affairs stems from ancient times when there wasn't a difference*.) To use a character outside of the Basic Multilingual Plane you have to write it in the UTF-16-encoded form of a surrogate pair of two 16-bit code units:
String.fromCharCode(0xD83D, 0xDE04)
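As a rough sketch of how that pair is derived: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves. The helper name toSurrogatePair is made up for illustration; it is not a built-in.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("expected a code point outside the Basic Multilingual Plane");
  }
  const offset = codePoint - 0x10000;     // 20 bits remain
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const pair = toSurrogatePair(0x1F604);
console.log(pair.map(n => n.toString(16)));                // [ 'd83d', 'de04' ]
console.log(String.fromCharCode(...pair) === "\u{1F604}"); // true
```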
ECMAScript 6 (ES2015) adds some interfaces that let us deal with strings as if they were sequences of full Unicode code points, though they are incomplete and are only a façade over the String type, which is still stored as a code unit sequence. With those we are able to do:
String.fromCodePoint(0x1F604)
See this question for some polyfill code to let you use this feature in today's browsers.
(*: When I get access to a time machine I'm leaving Hitler alone and going back to invent UTF-8 earlier. UTF-16 must never have been!)
You can also use a hacky method if you don't want to rely on String.fromCodePoint() in your code. It consists of creating a detached element ...
elem=document.createElement('p')
... filling it with the working HTML ...
elem.innerHTML = "&#x1f604;"
... and finally getting its value
value = elem.innerHTML
In short, this works because as soon as you set the innerHTML of an HTML element, the browser parses the character reference and stores the corresponding character, so reading the value back returns the character itself.
Hope I could help.
I tried to get the Unicode value of the "🤔" emoji with JavaScript, but it does not appear when I add it as HTML. A lot of emojis do not appear.
var x="A".charCodeAt(0);
document.write(x); // gives 65
&#65; gave A
var y="🤔".charCodeAt(0);
document.write(y); // gives 55358
&#55358; didn't give 🤔
What is the reason?
Characters (or code points) in JavaScript are made up of code units. Sometimes, a code point is made up of one code unit. For example, the code point for "a" is made up of one code unit (97 or 0x61). Other code points however are made up of two code units (known as a surrogate pair) that work together to form a single character. For example, the Thinking Face emoji (π€) you shared in your question is made up of two code units which when paired together make the one code point to form the thinking face emoji. You can see this by taking the .length of it:
console.log("π€".length); // 2 (as made up of two code points)
console.log("π€".charCodeAt(0)); // 55358 or 0xD83E
console.log("π€".charCodeAt(1)); // 56596 or 0xDD14
When you use .charCodeAt() you're accessing the code units of your string elements. As seen above, "π€" comprises of two code units, 0xD83E (55358) and 0xDD14 (56596). The first code unit returned by .charCodeAt(0) only makes sense when used together with other code units, and so it doesn't work standalone, which is why you're getting the replacement character symbol (οΏ½) in your result:
console.log("π€"[0], "π€"[1]); // οΏ½ οΏ½
Rather than using the individual code units obtained by .chartCodeAt(), you can show the entire code point (ie: character), which can be obtained using .codePointAt():
console.log("π€".codePointAt(0)); // 129300 or 0x1F914
Then, you can use this number as an HTML entity:
<p>&#129300;</p>
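If you need this for arbitrary text, here is a minimal sketch that walks a string by code point rather than by code unit and emits a decimal reference for each one. The name toHtmlEntities is made up for illustration, and the sketch relies on string iteration being code-point aware, which holds in ES2015 and later.

```javascript
// Hypothetical helper: turn every code point of a string into a decimal HTML reference.
function toHtmlEntities(text) {
  // Array.from iterates by code point, so a surrogate pair stays together.
  return Array.from(text, ch => "&#" + ch.codePointAt(0) + ";").join("");
}

console.log(toHtmlEntities("A\u{1F914}")); // &#65;&#129300;
```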
This emoji, 🤔, has a decimal (dec) reference of 129300. If you want it to show in your HTML, try this:
<p>&#129300;</p>
For a full list of dec references visit w3schools' website.
I'm trying to use a PowerShell regex to pull version data from the AssemblyInfo.cs file. The regex below is my best attempt; however, it only pulls the string [assembly: AssemblyVersion(". I've put this regex into a couple of web regex testers and it LOOKS like it's doing what I want, but this is my first crack at using a PowerShell regex, so I could be looking at it wrong.
$s = '[assembly: AssemblyVersion("1.0.0.0")]'
$prog = [regex]::match($s, '([^"]+)"').Groups[1].Value
You also need to include the opening double quote in the pattern; otherwise the match starts at the beginning of the string and captures everything up to the first ".
$prog = [regex]::match($s, '"([^"]+)"').Groups[1].Value
                            ^
Try this regex "([^"]+)"
Regex101 Demo
Regular expressions can get hard to read, so best practice is to make them as simple as they can be while still solving all possible cases you might see. You are trying to retrieve the only numerical sequence in the entire string, so we should look for that and bypass using groups.
$s = '[assembly: AssemblyVersion("1.0.0.0")]'
$prog = [regex]::match($s, '[\d\.]+').Value
$prog
1.0.0.0
For the generic solution of data between double quotes, the other answers are great. If I were parsing AssemblyInfo.cs for the version string however, I would be more explicit.
$versionString = [regex]::match($s, 'AssemblyVersion\("(\d+\.\d+\.\d+\.\d+)"\)').Groups[1].Value
$version = [version]$versionString
$versionString
1.0.0.0
$version
Major  Minor  Build  Revision
-----  -----  -----  --------
1      0      0      0
Update/Edit:
A related point about parsing the version (again, if this is not a generic question about parsing text between double quotes): I would not actually have a version in the M.m.b.r format in my file, because I have always found that Major.minor is enough, and a format like 1.2.* gives you some extra information without any effort.
See Compile date and time and Can I generate the compile date in my C# code to determine the expiry for a demo version?.
When using a * for the third and fourth parts of the assembly version, these two parts are set automatically at compile time to the following values:
third part is the number of days since 2000-01-01
fourth part is the number of seconds since midnight divided by two (although some MSDN pages say it is a random number)
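To make the arithmetic concrete, here is a small sketch that turns such an auto-generated version back into a build timestamp. It is written in JavaScript only to keep the examples in one language. It assumes the day/second scheme exactly as listed above and treats the result as the build machine's local clock time. decodeBuildTime is an illustrative name, not an existing API.

```javascript
// Hypothetical helper: decode "Major.Minor.Build.Revision" produced by a 1.2.* pattern.
// Assumes Build = days since 2000-01-01 and Revision = seconds since midnight / 2.
function decodeBuildTime(version) {
  const parts = version.split(".").map(Number);
  if (parts.length !== 4 || parts.some(Number.isNaN)) {
    throw new Error("expected a four-part numeric version");
  }
  const [, , build, revision] = parts;
  // Date.UTC normalizes overflowing day and second values for us.
  // UTC is used only as a neutral container for the local clock reading.
  return new Date(Date.UTC(2000, 0, 1 + build, 0, 0, revision * 2));
}

console.log(decodeBuildTime("1.2.1.1800").toISOString()); // 2000-01-02T01:00:00.000Z
```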
Something to think about, I guess, in the larger picture of versions: whether to require 1.2.*, allow 1.2 or 1.2.3, or accept only 1.2.3.4, and so on.
I have a large valid JavaScript file (utf-8), from which I need to extract all text strings automatically.
For simplicity, the file doesn't contain any comment blocks in it, only valid ES6 JavaScript code.
Once I find an occurrence of ' or " or `, I'm supposed to scan for the end of the text block, which is where I got stuck, given all the possible variations, like "'", '"', "\'", '\"', '", `\``, etc.
Is there a known and/or reusable algorithm for detecting the end of a valid ES6 JavaScript text block?
UPDATE-1: My JavaScript file isn't just large; I also have to process it as a stream, in chunks, so Regex is absolutely not usable. I didn't want to complicate my question by mentioning joined chunks of code. I will figure that out myself if I have an algorithm that can work for a single piece of code that's in memory.
UPDATE-2: I got this working initially, thanks to the advice given here, but then I got stuck again because of regular expressions.
Examples of Regular Expressions that break any of the text detection techniques suggested so far:
/'/
/"/
/\`/
Having studied the matter more closely, by reading this: How does JavaScript detect regular expressions?, I'm afraid that detecting regular expressions in JavaScript is a whole new ball game, worth a separate question, or else it gets too complicated. But I would very much appreciate it if somebody could point me in the right direction with this issue...
UPDATE-3: After much research I found, with regret, that I cannot come up with an algorithm that would work in my case, because the presence of regular expressions makes the task far more complicated than initially thought. According to the following: When parsing Javascript, what determines the meaning of a slash?, determining the beginning and end of regular expressions in JavaScript is one of the most complex and convoluted tasks. And without it we cannot figure out whether the symbols ', " and ` are opening a text block or are inside a regular expression.
The only way to parse JavaScript is with a JavaScript parser. Even if you were able to use regular expressions, at the end of the day they are not powerful enough to do what you are trying to do here.
You could either use one of several existing parsers, which are very easy to use, or you could write your own, simplified to focus on the string extraction problem. I can hardly imagine you want to write your own parser, even a simplified one. You will spend much more time writing and maintaining it than you might think.
For instance, an existing parser will handle something like the following without breaking a sweat.
`foo${"bar"+`baz`}`
The obvious candidates for parsers to use are esprima and babel.
By the way, what are you planning to do with these strings once you extract them?
If you only need an approximate answer, or if you want to get the string literals exactly as they appear in the source code, then a regular expression can do the job.
Given the string literal "\n", do you expect a single-character string containing a newline or the two characters backslash and n?
In the former case you need to interpret escape sequences exactly like a JavaScript interpreter does. What you need is a lexer for JavaScript, and many people have already programmed this piece of code.
In the latter case the regular expression has to recognize escape sequences like \x40 and \u2026, so even in that case you should copy the code from an existing JavaScript lexer.
See https://github.com/douglascrockford/JSLint/blob/master/jslint.js, function tokenize.
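To make the distinction between the two cases concrete, here is a tiny sketch. The JSON.parse trick is only an approximation used for illustration: it handles the escapes JSON shares with JavaScript (such as \n and \u2026) but rejects others such as \x40, which is exactly why a real lexer is the safer choice. interpretJsonCompatible is a made-up name.

```javascript
// Source text of the literal's body as it would appear in a file: backslash followed by n.
const rawBody = String.raw`\n`;
// Value a JavaScript engine produces for the literal "\n": a single newline.
const interpreted = "\n";

console.log(rawBody.length);     // 2
console.log(interpreted.length); // 1

// Approximate interpretation of a double-quoted body via JSON
// (assumption: the body only uses JSON-compatible escapes).
function interpretJsonCompatible(body) {
  return JSON.parse('"' + body + '"');
}

console.log(interpretJsonCompatible(rawBody) === interpreted); // true
```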
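Try the code below:
var txt = "var z,b \n;z=10;\n b='321`1123`321321';\n c='321`321`312`3123`';";
// Naive scanner: handles ', " and ` delimiters and backslash escapes.
// It does not understand regex literals, comments, or ${...} inside template literals.
function fetchStrings(txt) {
  var result = [];
  for (var i = 0; i < txt.length; i++) {
    var quote = txt[i];
    // Possible string start characters
    if (quote === "'" || quote === '"' || quote === "`") {
      var end = i + 1;
      // Scan for the matching closing quote, skipping escaped characters
      while (end < txt.length && txt[end] !== quote) {
        end += (txt[end] === "\\") ? 2 : 1;
      }
      if (end >= txt.length) break; // unterminated string: stop scanning
      result.push(txt.slice(i + 1, end));
      // Jump to the end of the fetched string
      i = end;
    }
  }
  return result;
}
console.log(fetchStrings(txt));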
I'm trying to parse an incoming string to determine whether it contains any non-emojis.
I've gone through this great article by Mathias and am leveraging both native punycode for the encoding / decoding and regenerate for the regex generation. I'm also using EmojiData to get my dictionary of emojis.
With that all said, certain emojis continue to be pesky little buggers and refuse to match. For certain emoji, I continue to get a pair of code points.
// Example of a single code point:
console.log(punycode.ucs2.decode('💩'));
>> [ 128169 ]
// Example of a paired code point:
console.log(punycode.ucs2.decode('⌛️'));
>> [ 8987, 65039 ]
Mathias touches on this in his article (and gives an example of punycode working around this) but even using his example I get an incorrect response:
function countSymbols(string) {
  return punycode.ucs2.decode(string).length;
}
console.log(countSymbols('💩'));
>> 1
console.log(countSymbols('⌛️'));
>> 2
What is the best way to detect whether a string consists entirely of emojis? This is for a proof of concept, so the solution can be as brute force as need be.
--- UPDATE ---
A little more context on my pesky emoji above.
These are visually identical but in fact different Unicode values (the second one is from the example above):
⌛ // \u231b
⌛️ // \u231b\ufe0f
The first one works great, the second does not. Unfortunately, the second version is what iOS seems to use (if you copy and paste from iMessage you get the second one, and when receiving a text from Twilio, same thing).
U+FE0F is not a combining mark; it is a variation selector that controls the rendering of the glyph (see this answer). Removing such selectors may change the appearance of the character. Compare, for example, U+231B+U+FE0E (⌛︎).
Also, emoji sequences can be made from multiple code points. For example, U+0032 (2) is not an emoji by itself, but U+0032+U+20E3 (2⃣) or U+0032+U+20E3+U+FE0F (2⃣️) is, whereas U+0041+U+20E3 (A⃣) isn't. A complete list of emoji sequences is maintained in the emoji-data.txt file by the Unicode Consortium (the emoji-data-js library appears to have this information).
To check if a string contains emoji characters, you will need to test if any single character is in emoji-data.txt, or starts a substring for a sequence in it.
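As a brute-force sketch for a proof of concept, and not a substitute for checking against emoji-data.txt as described above, the following leans on the Unicode property escape \p{Extended_Pictographic}. That assumes an engine with ES2018 property escapes that knows this property. It accepts pictographic code points, each optionally followed by a variation selector; keycap sequences, flags, skin-tone modifiers and ZWJ sequences are deliberately not covered. Both function names are made up for illustration.

```javascript
// Variation selectors U+FE0E (text style) and U+FE0F (emoji style).
const VARIATION_SELECTORS = /[\uFE0E\uFE0F]/g;

// Hypothetical helper: count code points while ignoring variation selectors.
function countVisibleSymbols(text) {
  return Array.from(text.replace(VARIATION_SELECTORS, "")).length;
}

// Hypothetical helper: true if the string is made only of pictographic code points,
// each optionally followed by a variation selector.
function isOnlyEmoji(text) {
  return /^(?:\p{Extended_Pictographic}[\uFE0E\uFE0F]?)+$/u.test(text);
}

console.log(countVisibleSymbols("\u231B\uFE0F")); // 1
console.log(isOnlyEmoji("\u231B\uFE0F"));         // true
console.log(isOnlyEmoji("\u{1F4A9}"));            // true
console.log(isOnlyEmoji("a\u{1F4A9}"));           // false
```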
If, hypothetically, you know what non-emoji characters you expect to run into, you can use a little lodash magic via their toArray or split modules, which are emoji aware. For example, if you want to see if a string contains alphanumeric characters, you could write a function like so:
function containsAlphaNumeric(string) {
  // _.toArray splits the string into symbols, keeping emoji intact
  return _.toArray(string).some(function (char) {
    return /[a-zA-Z0-9]/.test(char);
  });
}
I have a div that contains a settings icon that is an HTML miscellaneous symbol
<span class="settings-icon">&#9881;</span>
I have a jasmine test that checks the div contents to make sure that it is not changed.
it("the settings div should contain &#9881;", function() {
  var settingsIconDiv = $('.settings-icon');
  expect(settingsIconDiv.text())
    .toContain('&#9881;');
});
It will not pass, as it is evaluated as its glyph symbol of a gear icon ⚙
How do I decode the glyph in order to pass the test?
To get the actual character from its Unicode value, so you can compare it with what the HTML reference produces, you can use String.fromCharCode(), e.g.
.toContain(String.fromCharCode(9881))
You should check against the string '⚙' or, if you do not know how to enter it in your code, the escape notation \u2699. There are other, clumsier ways to construct a string containing the character, but simplicity is best.
No matter how the character is written in HTML source code (e.g., as the reference &#9881;), it appears in the DOM as the character itself, U+2699. In JavaScript, a string like &#9881; is just a sequence of seven ASCII characters (though you can pass it to a function that parses it as an HTML character reference, or you can assign it e.g. to the innerHTML property, causing HTML parsing, but this is rather pointless and confusing).
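A quick sketch of the difference, runnable outside the browser; nothing here is specific to Jasmine, and the variable names are only illustrative:

```javascript
const gear = "\u2699";        // the character itself, U+2699
const reference = "&#9881;";  // what you would type in HTML source

console.log(gear === String.fromCharCode(9881)); // true: 9881 is 0x2699
console.log(reference.length);                   // 7: plain ASCII characters
console.log(reference === gear);                 // false: JavaScript does not parse HTML references
```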
To match the browser behavior (because you don't know how it is encoded in HTML or in text), I would try the following:
.toContain($("<span>&#9881;</span>").text()) instead of .toContain('&#9881;').
That way it should match how it is stored in the DOM.
The String.fromCharCode(9881) mentioned by Yuriy Galanter will definitely also work reliably. But because the DOM engine and the JS engine are two different parts that could behave differently, I would test with both techniques.