How to get the nth (Unicode) character from a string in JavaScript


Suppose we have a string with some (astral) Unicode characters:
const s = 'Hi 👋 Unicode!'
The [] operator and .charAt() method don't work for getting the 4th character, which should be "👋":
> s[3]
'�'
> s.charAt(3)
'�'
The .codePointAt() method does get the correct value for the 4th character, but unfortunately it returns a number, which has to be converted back to a string using String.fromCodePoint():
> String.fromCodePoint(s.codePointAt(3))
'👋'
Similarly, converting the string into an array using spread syntax yields valid Unicode characters, so that's another way of getting the 4th one:
> [...s][3]
'👋'
But I can't believe that going from string to number and back to string, or having to split the string into an array, are the only ways of doing this seemingly trivial thing. Isn't there a simple method for doing this?
> s.simpleMethod(3)
'👋'
Note: I know that the definition of "character" is somewhat fuzzy, but for the purpose of this question a character is simply the symbol that corresponds to a Unicode code point (no combining characters, no grapheme clusters, etc.).
Update: the String.fromCodePoint(str.codePointAt(n)) approach is not really viable, since the position n is counted in UTF-16 code units and so doesn't take previous astral symbols into account: String.fromCodePoint('👋🙈'.codePointAt(1)) // => '�'
(I feel kinda dumb asking this; like I'm probably missing something obvious. But previous answers to this question don't work on strings with Unicode symbols on astral planes.)

The string iterator walks a string by code point rather than by UCS-2/UTF-16 code unit. So:
const string = 'Hi 👋 Unicode!';
for (const symbol of string) {
  console.log(symbol);
}
So to get a specific code point based on its index from a string:
const string = 'Hi 👋 Unicode!';
// Note: The spread operator uses the string iterator under the hood.
const symbols = [...string];
symbols[3]; // '👋'
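If building an array of every symbol feels wasteful for long strings, the same iterator can be walked by hand and stopped early. This is only a sketch: the helper name nthCodePoint is made up for illustration and is not a built-in.

```javascript
// Hypothetical helper (not a built-in): returns the nth code point (0-based)
// as a string, or undefined if the string has fewer than n + 1 code points.
function nthCodePoint(str, n) {
  let i = 0;
  for (const symbol of str) { // the string iterator yields whole code points
    if (i === n) return symbol;
    i++;
  }
  return undefined;
}

console.log(nthCodePoint('Hi 👋 Unicode!', 3)); // '👋'
console.log(nthCodePoint('👋🙈', 1)); // '🙈'
console.log(nthCodePoint('Hi', 5)); // undefined
```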
Still, this would break with grapheme clusters, or emoji sequences such as 👨‍👩‍👧‍👦 (👨 + U+200D ZERO WIDTH JOINER + 👩 + U+200D ZERO WIDTH JOINER + 👧 + U+200D ZERO WIDTH JOINER + 👦). Text segmentation helps with that.
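For text segmentation, one option is Intl.Segmenter, in runtimes that ship it. The sketch below assumes such a runtime and uses the family emoji from above, written with escapes so the zero width joiners are visible.

```javascript
// The family emoji: four people joined by U+200D ZERO WIDTH JOINER.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
const text = 'Hi ' + family + '!';

// Assumes Intl.Segmenter is available in this runtime.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
const graphemes = Array.from(segmenter.segment(text), (part) => part.segment);

console.log([...family].length); // 7 code points: 4 people + 3 joiners
console.log(graphemes.length); // 5 user-perceived characters
console.log(graphemes[3] === family); // true
```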
Do you actually need to get the 4th code point in the string, though? What’s your use case?

You can use the u (Unicode) flag on a regular expression, if it's available to you.
const chars = 'Hi 👋 Unicode!'.match(/./ug);
console.log(chars);
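One caveat with this approach: `.` does not match line terminators by default, so a string containing newlines would lose them from the result and shift the indexes. A sketch that adds the s (dotAll) flag, assuming the runtime supports it:

```javascript
const s = 'Hi 👋\nUnicode!';

const withoutDotAll = s.match(/./gu); // the newline is skipped
const withDotAll = s.match(/./gsu); // every code point is kept

console.log(withDotAll[3]); // '👋'
console.log(withoutDotAll.length); // 12
console.log(withDotAll.length); // 13
```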

The accepted answer to this question is out of date.
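There is now a member of the String object called .at(). Be careful, though: the version that was standardized in ES2022 indexes UTF-16 code units, just like [] and .charAt(), so 'Hi 👋 Unicode!'.at(3) still returns a lone surrogate. Only the earlier, code-point-aware proposal (and shims that implement it) does what you're hoping for. If you have shims, shams, a transpiler like TypeScript or Babel, etc., check which behavior your local configuration gives you.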
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/at
Amusingly, the spec for this feature, as well as the most common implementation shim (the one that I use), was written by the person who authored the now out-of-date accepted answer here. So even when he's out of date, he's still up to date.
If shimming or transcompiling isn't appropriate for you, there's a library called jsesc that can handle it for you through simple escaping. I'll give you three guesses who wrote the library. First two don't count.
https://www.npmjs.com/package/jsesc
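Before relying on .at(), it may be worth checking what it does in your own runtime. The sketch below assumes an engine with the ES2022 String.prototype.at(), where the index counts UTF-16 code units. A code-point-aware shim would behave differently.

```javascript
const s = 'Hi 👋 Unicode!';

// With the ES2022 method, index 3 is the first half of the surrogate pair.
const viaAt = s.at(3);
const atIsCodeUnitBased = viaAt.length === 1 && viaAt === s.charAt(3);
console.log(atIsCodeUnitBased); // true in engines with the ES2022 method

// The string iterator still gives the whole code point.
const viaIterator = [...s][3];
console.log(viaIterator === '👋'); // true
```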

Related

JavaScript - Dynamic RegExp for Duplicate Characters

I'm very new to programming, and I'm trying to solve an exercise where you encode a string (in this case, a single word) based on whether or not the constituent characters occur twice or more. Characters occurring only once encode to, say, "■"; characters occurring twice or more encode to, say, "X".
Example: input = "hippodrome" :: output = "■■XXX■■X■■"
I managed to solve it in a very convoluted way using nested loops and a key:value object storing character:occurrences, but I am trying to refactor the solution to be more efficient using a dynamically created RegExp, and I think I'm not understanding regex notation.
function encodeDupes(word) {
  let encoded = "";
  for (let char of word) {
    let regex = new RegExp(char + "{2,}","ig"); // create a regex to see if "char" occurs 2 or more times
    regex.test(word) ? encoded += "X" : encoded += "■"; // check this char against rest of word, push appropriately
  }
  return encoded;
}
It works with a simpler gate like char < "m" ? do X : do Y, and I thought I understood this answer here ({n,} = at least n occurrences), but I'm new enough that I'm still not sure if it's my regex or my logic.
Thank you!

I'm very new to programming, ..., I am trying to refactor the solution to be more efficient using a dynamically created RegExp...
That's a bit of a catch-22, because regular expressions trade efficiency for convenience. In order for the regular expression "engine" to run, a grammar must be established, and a lexer, parser, and evaluator transform the string-based input expressions into program output. It's (sometimes) convenient to implement a particular program using regular expressions, but it's almost impossible to beat a fundamental algorithm that isn't slowed down by the regular expression engine.
I managed to solve it in a very convoluted way using nested loops and a key:value object storing character:occurrences ...
Convoluted indeed, but sadly not uncommon to see even "expert" programmers do such things. An efficient algorithm emerges when we realise we don't need to count each letter. Instead, we only need to know whether a letter occurs more than once. Using two Set objects, once and more, we can determine the answer without needing to allocate counter memory per letter! And sets are lightning fast, thanks to O(1) constant-time lookup -
function encodeDupes(word)
{ const once = new Set
  const more = new Set
  for (const c of word)
    if (more.has(c))
      continue
    else if (once.has(c))
      (once.delete(c), more.add(c))
    else
      once.add(c)
  return Array
    .from(word, c => more.has(c) ? "X" : "■")
    .join("")
}
console.log(encodeDupes("hippodrome"))
Output
■■XXX■■X■■

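A RegExp describes a pattern of characters to look for in a string.
A quantifier such as {2,} means "the preceding item, two or more times in a row", so it only finds consecutive repeats. That is why your version misses the two o's in "hippodrome": they are not next to each other. Here's an example with the letter n: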
n{2,}
nn # match
anna # match
nan # does not match
The following RegExp isn't perfect, but it should suffice; replace n with the character of your choice:
(.*n{2,}.*)|(.*n.*n.*)+
(.*n{2,}.*) —— for consecutive ‘n’s
| —— or
(.*n.*n.*)+ —— ‘n’s with anything in between
Let me know how it goes.
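Putting that idea back into the original function could look like the sketch below. The escapeRegExp helper is an addition of this sketch (characters such as "." or "+" would otherwise be read as regex syntax), and the pattern is simplified to "char, anything, char", which also covers the consecutive case.

```javascript
// Helper added for this sketch: escape characters that are special in a regex.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function encodeDupes(word) {
  let encoded = '';
  for (const char of word) {
    const c = escapeRegExp(char);
    // "c, anything, c": the character occurs at least twice anywhere in the word.
    // No "g" flag, so test() keeps no state between calls.
    const regex = new RegExp(c + '.*' + c, 'is');
    encoded += regex.test(word) ? 'X' : '■';
  }
  return encoded;
}

console.log(encodeDupes('hippodrome')); // ■■XXX■■X■■
```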

JavaScript: How to get specific text in a string using RegEx

I stumbled into this RegEx thing and I googled it. A lot. But unfortunately I didn't quite understand how RegEx works...
So to make this quick, since only a teeny-weeny part of my work requires it, I will be needing you guys again :))
So here it goes...
All I want is to retrieve a specific string with a format of 0000x0000. For example:
Input:NameName975x945NameName
Output:
975x945
Must also consider string like this:
NameNameName9751x9451NameNameName
(the integer and string are longer...)

Use a regex with String.prototype.match() to get a specific part of the string.
str.match(/\d+x\d+/)[0]
var str = "NameName975x945NameName";
var match = str.match(/\d+x\d+/)[0];
console.log(match)
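Note that match() returns null when nothing matches, so indexing with [0] directly would throw on such input. A slightly more defensive sketch follows; the function name extractDimensions and the width/height labels are assumptions made for illustration.

```javascript
// Hypothetical helper: returns the matched text plus both numbers,
// or null when the input contains no "<digits>x<digits>" part.
function extractDimensions(str) {
  const match = str.match(/(\d+)x(\d+)/);
  if (match === null) return null;
  return {
    text: match[0],
    width: Number(match[1]), // "width" and "height" are illustrative labels
    height: Number(match[2]),
  };
}

console.log(extractDimensions('NameName975x945NameName').text); // '975x945'
console.log(extractDimensions('NameNameName9751x9451NameNameName').text); // '9751x9451'
console.log(extractDimensions('NameName')); // null
```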

We need a bit more detail, but I'll go in order:
Assuming there can be any number of digits before and after the x, and these can be of different lengths:
[\d]+x[\d]+
Assuming the number of digits before the x needs to be equal to the number of digits after the x (as in your example) and this number is finite (and small enough so that your regex isn't obscenely long):
[\d]{1}x[\d]{1}|[\d]{2}x[\d]{2}|[\d]{3}x[\d]{3} (and so on)
Check out this related answer for more details on handling this as the length of the number gets longer.
Then you can use String.prototype.match() with your regex to grab the matches within your string.

Use Regex in JS to find a particular pattern in a particular spot in a string

I'm not particularly strong with Regular Expressions. Basically, I have the following string:
Showing 1-20 of 748 results.
I want to extract the "748", convert it to a number, and use it for comparisons. "Showing", "of", and "results" are not expected to change, but the numbers could. I have a couple of solutions in mind. The first is using lookbehinds, but I do not believe JS supports them. The second is a more blunt approach: finding all the numbers in the string using match() and taking the third element of the returned array (index 2, which should be "748").
Any thoughts on the best way to do this?

I would use the regex:
Showing \d+-\d+ of (\d+) results\.
where \d+ in each case means to match 1 or more digits. The parentheses around the number you want to find are called a capture group.
So if the search string was in str, the resulting JavaScript might look like:
var resultsRe = /Showing \d+-\d+ of (\d+) results\./;
var match = resultsRe.exec(str); // null if the string doesn't match
var numResults = match ? Number(match[1]) : NaN; // capture group 1 holds the total
console.log("There are " + numResults + " results.");

For a simple approach you could do the following:
(\d+)\sresults
All it does is capture the integer directly before the word results.
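As a sketch of how that might be used, here is the simple pattern next to the "blunt" approach from the question (all numbers, then the third one). The variable names are illustrative only.

```javascript
const str = 'Showing 1-20 of 748 results.';

// Simple approach: capture the integer directly before "results".
const match = /(\d+)\sresults/.exec(str);
const total = match ? parseInt(match[1], 10) : NaN;

// Blunt approach: collect every number and take the third one (index 2).
const allNumbers = str.match(/\d+/g) || [];
const totalBlunt = Number(allNumbers[2]);

console.log(total); // 748
console.log(totalBlunt); // 748
console.log(total > 500); // true
```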

negative number in parentheses using javascript

I use match to split a mathematical expression into separate strings and save them in an array.
var STRING = ST.match(/\d*\.\d+|\d+|[()/*+-]/g);
But this method separates everything, including negative numbers that are inside parentheses.
For example, (-2+4) does not give me -2; instead it saves - in one index of the STRING array and 2 in the next index.
Is there any way to use match and keep negative numbers that are in the parentheses?
This is what I want:
(-2+4):
STRING[0] give me (
STRING[1] give me -2
STRING[2] give me +
STRING[3] give me 4
STRING[4] give me )
and if there is no negative number work as normal:
(2+4):
STRING[0] give me (
STRING[1] give me 2
STRING[2] give me +
STRING[3] give me 4
STRING[4] give me )

I don't think it's possible to parse complex cases like "(-2+4*-(3.5--8))" with just a regex, especially given that we don't have negative lookbehind in JavaScript.
A solution would be to postprocess your match array by merging signs when they're between a separator and an unsigned expression.
In my opinion a regex is useful here, but only for the primary tokenization. Most of the work will be ahead of you as you'll build the binary expression tree (or any other formal representation you choose).
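A minimal sketch of that post-processing step, assuming a "-" counts as a sign only when it starts the expression or follows "(" or another operator, and is immediately followed by a number. The function name tokenize is made up for illustration, and a minus in front of a parenthesis, as in -(3), is left as a separate token.

```javascript
function tokenize(expr) {
  // Primary tokenization: same regex as in the question.
  const raw = expr.match(/\d*\.\d+|\d+|[()\/*+-]/g) || [];
  const tokens = [];
  for (let i = 0; i < raw.length; i++) {
    const tok = raw[i];
    const prev = tokens[tokens.length - 1];
    const next = raw[i + 1];
    // A sign is allowed at the start, after "(", or after another operator.
    const prevAllowsSign = prev === undefined || /^[(\/*+-]$/.test(prev);
    const nextIsNumber = next !== undefined && /^[\d.]/.test(next);
    if (tok === '-' && prevAllowsSign && nextIsNumber) {
      tokens.push('-' + next); // merge the sign into the number
      i++; // skip the number that was just merged
    } else {
      tokens.push(tok);
    }
  }
  return tokens;
}

console.log(tokenize('(-2+4)')); // [ '(', '-2', '+', '4', ')' ]
console.log(tokenize('(4-2)')); // [ '(', '4', '-', '2', ')' ]
```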

Unfortunately, if what you're trying to do is parsing a mathematical expression, regexps can not be used.
Regular expressions can recognize languages that are describable by regular grammars, and arithmetic expressions are not: they are described by a context-free grammar (CFG). If you want to parse, and perhaps interpret the result, you'll certainly need some stack-based state machine.
You can look at something like this well known algorithm.
Hope this helps.

You can add an optional sign to the numbers, that would work with your example:
var STRING = ST.match(/-?\d*\.\d+|-?\d+|[()/*+-]/g);
However, that will also turn a minus operator into a sign. The expression (4-2) would give you { "(", "4", "-2", ")" }.
Also, it will happily "parse" an expression like +---((((*** without complaining. If you want a result that makes sense, you should parse it for real, not just split it with a regular expression.

I think you have a mistake in your RegExp. Try this; it works for me:
var STRING = ST.match(/(\d*)(\.)(\d+)|(\d+)|[()\/*+-]/g);

Regex for integer, integer + dot, and decimals

I have searched StackOverflow and I can't find an answer as to how to check for regex of numeric inputs for a calculator app that will check for the following format with every keyup (jquery key up):
Any integer like: 34534
When a dot follows the integer when the user is about to enter a decimal number like this: 34534. Note that a dot can only be entered once.
Any float: 34534.093485
I don't plan to use commas to separate the thousands...but I would welcome if anyone can also provide a regex for that.
Is it possible to check the above conditions with just one regex? Thanks in advance.

Is a lone . a successful match or not? If it is then use:
\d+(\.\d*)?|\.\d*
If not then use:
\d+(\.\d*)?|\.\d+
Rather than incorporating commas into the regexes, I recommend stripping them out first: str = str.replace(/,/g, ''). Then check against the regex.
That wouldn't verify that digits are properly grouped into groups of three, but I don't see much value in such a check. If a user types 1,024 and then decides to add a digit (1,0246), you probably shouldn't force them to move the comma.
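For validating the whole input on each keyup, the pattern needs to be anchored with ^ and $, otherwise it would succeed on any string that merely contains a number. A sketch using the first variant (a lone dot is accepted); the names partialNumber and isValidInput are illustrative.

```javascript
// Anchored version of \d+(\.\d*)?|\.\d* so the entire input must match.
const partialNumber = /^(?:\d+(?:\.\d*)?|\.\d*)$/;

function isValidInput(text) {
  // Strip commas first, as suggested above, then test the remainder.
  return partialNumber.test(text.replace(/,/g, ''));
}

console.log(isValidInput('34534')); // true
console.log(isValidInput('34534.')); // true
console.log(isValidInput('34534.093485')); // true
console.log(isValidInput('1,024')); // true
console.log(isValidInput('12.3.4')); // false
```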

Let's write out your specifications, and develop from that.
Any integer: \d+
A dot, optionally followed by an integer: \.\d*
Combine the two and make the latter optional, and you get:
\d+\.?\d*
As for handling commas, I'd rather not go into it, as it gets very ugly very fast. You should simply strip all commas from input if you still care about them.
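A quick check of that combined pattern against the three formats from the question, anchored so the whole input has to match. This is a sketch; note that this version requires at least one leading digit, so ".5" is rejected.

```javascript
const pattern = /^\d+\.?\d*$/;

const accepted = ['34534', '34534.', '34534.093485'].map((s) => pattern.test(s));
const rejected = ['.5', '1.2.3', ''].map((s) => pattern.test(s));

console.log(accepted); // [ true, true, true ]
console.log(rejected); // [ false, false, false ]
```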

You can use it in this way:
/^\d+\.?\d*$/
I think this can be used for any of your cases,
whether it's 12445 or 1244. or 12445.43

I'm going to throw in a potentially downvoted answer here - this is a better solution:
function valid_float (num) {
  var num = (num + '').replace(/,/g, ''), // don't care about commas, this turns `num` into a String
      float_num = parseFloat(num);
  return float_num == num || float_num + '.' == num; // allow for the decimal point, deliberately using == to ignore type as `num` is a String now
}
Any regex that does your job correctly will come with a big asterisk after it saying "probably", and if it's not spot on, it'll be an absolute pig to debug.
Sure, this answer isn't giving you the most awesomely cool one-liner that's going to make you go "Cool!", but in 6 months time when you realise it's going wrong somewhere, or you want to change it to do something slightly different, it's going to be a hell of a lot easier to see where, and to fix.

I'm using ^(\d+)(\.(\d+))+$ to capture each integer and to allow an unlimited length, so long as the string begins and ends with integers and has dots between each integer group. I'm capturing the integer groups so that I can compare them.
