Why does substring not handle negative indices? [closed] - javascript

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
substr() handles negative indices perfectly, but substring() only accepts non-negative indices.
Is there a reason to avoid substr in favor of substring? Negative indices are so useful in a lot of cases, if you view the space of indices as a cyclic group. Why is substr marked "deprecated" by MDN?

substring is when you want to specify a starting and ending index. substr is when you want to specify a starting offset and a length. They do different things and have different use cases.
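For instance (a small sketch of mine, not from the original answer):
const s = "JavaScript";
s.substring(4, 10); // "Script" -- characters from index 4 up to, but not including, index 10
s.substr(4, 6);     // "Script" -- 6 characters starting at offset 4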
Edit:
To better answer the exact question of
Why does substring not handle negative indices?
substring specifies a starting and ending index of characters in a string. substr deals with a starting offset and a length. It makes sense to me that substring does not allow a negative index, because there really isn't such a thing as a negative index (the characters in a string are indexed from 0 to n, and a "negative index" would be out of bounds). Since substr deals with an offset rather than an index, I feel the term offset is loose enough to allow a negative offset, which of course means counting backwards from the end of the string rather than forward from the beginning. This might just be semantics, but it's how I make sense of it.
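To illustrate how the two treat a negative first argument (again my own sketch, not part of the original answer):
const s = "JavaScript";
s.substr(-6);       // "Script" -- the offset counts backwards from the end
s.substring(-6);    // "JavaScript" -- a negative index is clamped to 0
s.substring(-6, 4); // "Java" -- i.e. the same as substring(0, 4)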
Why is substr deprecated?
I would argue that it is in fact not deprecated.
The revision history for the MDN substr page shows that the deprecation notice was added based on this blog post:
Aug 16, 2016, 12:00:34 AM
hexalys
add deprecated mention per https://blog.whatwg.org/javascript
Which states that the HTML string methods are deprecated (which they should be!). These are methods that wrap a string in an HTML tag, i.e., "abc".sub() would return <sub>abc</sub>. The blog post lists out all of the HTML string methods and, imho, erroneously includes substr as an HTML string method (it isn't).
So this looks like a misunderstanding to me.
(Excerpt below, emphasis added by me)
Highlights:
The infamous “string HTML methods”: String.prototype.anchor(name), String.prototype.big(), String.prototype.blink(),
String.prototype.bold(), String.prototype.fixed(),
String.prototype.fontcolor(color), String.prototype.fontsize(size),
String.prototype.italics(), String.prototype.link(href),
String.prototype.small(), String.prototype.strike(),
String.prototype.sub(), String.prototype.substr(start, length), and
String.prototype.sup(). Browsers implemented these slightly
differently in various ways, which in one case lead to a security
issue (and not just in theory!). It was an uphill battle, but
eventually browsers and the ECMAScript spec matched the behavior that
the JavaScript Standard had defined.
https://blog.whatwg.org/javascript

substr is particularly useful when you are only interested in the last N characters of a string of unknown length.
For example, if you want to know if a string ends with a single character:
function endsWith(str, character) {
  return str.substr(-1) === character;
}
endsWith('my sentence.', '.'); // => true
endsWith('my other sentence', '.'); // => false
Implementing this same function using substring would require you to calculate the length of the string first.
function endsWith(str, character) {
  var length = str.length;
  return str.substring(length - 1, length) === character;
}
Both functions can be used to get the same results, but having substr is more convenient.

There are three functions in JS that do more or less the same thing:
substring
substr
slice
I guess most people use the last one, because it matches its array counterpart. The first two are more or less historical relics (substring was in JS1, then substr came along in two different flavours, etc.).
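A quick side-by-side of the three (my own sketch, with an arbitrary example string):
const s = "JavaScript";
s.substring(4, 9); // "Scrip" -- start index, end index (exclusive)
s.slice(4, 9);     // "Scrip" -- same, but negative arguments count from the end
s.slice(-6, -1);   // "Scrip"
s.substr(4, 5);    // "Scrip" -- start offset, length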
Why is substr marked "deprecated" by MDN?
The notice was added per this post by Mathias, where substr is listed under "string HTML methods" (?). The actual reason for the deprecation is that it belongs to Annex B, which says:
This annex describes various legacy features and other characteristics of web browser based ECMAScript implementations. All of the language features and behaviours specified in this annex have one or more undesirable characteristics and in the absence of legacy usage would be removed from this specification.

Related

Converting between Bases, and from a String to any Base N in JavaScript (Radix conversion)

First post on here!
I've done a couple of hours of research, and I can't seem to find any actual answers to this, though it may be my understanding that's wrong.
I want to convert a string, let's say "Hello 123", into any base N, let's say N = 32 for simplicity.
My Attempt
Using JavaScript's built-in methods (found through other websites):
function stringToBase(string, base) {
  return parseInt(string, 10).toString(base);
}
So, this parses the string as a base-10 (decimal) number and then converts it into the base I want. However, the caveat is that toString(base) only works for bases from 2 to 36, which is good, but not really the range I'm looking for.
More
I'm aware that I can use the JS BigInt, but I'm looking to convert with bases as high as 65536, using an arbitrary character set that isn't limited to ASCII (yes, I'm aware it's completely useless, I'm just having some fun and I'm very persistent). Most solutions I've seen use an alphabet string or array (e.g. "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+-").
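For what it's worth, here is a minimal sketch of that kind of approach (my own code, with a hypothetical stringToBaseN helper and alphabet64 string): pack the string's UTF-8 bytes into a BigInt, then repeatedly divide by the alphabet length to peel off digits. The same code works for an alphabet of any length, 65536 included.
function stringToBaseN(str, alphabet) {
  const base = BigInt(alphabet.length);
  // Pack the UTF-8 bytes of the string into one big integer.
  let value = 0n;
  for (const byte of new TextEncoder().encode(str)) {
    value = (value << 8n) | BigInt(byte);
  }
  // Peel off base-N digits, least significant first, prepending each one.
  let digits = '';
  while (value > 0n) {
    digits = alphabet[Number(value % base)] + digits;
    value /= base;
  }
  return digits || alphabet[0];
}
// Example with a 64-character alphabet. Note that leading zero bytes are lost
// with this simple packing, which a real encoder would have to handle.
const alphabet64 = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+-';
console.log(stringToBaseN('Hello 123', alphabet64));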
I've seen a couple of threads that say that encoding to a radix which is not divisible by 2 won't work; is that true? Bases like 85 and 91 do exist, after all.
I know that the methods atob() and btoa() exist, but those are only for Base64.
Some links:
I had a look at this GitHub page: https://github.com/gliese1337/base-to-base/blob/main/src/index.ts , but it's in TypeScript and I'm not even sure what's going on.
This one is in JS: https://github.com/adanilo/base128codec/blob/master/b128image.js . It makes a bit more sense than the last one, but the fact that there is a whole GitHub repo just for Base 128 sort of implies that these encodings are all unique and may not be easily converted.
This is the aim of the last and final base: https://github.com/qntm/base65536 . The output of "Hello World!" for instance, is "驈ꍬ啯𒁗ꍲ噤".
(I can code Java much better than JS, so if there is a Java solution, please let me know as well.)

Why does the JavaScript `Intl.Collator.prototype.compare()` method yield a different result than traditional UTF-16 comparison for special characters?

Today, I stumbled onto a weird issue with the JavaScript / ECMAScript Internationalization API for which I can't find a suitable explanation anywhere. I am getting different results when comparing two specific characters, the forward slash (/) and the underscore (_), using:
plain-vanilla / traditional UTF-16 based comparison
the Intl.Collator.prototype.compare() method
The Plain / Traditional UTF-16 based comparison
// Vanilla JavaScript comparator
const cmp = (a, b) => a < b ? -1 : a > b ? 1 : 0;
console.log(cmp('/', '_'));
// Output: -1
// When sorting
const result = ['/', '_'].sort(cmp);
console.log(result);
// Output: ['/', '_']
The Intl.Collator.prototype.compare() method
const collator = new Intl.Collator('en', {
  sensitivity: 'base',
  numeric: true
});
console.log(collator.compare('/', '_'));
// Output: 1
// When sorting
const result = ['/', '_'].sort(collator.compare);
console.log(result);
// Output: ['_', '/']
Questions
Why do both techniques yield different results? Is this a bug in the ECMAScript implementation? What am I missing / failing to understand here? Are there other such character combinations which would yield different results for the English (en) language / locale?
Edit 2021-10-01
As #t-j-crowder pointed out, I have replaced all "ASCII" with "UTF-16".
In general
When you use < and > on strings, they're compared according to their UTF-16 code unit values (not ASCII, but ASCII overlaps with those values for many common characters). This is, to put it mildly, problematic. For instance, ask the French if "z" < "é" should really be true (indicating that z comes before é):
console.log("z" < "é"); // true?!?!
When you use Intl.Collator.prototype.compare, it uses an appropriate collation (loosely, ordering) for your locale according to the options you provide. That is likely to be different from the results for UTF-16 code unit values in many cases. For instance, even in an en locale, Collator returns the more reasonable result that z comes after é:
console.log(new Intl.Collator("en").compare("z", "é")); // 1
_ and / specifically
I can't tell you specifically why _ and / have a different order from their UTF-16 code units in the en locale you're using (and also the one that I'm using), whether it's en-US, en-UK, or something else. But it's not surprising to find that their order differs between ASCII and Unicode. (Remember, the UTF-16 code unit values for _ and / come from their ASCII values.)
ASCII's order was carefully designed in the early 1960s (there's a PDF that goes into wonderful detail about it), but largely without respect to linguistic ordering other than the ordering of A-Z and 0-9. / was in the original ASCII from 1963. _ wasn't added until 1967, in one of the available positions which was higher numerically than /. There's probably no more significant reason than that why _ is later/higher (numerically) than / in ASCII.
Unicode's collation order was carefully designed in the 1990s (and on through to today) with different goals (including linguistic ones), design requirements, and design constraints. As far as I can tell (I'm not a Unicode expert), Unicode's collation is described by TR10 and Part 5 of TR35. I haven't found a specific rationale for why _ is before / in the root collation (en uses the root collation). I'm sure it's in there somewhere. I did notice that one aspect of it seems to be grouping by category. The category of _ is "Connector punctuation" while the category of / is "Other punctuation." Perhaps that has something to do with why / is later than _.
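If you want to see those categories for yourself, Unicode property escapes in a regex will confirm them (a small sketch of mine):
console.log(/\p{Connector_Punctuation}/u.test('_')); // true
console.log(/\p{Other_Punctuation}/u.test('/'));     // true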
But the fundamental answer is: They differ because ASCII's ordering and Unicode collation were designed with different constraints and requirements.

How do I reverse an array in JavaScript in 16 characters or less without .reverse()?

I'm trying to solve a challenge on Codewars which requires you to reverse an array in JavaScript, in 16 characters or less. Using .reverse() is not an option.
The maximum number of characters allowed in your code is 28, which includes the function name weirdReverse, so that leaves you with just 16 characters to solve it in. The constraint -
Your code needs to be as short as possible, in fact not longer than 28 characters
Sample input and output -
Input: an array containing data of any types. Ex: [1,2,3,'a','b','c',[]]
Output: [[],'c','b','a',3,2,1]
The starter code given is -
weirdReverse=a=>
My solution (29 characters) is -
weirdReverse=a=>a.sort(()=>1)
which of course fails -
Code length should less or equal to 28 characters.
your code length = 29 - Expected: 'code length <= 28', instead got: 'code length > 28'
I'm not sure what else to truncate here.
Note - I did think about posting this question on CodeGolf SE, but I felt it wouldn't be a good fit there, due to the limited scope.
I'd like to give you a hint, without giving you the answer:
You're close, but you can save characters by not using something you need to add in your code.
By adding the thing you won't use, you can remove ().
Spoiler (answer):
// Note: this only really works for this specific case.
// Never EVER use this in a real-life scenario.
var a = [1,2,3,'a','b','c',[]]
weirdReverse=a=>a.sort(x=>1)
// ^ That's 1 character shorter than ()
console.log(weirdReverse(a))

Hard JavaScript interview question

Hi everybody.
Several days ago an interviewer asked me a question, and I couldn't answer it. Maybe there is some JS guru on this site. =)
We have just one string: VARNAME[byte][byte][byte][byte], where each [byte] is a place for one character.
Question: how can this be written as correct JS if each pair of [byte][byte], read as hex, MUST BE NOT MORE than 1000 in decimal?
I tried the following:
1) VARNAME[20][3D][09][30], which is equal to
2) VARNAME<space>=1<space>, and that is correct JS code, BUT!
3) 0x203D = 8253 in decimal, which is not correct, it must be <= 1000
0x0930 = 2352, also not correct, it must be <= 1000!
I tried replacing the 20 with 09, then:
0x093D = 2365, which is a bit better, but still more than 1000 =(
How can I make this work? The interviewer says it is possible because the characters can be anything (I mean
varname;<space><space><space> etc.), but he could not tell me the answer.
Can anyone figure it out, guys?
The question as described has no answer.
The lowest code point that can appear in an expression context after a variable reference is \u0009 (tab), which, as you point out, results in a pair value greater than 1000 (>= 2304). The ECMAScript 5 specification requires a JavaScript environment to generate an early error when an invalid character is encountered. The only things legal here are an identifier continuation character or an InputElementDiv, which is either WhiteSpace, LineTerminator, Comment, Token, or DivPunctuator, none of which allow code points in the range \u0000-\u0003, which is what the first byte of a pair would have to be for the question to have an answer (any pair whose first byte is \u0004 or higher is at least 0x0400 = 1024).
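To make the arithmetic concrete, here is a quick sketch (the pairValue helper is just mine, for illustration):
const pairValue = (a, b) => a.charCodeAt(0) * 256 + b.charCodeAt(0);
console.log(pairValue(' ', '='));   // 0x203D = 8253
console.log(pairValue('\t', '='));  // 0x093D = 2365
console.log(pairValue('\t', '\t')); // 0x0909 = 2313
// A pair only stays at or below 1000 (0x03E8) if its first byte is 0x00-0x03,
// and none of those characters are legal after an identifier.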
There are some environments that terminate parsing when a \u0000 is encountered (the C end-of-string character), but those do not conform to ES5 in this respect.
The statement that JavaScript allows any character in this position is simply wrong.
This all changes if VARNAME is in a string or a regular expression, however, which can both contain characters in the range \u0000-\u0003. If this is the trick the interviewer was looking for, I can only say that it was an unfair question.
Remember, in an interview, you are interviewing the company as much, or more, than the company is interviewing you. I would have serious reservations about joining a company that considers such a question a valid question to use in an interview.

JavaScript strings - UTF-16 vs UCS-2?

I've read in some places that JavaScript strings are UTF-16, and in other places they're UCS-2. I did some searching around to try to figure out the difference and found this:
Q: What is the difference between UCS-2 and UTF-16?
A: UCS-2 is obsolete terminology which refers to a Unicode
implementation up to Unicode 1.1, before surrogate code points and
UTF-16 were added to Version 2.0 of the standard. This term should now
be avoided.
UCS-2 does not define a distinct data format, because UTF-16 and UCS-2
are identical for purposes of data exchange. Both are 16-bit, and have
exactly the same code unit representation.
Sometimes in the past an implementation has been labeled "UCS-2" to
indicate that it does not support supplementary characters and doesn't
interpret pairs of surrogate code points as characters. Such an
implementation would not handle processing of character properties,
code point boundaries, collation, etc. for supplementary characters.
via: http://www.unicode.org/faq/utf_bom.html#utf16-11
So my question is: is it because the JavaScript string object's methods and indexes act on 16-bit data values instead of characters that some people consider it UCS-2? And if so, would a JavaScript string object oriented around characters instead of 16-bit data chunks be considered UTF-16? Or is there something else I'm missing?
Edit: As requested, here are some sources saying JavaScript strings are UCS-2:
http://blog.mozilla.com/nnethercote/2011/07/01/faster-javascript-parsing/
http://terenceyim.wordpress.com/tag/ucs2/
EDIT: For anyone who may come across this, be sure to check out this link:
http://mathiasbynens.be/notes/javascript-encoding
JavaScript, or strictly speaking ECMAScript, pre-dates Unicode 2.0, so in some cases you may find references to UCS-2 simply because that was correct at the time the reference was written. Can you point us to specific citations of JavaScript being "UCS-2"?
Specifications for ECMAScript versions 3 and 5 at least both explicitly declare a String to be a collection of unsigned 16-bit integers and that if those integer values are meant to represent textual data, then they are UTF-16 code units. See
section 8.4 of the ECMAScript Language Specification in version 5.1
or section 6.1.4 in version 13.0.
EDIT: I'm no longer sure my answer is entirely correct. See the excellent article mentioned above, which in essence says that while a JavaScript engine may use UTF-16 internally, and most do, the language itself effectively exposes those characters as if they were UCS-2.
It's UTF-16/UCS-2. It can handle surrogate pairs, but charAt/charCodeAt return a 16-bit code unit rather than the Unicode code point. If you want it to handle surrogate pairs, I suggest a quick read through this.
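To illustrate the code unit vs. code point distinction (a sketch of mine, not from the original answer):
const s = '😀'; // U+1F600, stored as the surrogate pair D83D DE00
console.log(s.length);                      // 2
console.log(s.charCodeAt(0).toString(16));  // 'd83d' (high surrogate)
console.log(s.codePointAt(0).toString(16)); // '1f600' (full code point)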
It's just a 16-bit value with no encoding specified in the ECMAScript standard.
See section 7.8.4 String Literals in this document: http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf
Things have changed since 2012. JavaScript strings are now UTF-16 for real. Yes, the old string methods still work on 16-bit code units, but the language is now aware of UTF-16 surrogates and knows what to do about them if you use the string iterator. There's also Unicode regex support.
// Before
"😀😂💩".length // 6
// Now
[..."😀😂💩"].length // 3
[..."😀😂💩"] // [ '😀', '😂', '💩' ]
[... "😀😂💩".matchAll(/./ug) ] // 3 matches as above
// Regexes support unicode character classes
"café".normalize("NFD").match(/\p{L}\p{M}/ug) // [ 'é' ]
// Extract code points
[..."😀😂💩"].map(char => char.codePointAt(0).toString(16)) // [ '1f600', '1f602', '1f4a9' ]
