Is JavaScript string comparison just as fast as number comparison?

I'd like to write a little library for JavaScript enums. To do that, I need to decide how to store the enum values. I'd like to use the fastest option for comparison, but I also want something that is debuggable, so I'm torn between using strings and numbers. I know I could use objects too, but that would be another question.
For example:
// I don't want this because when debugging, you'd see just the value 0
var Planets = {Earth: 0, Mars: 1, Venus: 2};
// I'd prefer this so that Planets.Earth gives me a nice readable value ("Earth")
var Planets = {Earth: 'Earth', Mars: 'Mars'};
But I'm afraid that when I compare them using if (myPlanet === Planets.Earth), the string comparison could take a lot longer (say, if it were in a tight loop). This should be the case because the spec (http://ecma-international.org/ecma-262/5.1/#sec-11.9.6) says:
If Type(x) is String, then return true if x and y are exactly the same sequence of characters (same length and same characters in corresponding positions); otherwise, return false.
But when I wrote a test case (http://jsperf.com/string-comparison-versus-number-comparison/2), I found that they take the same amount of time, so it doesn't seem like the engine is scanning the whole string.
I know this could be a micro-optimization, but my question is: is string equality comparison done using pointers, and therefore just as fast as number equality comparison?

String comparison could be "just as fast" (depending on implementation and values) - or it could be "much slower".
The ECMAScript specification describes the semantics, not the implementation. The only way to Know for Certain is to create an applicable performance benchmark and run it on a particular implementation.
Trivially, and I expect this is the case [1], the effects of string interning for a particular implementation are being observed.
That is, all string values (not String objects) from literals can be trivially interned into a pool such that implIdentityEq("foo", "foo") is true - that is, there need only be one string object. Such interning can be done after constant folding, such that "f" + "oo" -> "foo" - again, per a particular implementation, as long as it upholds the ECMAScript semantics.
If such interning is done, then for implStringEq the first check could be to evaluate implIdentityEq(x,y) and, if true, the comparison is trivially-true and performed in O(1). If false, then a normal string character-wise comparison would need to be done which is O(min(n,m)).
(Immediate falseness can also be determined with x.length != y.length, but that seems less relevant here.)
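A runnable sketch of the fast path just described, with implIdentityEq stubbed out (inside an engine it would be a raw pointer comparison on interned strings, which JavaScript code cannot itself observe):
function implIdentityEq(x, y) {
  return x === y; // stand-in: an engine would compare string pointers here
}

function implStringEq(x, y) {
  if (implIdentityEq(x, y)) return true;   // same interned string: O(1)
  if (x.length !== y.length) return false; // fast fail on differing lengths
  for (var i = 0; i < x.length; i++) {     // otherwise scan: O(min(n,m))
    if (x.charCodeAt(i) !== y.charCodeAt(i)) return false;
  }
  return true;
}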
[1] While in the above I argue for string interning being a likely cause, modern JavaScript implementations perform a lot of optimizations - as such, interning is only a small part of the various optimizations and code hoisting that can be (and are) done!
I've created an "intern breaker" jsperf. The numbers agree with the hypothesis presented above.
If a string is interned, then comparison is comparable in performance to testing for "identity" - while this is slower than a numeric comparison, it is still much faster than a character-by-character string comparison.
Qualifying the above assertion: IE10 does not appear to use object identity as a fast pass for string comparisons, although it does use a fast-fail length check.
In Chrome and Firefox, two interned strings which are not equal are compared as quickly as two that are - there is likely a special case for comparing two different interned strings.
Even for small strings (length = 8), interning can be much faster. IE10 again shows it doesn't have this "optimization" even though it appears to have an efficient string comparison implementation.
The string comparison can fail as soon as the first different character is encountered: even comparing long strings of equal length might only compare the first few characters.
Do common JavaScript implementations use string interning? (but no references given)
Yes. In general, any literal string, identifier, or other constant string in JS source is interned. However, implementation details (exactly what is interned, for instance) vary, as does when the interning occurs.
See JS_InternString. (FF does have string interning, although where/how strings are implicitly interned from JavaScript, I know not.)

There are cases where string comparison can be much slower (when comparing dynamically generated strings).
The following is 77% slower (in Chrome and IE) than all the other tests:
// Setup assumed from the jsperf test (the iteration count is illustrative):
var ITERATIONS = 100000;
var StringPlanets = {Earth: 'Earth', Mars: 'Mars', Venus: 'Venus'};
var x;
var StringEarth = 'Ear' + 'th'; // built at runtime, not straight from one literal
for (var i = 0; i < ITERATIONS; i++) {
  x = StringPlanets.Venus === StringEarth;
}
The flaw in the tests mentioned in the question is that they test against literal strings. It seems that JavaScript is optimized so that string comparison for string literals is done just by comparing a pointer. This can be observed by creating the strings dynamically. My best guess is that strings from the literal string pool are marked so that they can be compared using addresses only.
Note that string comparison seems just as fast in FF even for dynamic strings - and also that it's just as slow even for literal strings.
Conclusion: all browsers behave differently, so string comparison may or may not be slower.

In general, at best, string interning (making a string with a given value into a unique reference, or an O(1)-comparable symbol) is going to take O(n) time, as it can't do that effectively without looking at all the characters involved.
The question of relative efficiency then amounts to over how many comparisons the interning is going to be amortized.
In the limit, a very clever optimizer can pull out static expressions which build up strings and intern them once.
Some of the tests above use strings which will have been interned, in which case the comparison is potentially O(1). In the case where enums are based on mapping to integers, comparison will be O(1) in any implementation.
The expensive comparison cases arise when at least one of the operands is a truly dynamic string. In this case it is impossible to compare equality against it in less than O(n).
As applied to the original question, if the desire is to create something akin to an enum in a different language, the main concern is to ensure that the interning is done in only a few places. As pointed out above, different browsers use different implementations, so that can be tricky - and, in IE10, it may be impossible.
Caveat: lacking string interning (in which case you need the integer version of the enum implementation given above), @JuanMendes' string-based enum implementations will be essentially O(1) if he arranges for the value of the myPlanet variable to be set in O(1) time. If it is set using Planets.value, where value is an established planet, it will be O(1).
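To make that concrete, here is a minimal sketch of the pattern (using the string-based Planets enum from the question; the fast path noted in the comments is the interning behaviour argued for above):
var Planets = {Earth: 'Earth', Mars: 'Mars', Venus: 'Venus'};

// Assign enum values only from the enum object itself...
var myPlanet = Planets.Earth; // O(1)

// ...so that, where the engine interns literals, this comparison can hit the
// O(1) identity fast path instead of a character-by-character scan.
if (myPlanet === Planets.Earth) {
  console.log('on Earth');
}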

Related

In JavaScript, should stringified numbers be converted to integers before subtracting them? (best practices)

I recently wrote some code that grabs text from two separate elements and then subtracts them. To my surprise, I didn't have to convert them to integers first. I did a little looking around, and it seems JavaScript converts strings to numbers when using the subtraction operator. Is it OK to leave them as strings, or should they first be converted to integers as a best practice? And if so, why? Thank you.
example:
"10" - "6" = 4
[…] to my surprise, I didn't have to convert them to integers first […]
A couple more surprises, then:
JavaScript resolves many operations (such as arithmetic) by implicitly coercing incompatible values to a different type, instead of raising an exception. This makes JavaScript “weakly typed” to that extent.
There is no integer type built into JavaScript; the only number type is IEEE-754 floating-point.
So, your string values were coerced to floating-point values, in the context of the arithmetic operation. JavaScript didn't tell you this was happening.
This is a source of bugs that can remain hidden for a long time, because as long as your string values successfully convert to numbers, the operation succeeds, even where you might have expected those values to raise an error.
js> "1e15" - "0x45" // The reader might have expected this to raise an error.
999999999999931
The brief “Wat” presentation by Gary Bernhardt is packed with other surprising (and hilarious) results of JavaScript's implicit type coercion.
Is it OK to leave them as strings, or should they first be converted to [numbers] as a best practice?
Yes, in my opinion you should do arithmetic only on explicitly-converted numbers, because (as you discovered) for newcomers reading the code, the implicit coercion rules are not always obvious.
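For example, a minimal sketch of converting explicitly before subtracting (the variable names are illustrative):
// Convert explicitly, then do arithmetic on numbers rather than strings.
var a = Number("10"); // 10
var b = Number("6");  // 6
console.log(a - b);   // 4

// A failed conversion yields NaN, which can be handled up front:
var c = Number("abc"); // NaN
if (isNaN(c)) {
  console.log("not a number - handle the bad input explicitly");
}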

parsing international numerals

I'm trying to figure out how my script will behave if rendered in a browser using a Chinese (or other) locale with Chinese numerals (or another non-Latin symbol set). I can't seem to find any info on this on the interwebs.
Looking at the page
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number/toLocaleString
we see examples of localized numbers when converting from number to string, but what about the other way around? I tried parseInt("一二三") in IE11 debug console which returned NaN, but I'm not using Chinese Windows. Could someone test this?
My confusion comes from JavaScript having loosely typed data, so what if I end up running into an implicit string-to-number conversion, such as this:
var a = "١٢٣"; // Arabic-Indic digits for 123
var b = .01;
console.log(a*b); // NaN - the implicit ToNumber conversion only understands ASCII digits
Mind you my variables a and b could come from user input in a more complex example. How can you make sure that input coming from a non-Latin symbology is converted to the right number-representation internally before you do arithmetic if parseInt and implicit conversion don't work?
It won't work, for several reasons. Notice firstly that while there is a toLocaleString, there is no parseLocalStringInt or fromLocaleString. Secondly, JavaScript only really does implicit type coercion when particular operators are used, e.g. ==. The * operator does coerce its operands to numbers, but that coercion only understands ASCII digits, so it can't be used in this fashion either; even == and the other operators only support very limited coercion in comparison with what you are describing.
This coercion can still be very dangerous or useful, depending on your point of view, e.g.:
0 == false is true
but 0 === false is false. Either way, it certainly isn't as powerful as you might think it is.
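One workaround - not part of the answer above, just a hedged sketch - is to map non-Latin digits to their ASCII equivalents yourself before parsing:
// Map Arabic-Indic digits (U+0660..U+0669) to ASCII '0'-'9', then parse.
// The helper name is illustrative, not a standard API.
function parseArabicIndic(str) {
  var ascii = str.replace(/[\u0660-\u0669]/g, function (d) {
    return String.fromCharCode(d.charCodeAt(0) - 0x0660 + 0x30);
  });
  return parseInt(ascii, 10);
}
console.log(parseArabicIndic("١٢٣")); // 123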

Is Javascript substring virtual?

Suppose we have a huge string named str1, say 5 million characters long, and then str2 = str1.substr(5555, 100), so that str2 is 100 characters long and is a substring of str1 starting at position 5555 (or any other randomly selected position).
How does JavaScript store str2 internally? Is the string content copied, or is the new string a sort of virtual string that stores only a reference to the original string plus values for position and size?
I know this is implementation-dependent; the ECMAScript standard (probably) does not define what's under the hood of the string implementation. But I want to know from some expert who knows V8 or SpiderMonkey from the inside well enough to clarify this.
Thank you
AFAIK V8 has four string representations:
ASCII
UTF-16
concatenation of multiple strings
slice of another string
Adventures in the land of substrings and RegExps has great explanations and illustrations.
Thus, it does not have to copy the string; it just has to store beginning and ending markers into the other string.
SpiderMonkey does the same thing. (See Large substrings ~9000x faster in Firefox than Chrome: why? ... though the answer for Chrome is outdated.)
This can give real speed boosts, but sometimes this is undesirable, since it can cause small strings to hold onto the memory of the larger parent string (V8 bug report)
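One commonly cited mitigation - engine-dependent folklore rather than documented behaviour - is to force a flat copy of the slice so that the parent can be collected:
var huge = new Array(5000001).join('x'); // ~5 million characters
var slice = huge.substr(5555, 100);      // may be stored as a slice of `huge`
var flat = (' ' + slice).slice(1);       // concatenation typically forces a copy
huge = null;                             // parent becomes collectable if only
                                         // `flat` is kept around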
This old blog post of mine explains it, as well as some other string representation forms: https://web.archive.org/web/20170607033600/http://blog.cdleary.com:80/2012/01/string-representation-in-spidermonkey/
Search for "dependent string". I think I know what you might be getting at with the question: they can be problematic things, at times, because if there are no references to the original, you can keep a giant string around in order to keep a bitty little substring that's actually semantically reachable. There are things that an implementation could do to mitigate that problem, like record information on a GC-generation basis to see if such one-dependent-string entities exist and collapse them to their minimal size, but last I knew of that was not being done. (Essentially with that kind of approach you're recovering runtime_refcount == 1 style information at GC-sweep time.)

Is the advantage of Typed Arrays in JavaScript that they work the same as (or similar to) arrays in C?

I've been playing around with Typed Arrays in JavaScript.
var buffer = new ArrayBuffer(16);       // 16 bytes of raw memory
var int32View = new Int32Array(buffer); // viewed as four 32-bit integers
I imagine normal arrays ([1, 257, true]) in JavaScript have poor performance because their values could be of any type, therefore, reaching an offset in memory is not trivial.
I originally thought that JavaScript array subscripts worked the same as objects (as they have many similarities), and were hash map based, requiring a hash based lookup. But I haven't found much credible information to confirm this.
So, I'd assume the reason why Typed Arrays perform so well is because they work like normal arrays in C, where they're always typed. Given the initial code example above, and wishing to get the 10th value in the typed array...
var value = int32View[10];
The type is Int32, so each value must consist of 32 bits or 4 bytes.
The subscript is 10.
So the location in memory of that value is <array offset> + (4 * 10), and then 4 bytes are read to get the value.
I basically just want to confirm my assumptions. Are my thoughts on this correct? If not, please elaborate.
I checked out the V8 source to see if I could answer it myself, but my C is rusty and I'm not too familiar with C++.
Typed Arrays were designed by the WebGL standards committee, for performance reasons. Typically Javascript arrays are generic and can hold objects, other arrays and so on - and the elements are not necessarily sequential in memory, like they would be in C. WebGL requires buffers to be sequential in memory, because that's how the underlying C API expects them. If Typed Arrays are not used, passing an ordinary array to a WebGL function requires a lot of work: each element must be inspected, the type checked, and if it's the right thing (e.g. a float) then copy it out to a separate sequential C-like buffer, then pass that sequential buffer to the C API. Ouch - lots of work! For performance-sensitive WebGL applications this could cause a big drop in the framerate.
On the other hand, like you suggest in the question, Typed Arrays use a sequential C-like buffer already in their behind-the-scenes storage. When you write to a typed array, you are indeed assigning to a C-like array behind the scenes. For the purposes of WebGL, this means the buffer can be used directly by the corresponding C API.
Note your memory address calculation isn't quite enough: the browser must also bounds-check the array, to prevent out-of-range accesses. This has to happen with any kind of Javascript array, but in many cases clever Javascript engines can omit the check when it can prove the index value is already within bounds (such as looping from 0 to the length of the array). It also has to check the array index is really a number and not a string or something else! But it is in essence like you describe, using C-like addressing.
BUT... that's not all! In some cases clever Javascript engines can also deduce the type of ordinary Javascript arrays. In an engine like V8, if you make an ordinary Javascript array and only store floats in it, V8 may optimistically decide it's an array of floats and optimise the code it generates for that. The performance can then be equivalent to typed arrays. So typed arrays aren't actually necessary to reach maximum performance: just use arrays predictably (with every element the same type) and some engines can optimise for that as well.
So why do typed arrays still need to exist?
Optimisations like deducing the type of arrays are really complicated. If V8 deduces that an ordinary array has only floats in it, and you then store an object in an element, it has to de-optimise and regenerate code that makes the array generic again. It's quite an achievement that all this works transparently. Typed Arrays are much simpler: they're guaranteed to be one type, and you just can't store other things like objects in them.
Optimisations are never guaranteed to happen; you may store only floats in an ordinary array, but the engine may decide for various reasons not to optimise it.
The fact that they're much simpler means other, less sophisticated JavaScript engines can easily implement them. They don't need all the advanced deoptimisation support.
Even with really advanced engines, proving optimisations can be used is extremely difficult and can sometimes be impossible. A typed array significantly simplifies the level of proof the engine needs to be able to optimise around it. A value returned from a typed array is certainly of a certain type, and engines can optimise for the result being that type. A value returned from an ordinary array could in theory have any type, and the engine may not be able to prove it will always have the same type result, and therefore generates less efficient code. Therefore code around a typed array is more easily optimised.
Typed arrays remove the opportunity to make a mistake. You just can't accidentally store an object and suddenly get far worse performance.
So, in short, ordinary arrays can in theory be equally fast as typed arrays. But typed arrays make it much easier to reach peak performance.
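As a hedged illustration of the de-optimisation hazard described above (the exact heuristics vary by engine and version):
var arr = [];
for (var i = 0; i < 1000; i++) {
  arr.push(i * 1.5); // all doubles - the engine may specialise the array
}

arr[0] = {x: 1}; // a single object store can force the array back to a
                 // generic representation, discarding the specialised code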
Yes, you are mostly correct. With a standard JavaScript array, the JavaScript engine has to assume that the data in the array is all objects. It can still store this as a C-like array/vector, where the access to the memory is still like you described. The problem is that the data is not the value, but something referencing that value (the object).
So, performing a[i] = b[i] + 2 requires the engine to:
access the object in b at index i;
check what type the object is;
extract the value out of the object;
add 2 to the value;
create a new object with the newly computed value from step 4;
assign the new object from step 5 into a at index i.
With a typed array, the engine can:
access the value in b at index i (including placing it in a CPU register);
increment the value by 2;
assign the new value from step 2 into a at index i.
NOTE: These are not the exact steps a JavaScript engine will perform, as that depends on the code being compiled (including surrounding code) and the engine in question.
This allows the resulting computations to be much more efficient. Also, the typed arrays have a memory layout guarantee (arrays of n-byte values) and can thus be used to directly interface with data (audio, video, etc.).
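A minimal sketch of the typed-array version of the a[i] = b[i] + 2 loop from the steps above (the size and fill values are illustrative):
var N = 1024;
var a = new Float64Array(N);
var b = new Float64Array(N);
for (var i = 0; i < N; i++) b[i] = i * 0.5;

// Each element is a raw 64-bit float at a fixed offset, so the engine can
// compile this loop down to loads, an add, and a store - no per-element
// unboxing and no object allocation.
for (var i = 0; i < N; i++) {
  a[i] = b[i] + 2;
}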
When it comes to performance, things can change fast. As AshleysBrain says, it comes down to whether the VM can deduce that a normal array can be implemented as a typed array quickly and accurately. That depends on the particular optimizations of the particular JavaScript VM, and it can change in any new browser version.
This Chrome developer comment provides some guidance that worked as of June 2012:
Normal arrays can be as fast as typed arrays if you do a lot of sequential access. Random access outside the bounds of the array causes the array to grow.
Typed arrays are fast for access, but slow to be allocated. If you create temporary arrays frequently, avoid typed arrays. (Fixing this is possible, but it's low priority.)
Micro-benchmarks such as JSPerf are not reliable for real-world performance.
If I might elaborate on the last point, I've seen this phenomenon with Java for years. When you test the speed of a small piece of code by running it over and over again in isolation, the VM optimizes the heck out of it. It makes optimizations which only make sense for that specific test. Your benchmark can get a hundredfold speed improvement compared to running the same code inside another program, or compared to running it immediately after running several different tests that optimize the same code differently.
I'm not really a contributor to any JavaScript engine and have only done some reading on V8, so my answer might not be completely true:
Values in arrays (only normal arrays with no holes/gaps - sparse arrays are treated as objects) are all either pointers or numbers with a fixed length (in V8 they are 32 bits: a value tagged with a 0 bit at the end is a 31-bit integer; otherwise it's a pointer).
So I don't think finding the memory location is any different from a typed array, since the number of bytes is the same all over the array. But the difference is that if it's an object, you have to add one unboxing layer, which doesn't happen for typed arrays.
And of course, accessing typed arrays definitely doesn't involve the type checks that a normal array has (though those might be removed in highly optimized code, which is only generated for hot code).
For writing, if it's the same type, it shouldn't be much slower. If it's a different type, the JS engine might generate polymorphic code for it, which is slower.
You can also try making some benchmarks on jsperf.com to confirm.

Javascript Performance: How come looping through an array and checking every value is faster than indexOf, search and match?

This came as a huge surprise to me, and I'd like to understand this result. I made a test on jsperf that basically takes a string (part of a URL that I'd like to check) and checks for the presence of 4 items (which are, in fact, present in the string).
It checks in 5 ways:
plain indexOf;
split the string, then indexOf;
regex search;
regex match;
split the string, loop through the array of items, and then check if any of them matches the things it's supposed to match.
To my huge surprise, number 5 is the fastest in Chrome 21. This is what I can't explain.
In Firefox 14, the plain indexOf is the fastest, that one I can believe.
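For reference, a hedged sketch of the two extremes from the list above - plain indexOf (1) versus split-and-loop (5) - using placeholder strings rather than the exact jsperf inputs:
var haystack = "foo/bar/baz/qux"; // stand-in for the URL fragment
var items = ["foo", "bar", "baz", "qux"];

// 1. Plain indexOf over the whole string.
var found1 = items.every(function (item) {
  return haystack.indexOf(item) !== -1;
});

// 5. Split the string, then loop and compare each part with ===.
var parts = haystack.split("/");
var found5 = items.every(function (item) {
  for (var i = 0; i < parts.length; i++) {
    if (parts[i] === item) return true;
  }
  return false;
});
console.log(found1, found5); // true true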
I'm also surprised, but Chrome uses V8, a highly optimized JavaScript engine which pulls all kinds of tricks. And the folks at Google probably have the largest set of JavaScript to run to test the performance of their implementation. So my guess is this happens:
The compiler notices that the array is a string array (the type can be determined at compile time; no runtime checks are necessary).
In the loop, since you use ===, built-in CPU opcodes for comparing strings (repe cmpsb) can be used. So no functions are being called (unlike in every other test case).
After the first loop, everything important (the array, the strings to compare against) is in CPU caches. Locality rulez them all.
All the other approaches need to invoke functions, and locality might be an issue for the regexp versions because they build a parse tree.
I have added two more tests : http://jsperf.com/finding-components-of-a-url/2
The single RegExp is fastest now (on Chrome). Also, RegExp literals are faster than string literals converted to RegExp.
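For illustration, the two RegExp forms being compared (the URL and pattern here are placeholders, not the ones from the jsperf):
var url = "http://example.com/path?query=1#fragment";

// RegExp literal: compiled once, when the script is parsed.
var re1 = /query|fragment/;

// String converted to a RegExp: constructed at runtime each time it's evaluated.
var re2 = new RegExp("query|fragment");

console.log(re1.test(url)); // true
console.log(re2.test(url)); // true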
