Is Javascript substring virtual? - javascript

Suppose we have a huge string named str1, say 5 million characters long, and then str2 = str1.substr(5555, 100), so that str2 is 100 characters long and is a substring of str1 starting at position 5555 (or any other randomly selected position).
How does JavaScript store str2 internally? Are the string contents copied, or is the new string sort of virtual, storing only a reference to the original string plus values for position and size?
I know this is implementation dependent; the ECMAScript standard (probably) does not define what's under the hood of the string implementation. But I would like to hear from an expert who knows the internals of V8 or SpiderMonkey well enough to clarify this.
Thank you

AFAIK V8 has four string representations:
ASCII
UTF-16
concatenation of multiple strings
slice of another string
Adventures in the land of substrings and RegExps has great explanations and illustrations.
Thus, it does not have to copy the string; it just has to store beginning and ending markers that point into the other string.
SpiderMonkey does the same thing. (See Large substrings ~9000x faster in Firefox than Chrome: why? ... though the answer for Chrome is outdated.)
This can give real speed boosts, but sometimes this is undesirable, since it can cause small strings to hold onto the memory of the larger parent string (V8 bug report)
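To make the "slice of another string" idea concrete, here is a minimal sketch in plain JavaScript. The SlicedString class and its field names are hypothetical and only model the concept; they are not V8's or SpiderMonkey's actual data structures.

```javascript
// Hypothetical model of a "sliced" string: it stores a reference to the
// parent plus an offset and a length, instead of copying characters.
class SlicedString {
  constructor(parent, start, length) {
    this.parent = parent; // reference to the full original string
    this.start = start;   // offset into the parent
    this.length = length; // number of characters in the slice
  }

  // Reading a character is delegated to the parent string.
  charAt(i) {
    return i >= 0 && i < this.length ? this.parent.charAt(this.start + i) : "";
  }

  // Flattening copies the characters out; only after that could the
  // parent be released if nothing else references it.
  flatten() {
    return this.parent.substring(this.start, this.start + this.length);
  }
}

const str1 = "x".repeat(5000000);
const str2 = new SlicedString(str1, 5555, 100);
console.log(str2.length);           // 100
console.log(str2.flatten().length); // 100
```

As long as str2 is reachable in this model, so is the 5-million-character str1, which is the memory-retention downside mentioned above.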

This old blog post of mine explains it, as well as some other string representation forms: https://web.archive.org/web/20170607033600/http://blog.cdleary.com:80/2012/01/string-representation-in-spidermonkey/
Search for "dependent string". I think I know what you might be getting at with the question: they can be problematic things, at times, because if there are no references to the original, you can keep a giant string around in order to keep a bitty little substring that's actually semantically reachable. There are things that an implementation could do to mitigate that problem, like record information on a GC-generation basis to see if such one-dependent-string entities exist and collapse them to their minimal size, but last I knew of that was not being done. (Essentially with that kind of approach you're recovering runtime_refcount == 1 style information at GC-sweep time.)

Related

In Javascript, why define an array with split?

I frequently see code where people define a populated array using the split method, like this:
var colors = "red,green,blue".split(',');
How does this differ from:
var colors = ["red","green","blue"];
Is it simply to avoid having to quote each value?
Splitting a string is a bad way of creating an array. There are several issues with the approach, including performance, stability and memory consumption. It requires CPU time to parse the string, it is prone to errors (double commas, spaces in the string, etc.) and it means your script essentially has to store twice as much data in memory.
It's not a good idea and is most likely just a bad habit someone picked up when they first learned about strings and arrays. That or they're trying to be clever for some kind of coding exercise.
As a rule of thumb, the only time you should be parsing strings into arrays is if you're reading that string data from an external source and need to convert it to native types. If you already know the values ahead of time, you should create the array yourself.
The one possible reason someone might do this is to reduce the number of characters in their source code, trading performance for bandwidth. 'a,b,c,d,e,f,g'.split(',') is fewer characters than ['a','b','c','d','e','f','g'].
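A short sketch of the error cases mentioned above (stray spaces and double commas). The variable names are only for illustration.

```javascript
// The array literal says exactly what it contains.
const literal = ["red", "green", "blue"];

// The split version gives the same result only when the string is clean.
const clean = "red,green,blue".split(",");

// A space after each comma ends up inside the elements.
const spaced = "red, green, blue".split(",");

// A double comma produces an empty string element.
const doubled = "red,,blue".split(",");

console.log(clean);
console.log(spaced);
console.log(doubled);
```

Here spaced[1] is " green" (with a leading space) and doubled[1] is the empty string, neither of which can happen by accident with the literal form.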
There is no difference, it's just bad practice and laziness if anything. The only reason I could think of using the first approach is if the data naturally came in string form and using an array literal made it completely unreadable.

Is JavaScript string comparison just as fast as number comparison?

I'd like to write a little library for JavaScript enums. For me to do that, I need to decide how to store the enum values. Therefore, I'd like to use the fastest way when comparing, but I also want something that is debuggable, so I'm torn between using strings or numbers. I know I could use objects too, but that would be another question
For example
// I don't want this because when debugging, you'd see just the value 0
var Planets = {Earth:0, Mars:1, Venus: 2}
// I'd prefer this so that Planets.Earth gives me a nice readable value ("Earth")
var Planets = {Earth: 'Earth', Mars: 'Mars'}
But I'm afraid that when I compare them using if (myPlanet === Planets.Earth), the string comparison could take a lot longer (say if it were in a tight loop). This should be the case because http://ecma-international.org/ecma-262/5.1/#sec-11.9.6 says
If Type(x) is String, then return true if x and y are exactly the same sequence of characters (same length and same characters in corresponding positions); otherwise, return false.
But when I wrote a test case, I found that they take the same amount of time http://jsperf.com/string-comparison-versus-number-comparison/2 so it doesn't seem like it's scanning the whole string.
I know this could be a micro optimization, but my question is: is string equality comparison done using pointers and therefore just as fast as number equality comparison?
String comparison could be "just as fast" (depending on implementation and values) - or it could be "much slower".
The ECMAScript specification describes the semantics, not the implementation. The only way to Know for Certain is to create an applicable performance benchmark and run it on a particular implementation.
Trivially, and I expect this is the case [1], the effects of string interning for a particular implementation are being observed.
That is, all string values (not String Objects) from literals can be trivially interned into a pool such that implIdentityEq("foo", "foo") is true - that is, there need only one string object. Such interning can be done after constant folding, such that "f" + "oo" -> "foo" - again, per a particular implementation as long as it upholds the ECMAScript semantics.
If such interning is done, then for implStringEq the first check could be to evaluate implIdentityEq(x,y) and, if true, the comparison is trivially-true and performed in O(1). If false, then a normal string character-wise comparison would need to be done which is O(min(n,m)).
(Immediate falseness can also be determined with x.length != y.length, but that seems less relevant here.)
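The comparison strategy described above can be sketched as follows. Since JavaScript code cannot observe the identity of primitive strings, this models engine-internal string objects as plain objects. makeStr, intern and the pool are hypothetical helpers, and implIdentityEq / implStringEq are the illustrative names used in the text, not real engine functions.

```javascript
// Hypothetical engine-internal string object; `chars` holds the characters.
function makeStr(chars) {
  return { chars };
}

// Hypothetical intern pool: one shared object per distinct value.
const pool = new Map();
function intern(chars) {
  if (!pool.has(chars)) pool.set(chars, makeStr(chars));
  return pool.get(chars);
}

// Identity check: are both operands the very same object?
function implIdentityEq(x, y) {
  return x === y;
}

function implStringEq(x, y) {
  // Fast pass: the same object is trivially equal, O(1).
  if (implIdentityEq(x, y)) return true;
  // Fast fail: different lengths can never be equal, O(1).
  if (x.chars.length !== y.chars.length) return false;
  // Slow path: character-wise comparison, stops at the first difference.
  for (let i = 0; i < x.chars.length; i++) {
    if (x.chars.charCodeAt(i) !== y.chars.charCodeAt(i)) return false;
  }
  return true;
}

console.log(implIdentityEq(intern("foo"), intern("foo"))); // true
console.log(implStringEq(intern("foo"), makeStr("foo")));  // true
```

The second call has to take the slow path because makeStr("foo") is a fresh, non-interned object, which mirrors the dynamic string case.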
[1] While in the above I argue for string interning being a likely cause, modern JavaScript implementations perform a lot of optimizations - as such, interning is only a small part of the various optimizations and code hoistings that can be (and are) done!
I've created an "intern breaker" jsperf. The numbers agree with the hypothesis presented above.
If a string is interned then comparison is approximate in performance to testing for "identity" - while it is slower than a numeric comparison, this is still much faster than a character-by-character string comparison.
Holding the above assertion, IE10 does not appear to consider object-identity for pass-fast string comparisons although it does use a fast-fail length check.
In Chrome and Firefox, two intern'ed strings which are not equal are also compared as quickly as two that are - there is likely a special case for comparing between two different interned strings.
Even for small strings (length = 8), interning can be much faster. IE10 again shows it doesn't have this "optimization" even though it appears to have an efficient string comparison implementation.
The string comparison can fail as soon as the first different character is encountered: even comparing long strings of equal length might only compare the first few characters.
Do common JavaScript implementations use string interning? (but no references given)
Yes. In general any literal string, identifier, or other constant string in JS source is interned. However, implementation details (exactly what is interned, for instance) vary, as does when the interning occurs
See JS_InternString (FF does have string interning, although where/how the strings are implicitly interned from JavaScript, I know not)
There are cases when string comparison can be much slower (comparing dynamically generated strings)
The following is 77% slower (in chrome and IE) than all the other tests
var StringEarth = 'Ear' + 'th';
for (var i = 0; i < ITERATIONS; i++) {
x = StringPlanets.Venus === StringEarth;
}
The flaw in the tests mentioned in the question is the fact that we are testing against literal strings. It seems that JavaScript is optimized so that string comparison for string literals is done just by testing a pointer. This can be observed by creating the strings dynamically. My best guess is that strings from the literal string pool are marked so that they can be compared using addresses only.
Note that string comparison seems just as fast in FF even for dynamic strings. Also note that it is just as slow even for literal strings.
Conclusion: all browsers behave differently, so string comparison may or may not be slower.
In general, at best String interning (making a string with a given value into a unique reference or a O(1) comparable symbol) is going to take O(n) time, as it can't do that effectively without looking at all the characters involved.
The question of relative efficiency then amounts to over how many comparisons the interning is going to be amortized.
In the limit, a very clever optimizer can pull out static expressions which build up strings and intern them once.
Some of the tests above use strings which will have been interned, in which case the comparison is potentially O(1). In the case where enums are based on mapping to integers, it will be O(1) in any implementation.
The expensive comparison cases arise when at least one of the operands is a truly dynamic string. In this case it is impossible to compare equality against it in less than O(n).
As applied to the original question, if the desire is to create something akin to an enum in a different language, the only lookout is to ensure that the interning is done in only a few places. As pointed out above, different browsers use different implementations, so that can be tricky, and, as pointed out, in IE10 it may be impossible.
Caveat: unless string interning is lacking (in which case you need the integer version of the enum implementation given), @JuanMendes' string-based enum implementation will be essentially O(1), provided he arranges for the value of the myPlanet variable to be set in O(1) time. If that is set using Planets.value, where value is an established planet, it will be O(1).
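A minimal sketch of the string-based enum discussed in this thread. Object.freeze is an addition not mentioned in the question, and whether the comparison is actually resolved by pointer is up to the engine; the code only shows the usage pattern that keeps values coming from the enum object itself.

```javascript
// String-valued enum: readable in a debugger, frozen so it cannot change.
const Planets = Object.freeze({
  Earth: "Earth",
  Mars: "Mars",
  Venus: "Venus",
});

// Set the variable from the enum itself rather than building the string.
const myPlanet = Planets.Venus;

console.log(myPlanet === Planets.Venus); // true
console.log(myPlanet === Planets.Earth); // false

// A dynamically built string is still equal by value; only the cost of
// the comparison may differ between engines.
const dynamicEarth = ["Ear", "th"].join("");
console.log(dynamicEarth === Planets.Earth); // true
```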

Javascript Performance: How come looping through an array and checking every value is faster than indexOf, search and match?

This came as a huge surprise for me, and I'd like to understand this result. I made a test in jsperf that is basically supposed to take a string (that is part of a URL that I'd like to check) and checks for the presence of 4 items (that are in fact, present in the string).
It checks in 5 ways:
plain indexOf;
Split the string, then indexOf;
regex search;
regex match;
Split the string, loop through the array of items, and then check if any of them matches the things it's supposed to match
To my huge surprise, number 5 is the fastest in Chrome 21. This is what I can't explain.
In Firefox 14, the plain indexOf is the fastest, that one I can believe.
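For reference, the five approaches can be sketched like this. The original jsperf code is not reproduced in the question, so the path string, the items and the function names are assumptions made for illustration.

```javascript
// Hypothetical input: part of a URL and four items known to be present.
const path = "shop/products/shoes/red";
const items = ["shop", "products", "shoes", "red"];
const parts = path.split("/");

// 1. Plain indexOf on the string.
const byIndexOf = items.every((item) => path.indexOf(item) !== -1);

// 2. Split the string, then indexOf on the resulting array.
const bySplitIndexOf = items.every((item) => parts.indexOf(item) !== -1);

// 3. Regex search.
const bySearch = items.every((item) => path.search(new RegExp(item)) !== -1);

// 4. Regex match.
const byMatch = items.every((item) => path.match(new RegExp(item)) !== null);

// 5. Split the string, loop through the array and compare with ===.
function byLoop(parts, items) {
  for (const item of items) {
    let found = false;
    for (let i = 0; i < parts.length; i++) {
      if (parts[i] === item) {
        found = true;
        break;
      }
    }
    if (!found) return false;
  }
  return true;
}

console.log(byIndexOf, bySplitIndexOf, bySearch, byMatch, byLoop(parts, items));
```

All five report that every item is present; they differ only in how much work the engine does to find that out.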
I'm also surprised but Chrome uses v8, a highly optimized JavaScript engine which pulls all kinds of tricks. And the guys at Google probably have the largest set of JavaScript to run to test the performance of their implementation. So my guess is this happens:
The compiler notices that the array is a string array (the type can be determined at compile time, no runtime checks necessary).
In the loop, since you use ===, builtin CPU op codes to compare strings (repe cmpsb) can be used. So no functions are being called (unlike in any other test case)
After the first loop, everything important (the array, the strings to compare against) is in CPU caches. Locality rulez them all.
All the other approaches need to invoke functions and locality might be an issue for the regexp versions because they build a parse tree.
I have added two more tests : http://jsperf.com/finding-components-of-a-url/2
The single regExp is fastest now (on Chrome). Also regExp literals are faster than string literals converted to RegExp.

Large substrings ~9000x faster in Firefox than Chrome: why?

The Benchmark: http://jsperf.com/substringing
So, I'm starting up my very first HTML5 browser-based client-side project. It's going to have to parse very, very large text files into, essentially, an array or arrays of objects. I know how I'm going to go about coding it; my primary concern right now is getting the parser code as fast as I can get it, and my primary testbed is Chrome. However, while looking at the differences between substring methods (I haven't touched JavaScript in a long, long time), I noticed that this benchmark was incredibly slow in Chrome compared to FireFox. Why?
My first assumption is that it has to do with the way FireFox's JS engine would handle string objects, and that for FireFox this operation is simple pointer manipulation, while for Chrome it's actually doing hard copies. But, I'm not sure why Chrome wouldn't do pointer manipulation or why FireFox would. Anyone have some insight?
JSPerf appears to be throwing out my FireFox results, not displaying them on the BrowserScope. For me, I'm getting 9,568,203 ±1.44% Ops/sec on .substr() in FF4.
Edit: So I see a FF3.5 performance result down there actually below Chrome. So I decided to test my pointers hypothesis. This brought me to a 2nd revision of my Substrings test, which is doing 1,092,718±1.62% Ops/sec in FF4 versus 1,195±3.81% Ops/sec in Chrome, down to only 1000x faster, but still an inexplicable difference in performance.
A postscriptum: No, I'm not concerned one lick about Internet Explorer. I'm concerned about trying to improve my skills and getting to know this language on a deeper level.
In the case of SpiderMonkey (the JS engine in Firefox), a substring() call just creates a new "dependent string": a string object that stores a pointer to the thing it's a substring of and the start and end offsets. This is precisely to make substring() fast, and is an obvious optimization given immutable strings.
As for why V8 does not do that... A possibility is that V8 is trying to save space: in the dependent string setup if you hold on to the substring but forget the original string, the original string can't get GCed because the substring is using part of its string data.
In any case, I just looked at the V8 source, and it looks like they just don't do any sort of dependent strings at all; the comments don't explain why they don't, though.
[Update, 12/2013]: A few months after I gave the above answer V8 added support for dependent strings, as Paul Draper points out.
Have you eliminated the reading of .length from your benchmark results?
I believe V8 has a few representations of a string:
1. a sequence of ASCII bytes
2. a sequence of UTF-16 code units.
3. a slice of a string (result of substring)
4. a concatenation of two strings.
Number 4 is what makes string += efficient.
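As a rough illustration of why a concatenation representation makes += cheap, here is a small sketch. The ConsString class is hypothetical and is not V8's real layout; in particular, caching the length in each node is an assumption of this model.

```javascript
// Hypothetical "cons" string: concatenation just records two children.
class ConsString {
  constructor(left, right) {
    this.left = left;
    this.right = right;
    // Cached in this model, so reading .length is O(1).
    this.length = left.length + right.length;
  }

  // Flatten by walking the tree iteratively and joining once at the end.
  flatten() {
    const pieces = [];
    const stack = [this];
    while (stack.length > 0) {
      const node = stack.pop();
      if (typeof node === "string") {
        pieces.push(node);
      } else {
        stack.push(node.right); // pushed first so the left side is visited first
        stack.push(node.left);
      }
    }
    return pieces.join("");
  }
}

// Each "append" is O(1): no characters are copied until flatten().
let s = "";
for (let i = 0; i < 5; i++) {
  s = new ConsString(s, String(i));
}
console.log(s.length);    // 5
console.log(s.flatten()); // "01234"
```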
I'm just guessing, but if they're trying to pack two string pointers and a length into a small space, they may not be able to cache large lengths with the pointers, so they may end up walking the joined linked list in order to compute the length. This assumes of course that Array.prototype.join creates strings of form (4) from the array parts.
It does lead to a testable hypothesis which would explain the discrepancy even absent buffer copies.
EDIT:
I looked through the V8 source code and StringBuilderConcat is where I would start pulling, especially runtime.cc.

Performance question: String.split and then walk on the array, or RegExp?

I'll do some work on a line separated string. Which one will be faster, to split the text via String.split first and then walk on the resultant array or directly walk the whole text via a reg exp and construct the final array on the way?
Well, the best way to get your answer is to just take 2 minutes and write a loop that does it both ways a thousand times and check firebug to see which one is faster ;)
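Such a loop could look roughly like the sketch below. The sample text, the function names and the use of Date.now() for timing are all assumptions; real numbers depend heavily on the engine and on the input.

```javascript
// Hypothetical input: 1000 non-empty lines.
const text = Array.from({ length: 1000 }, (_, i) => "line " + i).join("\n");

// Approach A: split first, then walk the resulting array.
function viaSplit(input) {
  const out = [];
  const lines = input.split("\n");
  for (let i = 0; i < lines.length; i++) out.push(lines[i]);
  return out;
}

// Approach B: walk the text with a global RegExp and build the array.
// Note: this pattern skips empty lines, unlike split.
function viaRegExp(input) {
  const out = [];
  const re = /[^\n]+/g;
  let m;
  while ((m = re.exec(input)) !== null) out.push(m[0]);
  return out;
}

// Run a function `runs` times and return the elapsed milliseconds.
function time(fn, runs) {
  const start = Date.now();
  for (let i = 0; i < runs; i++) fn(text);
  return Date.now() - start;
}

console.log("split:", time(viaSplit, 1000), "ms");
console.log("regexp:", time(viaRegExp, 1000), "ms");
```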
I've had to optimize a lot of string munging while working on MXHR and in my experience, plain String methods are significantly faster than RegExps in current browsers. Use RegExps on the shortest Strings possible and do everything you possibly can with String methods.
For example, I use this little number in my current code:
var mime = mimeAndPayload.shift().split('Content-Type:', 2)[1].split(";", 1)[0].replace(' ', '');
It's ugly as hell, but believe it or not it's significantly faster than the equivalent RegExp under high load.
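For comparison, here is the same extraction next to one possible RegExp version. The shape of mimeAndPayload is an assumption (the answer does not show it), and the two are not strictly equivalent for every input, since replace(' ', '') removes only the first space.

```javascript
// Hypothetical input: a header line followed by the payload.
const mimeAndPayload = ["Content-Type: text/html; charset=utf-8", "<p>payload</p>"];
const header = mimeAndPayload[0];

// String methods, as in the answer: take what follows "Content-Type:",
// cut at the first ";", then drop the first space.
const mimeViaString = header
  .split("Content-Type:", 2)[1]
  .split(";", 1)[0]
  .replace(" ", "");

// One possible RegExp equivalent: capture everything up to the first ";".
const mimeViaRegExp = header.match(/Content-Type:\s*([^;]*)/)[1];

console.log(mimeViaString); // "text/html"
console.log(mimeViaRegExp); // "text/html"
```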
While this is 2½ years late, hopefully this helps shed some light on the matter for any future viewers: http://jsperf.com/split-join-vs-regex-replace (Includes benchmarks results for multiple browsers, as well the functional benchmark code itself)
I expect that using split() will be much faster. It depends upon many specifics, number of lines vs. length, complexity of regex, etc.
