Basically, my question is about how JavaScript handles regex literals.
Number, string and boolean literals are primitive values, with corresponding Number, String and Boolean wrapper objects and seamless conversion between the two. By contrast, are regex literals anonymous instances of the RegExp object, or are they treated like primitive data with seamless type conversion to RegExp?
"The complete Reference Javascript, 2nd edition, Powell and Schneider (MH)" contradicts itself: in one place the authors say that /regex/ is automatically converted to a RegExp when needed, and in another they say that /regex/ is nothing but an instance of RegExp!
EDIT: Please provide a reference to a reliable source
Here's what the spec (ECMAScript 3rd edition, section 7.8.5) has to say:
A regular expression literal is an input element that is converted to a RegExp object when it is scanned. The object is created before evaluation of the containing program or function begins. Evaluation of the literal produces a reference to that object; it does not create a new object. Two regular expression literals in a program evaluate to regular expression objects that never compare as === to each other even if the two literals' contents are identical.
There is no primitive regex type that autoboxes to an object in the same way as string or number.
Note, however, that not all browsers implement the "instantiate-once-per-literal" behavior: Safari and IE6 (and possibly later versions) do not, so portable code shouldn't depend on it. The abortive ECMAScript 4 draft would have changed the behavior to match those browsers:
In ES3 a regular expression literal like /a*b/mg denotes a single unique RegExp object that is created the first time the literal is encountered during evaluation. In ES4 a new RegExp object is created every time the literal is encountered during evaluation.
Also, some browsers (Firefox <3, Safari) report typeof /regex/ as "function", so portable code should avoid typeof on RegExp instances; stick with instanceof.
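To make the "object, not autoboxed primitive" point concrete, here is a small sketch. It assumes a reasonably modern engine; the variable names and the `note` property are made up for illustration.

```javascript
// A regex literal evaluates to a RegExp object, not to a primitive.
const re = /ab+c/i;
console.log(re instanceof RegExp);                // true
console.log(Object.prototype.toString.call(re));  // "[object RegExp]"

// Being an object, it keeps properties that are assigned to it...
re.note = "kept";
console.log(re.note);                             // "kept"

// ...whereas a primitive string is only boxed temporarily, so the
// assignment is lost (or throws a TypeError in strict mode).
const str = "ab+c";
try {
  str.note = "lost";
} catch (e) {
  // strict mode / ES modules end up here
}
console.log(str.note);                            // undefined
console.log(typeof str, typeof new String(str));  // "string" "object"
```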
Yes, the following two expressions are equivalent:
var r1 = /ab+c/i,
    r2 = new RegExp("ab+c", "i");
The constructor property of both points to the RegExp constructor function:
(/ab+c/i).constructor === RegExp // true
r2.constructor === RegExp // true
And a regexp literal is an instance of RegExp:
/ab+c/i instanceof RegExp // true
The basic difference is that defining regular expressions using the constructor function allows you to build and compile an expression from a string. This can be very useful for constructing complex expressions that will be re-used.
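As a rough illustration of building an expression from a string at runtime, something like the following could work. Both `escapeRegExp` and `makeWordMatcher` are hypothetical helpers written for this sketch, not built-ins.

```javascript
// Hypothetical helper: escape characters that are special inside a pattern,
// so arbitrary text can be embedded literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a case-insensitive whole-word matcher
// from a string that is only known at runtime.
function makeWordMatcher(word) {
  return new RegExp("\\b" + escapeRegExp(word) + "\\b", "i");
}

const matcher = makeWordMatcher("a.b");
console.log(matcher.test("see A.B here")); // true
console.log(matcher.test("see axb here")); // false: the dot was escaped
```

A literal cannot do this, because its pattern is fixed in the source text.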
Yes, new RegExp("something", "g") is the same as /something/g
The below excerpts refer to ECMAScript 2017.
11.8.4.2 Static Semantics: StringValue
StringLiteral ::
"DoubleStringCharacters(opt)"
'SingleStringCharacters(opt)'
1. Return the String value whose elements are the SV of this StringLiteral.
11.8.4.3 Static Semantics: SV
A string literal stands for a value of the String type. The String
value (SV) of the literal is described in terms of code unit values
contributed by the various parts of the string literal.
Questions
In the excerpts above, the following terms appear:
string literal
Nonterminal symbol StringLiteral
String value
SV
Could someone help explain the difference between these terms?
Also, what does the last sentence in 11.8.4.2 mean?
A string literal is the thing that you, a human writing or reading code, can recognize as the sequence "..." or '...'
The token StringLiteral is a nonterminal in the formal grammar of ECMAScript that can be replaced by a terminal that is an actual string literal.
A string value is the semantic content of a string literal. The spec says
The String value (SV) of the literal is ...
Therefore, we may be sure that a string literal has a string value: the string value of some string literal is a collection of code unit values.
The identifier SV appears to be shorthand for (and used interchangeably with) "string value".
Also, what does the last sentence in 11.8.4.2 mean?
Every nonterminal "returns" some value when it is evaluated. The line
Return the String value whose elements are the SV of this StringLiteral.
simply means that when the parser finds a StringLiteral in the text of a program, the result of parsing that nonterminal is the string value (i.e., collection of code unit values) associated with the just-parsed StringLiteral.
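A quick way to see the gap between a string literal (source text) and its string value is to compare literals that are written differently but produce the same value. This is only a sketch; the variable names are made up.

```javascript
// Two different string literals (different source text)...
const escaped = "\u0041\x42C"; // 11 source characters between the quotes
const plain = 'ABC';           // 3 source characters between the quotes

// ...with the same String value: the code units for A, B and C.
console.log(escaped === plain); // true
console.log(escaped.length);    // 3

// An escape sequence contributes one code unit to the value.
const withNewline = "a\nb";
console.log(withNewline.length);        // 3
console.log(withNewline.charCodeAt(1)); // 10 (line feed)
```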
A lot of the terminology you're looking at is really of value to JavaScript platform maintainers; in practical terms, you almost certainly already know what a "string" is. The other terms are useful for reading the spec.
The term StringLiteral refers to a piece of JavaScript source code that a JavaScript programmer would look at and call "a string"; in other words, in
let a = "hello world";
the StringLiteral is that run of characters on the right side of the = from the opening double-quote to the closing double-quote. It's a "nonterminal" because it's not a "terminal" symbol in the definition of the grammar. Language grammars are built from terminal symbols at the lowest level and non-terminals to describe higher-level subsections of a program. The bold-faced double-quote characters you see in the description of a double-quoted string are examples of terminal symbols.
The term StringValue refers to an internal operation that applies to several components of the grammar; for StringLiteral it has the fairly obvious definition you posted. Semantic rules are written in terms of non-terminals that make up some grammar concept.
The term String value or SV is used for describing the piece-by-piece portions of a string.
The JavaScript spec is particularly wacky with terminology, because the language committee is stuck with describing semantics that evolved willy-nilly in the early years of language adoption. Inventing layers of terminology with much apparent redundancy is a way of coping with the difficulty of creating unambiguous descriptions of what bits of code are supposed to do, down to the last detail and weird special case. It's further complicated by the fact that (for reasons unknown to me) the lexical grammar is broken down in as much excruciating detail as are higher-level constructs, so that really compounds the nit-picky feel of the spec.
An example of when knowing that expanse of terminology would be useful might be an explanation of why it's necessary to "double-up" on backslashes when building a regular expression from a string literal instead of a regular expression literal. It's clear that a call to the RegExp constructor:
var r = new RegExp("foo\\.bar");
has an expression consisting of just one StringLiteral. To make the call to the constructor, then, the semantic rules for that operation will at some point call for getting the StringValue (and thus SV) of that literal, and those rules contain the details for every piece of the literal. That's where you come across the fact that the SV semantics have rules for backslashes, and in particular one that says two backslashes collapse to one.
Now I'm not saying that that explanation would be better than a simple explanation, but it's explicitly clear about every detail of the question.
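The backslash collapsing described above can be observed directly. A minimal sketch (variable names are illustrative):

```javascript
// The literal "foo\\.bar" has nine characters between the quotes,
// but its string value has eight: the two backslashes collapse to one.
const pattern = "foo\\.bar";
console.log(pattern.length); // 8
console.log(pattern);        // foo\.bar

// RegExp receives the string value, so it sees an escaped dot.
const r = new RegExp(pattern);
console.log(r.test("foo.bar")); // true
console.log(r.test("fooxbar")); // false

// With a single backslash the string value is just "foo.bar",
// so the dot reaches RegExp unescaped and matches any character.
const loose = new RegExp("foo\.bar");
console.log(loose.test("fooxbar")); // true
```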
I was learning about javascript string methods here.
Under section Extracting String Characters, it said:
There are 2 safe methods for extracting string characters:
charAt(position)
charCodeAt(position)
The questions here are:
Why these methods are called safe?
What are these methods protecting from?
There are two ways to access a character from a string.
// Bracket Notation
"Test String1"[6]
// Real Implementation
"Test String1".charAt(6)
It is a bad idea to use brackets, for these reasons (Source):
This notation does not work in IE7. The first code snippet will return undefined in IE7. If you happen to use the bracket notation for strings all over your code and you want to migrate to .charAt(pos), this is a real pain: Brackets are used all over your code and there's no easy way to detect if that's for a string or an array/object.
You can't set the character using this notation. As there is no warning of any kind, this is really confusing and frustrating. If you were using the .charAt(pos) function, you would not have been tempted to do it.
Also, it can produce unexpected results in edge cases
console.log('hello' [NaN]) // undefined
console.log('hello'.charAt(NaN)) // 'h'
console.log('hello' [true]) //undefined
console.log('hello'.charAt(true)) // 'e'
Basically, it's a short-cut notation that is not fully implemented across all browsers.
Note, you are not able to write characters using either method. However, that functionality is a bit easier to understand with the .charAt() function which, in most languages, is a read-only function.
So, for compatibility purposes, .charAt is considered safe.
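Another practical difference shows up with out-of-range positions and with attempted writes. A short sketch, assuming a modern engine (variable names are made up):

```javascript
const s = "abc";

// Out-of-range position: brackets give undefined, charAt gives "".
console.log(s[10]);            // undefined
console.log(s.charAt(10));     // ""
console.log(s.charCodeAt(10)); // NaN

// Strings are immutable: writing through brackets changes nothing.
// In sloppy mode the assignment is silently ignored; in strict mode
// (including ES modules) it throws a TypeError.
let word = "cat";
try {
  word[0] = "b";
} catch (e) {
  // strict mode ends up here
}
console.log(word); // "cat"
```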
Source
Speed Test: http://jsperf.com/string-charat-vs-bracket-notation
Testing in Chrome 47.0.2526.80 on Mac OS X 10.10.4
Test | Code | Ops/sec
String charAt | testCharAt("cat", 1); | 117,553,733 ±1.25% (fastest)
String bracket notation | testBracketNotation("cat", 1); | 118,251,955 ±1.56% (fastest)
In the pursuit of understanding JavaScript/OOP better, I'm curious how regular expression argument parameters are handled in JavaScript. I already understand a lot about regular expressions, so this isn't about interpreting patterns. This is about identifying how JavaScript handles it.
Example:
newStr = str.replace(/(^\W*|\W*$)/gi,'');
This basically trims any special characters and white-space from a string. However, /(^\W*|\W*$)/gi is not an encapsulated string, therefore, it baffles me to understand this concept since the JS object is not a string, nor a number. Is this object-type alone (i.e., regex-only), or does it serve other purposes?
It's just a special syntax that JavaScript has for regular expressions. It evaluates to an object, and is no different than:
var rex = /(^\W*|\W*$)/gi;
newStr = str.replace(rex, '');
Or:
var rex = new RegExp('^\\W*|\\W*$', 'gi');
The RegExp MDN documentation has plenty of detailed info.
Regexes are first-class citizens in JavaScript, i.e., they are a separate object type.
You can construct a new RegExp object using its standard constructor:
var regex = new RegExp("(^\\W*|\\W*$)", "gi");
or using the special "regex literal" notation that allows you to cut down on backslashes:
var regex = /(^\W*|\W*$)/gi;
/(^\W*|\W*$)/gi is a regular expression literal, which is an object type in JavaScript. This type can be passed as the first parameter to the replace method, which accepts either a regex or a substring.
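To show the two kinds of first argument side by side, here is a small sketch; the sample strings are made up for illustration.

```javascript
const text = "a-b-c";

// A substring argument replaces only the first occurrence.
console.log(text.replace("-", "+"));  // "a+b-c"

// A regex argument with the g flag replaces every match.
console.log(text.replace(/-/g, "+")); // "a+b+c"

// The pattern from the question, applied to a sample string:
// strips non-word characters from both ends.
const messy = "  !!hello world!! ";
const trimmed = messy.replace(/(^\W*|\W*$)/gi, "");
console.log(trimmed);                 // "hello world"
```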
Is this object-type alone (i.e., regex-only)
This is correct. RegExp objects are a special type of value that's built-in to the language. They are one of only a handful of types that have "literal" representations in JavaScript.
This does make them fairly unique; there aren't any other special-purpose literals in the language. The other literals are generic types like:
null
boolean values (true/false)
numbers (1.0, 2e3, -5)
strings ('hello', "goodbye")
Arrays ([1, 2, 3])
Objects ({ name: "Bob", age: 18 })
To add to the people saying largely the same thing:
On top of the fact that it's a literal with its own syntax, you can actually access its methods in literal form:
/bob/gi.exec(" My name is Bob ");
...so long as the browser you're using is young enough to indeed support RegEx literals (it's pretty hard to find one that doesn't, these days, and if you do, does the browser support CSS?).
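For what it's worth, here is what that call returns in a current engine (a sketch; `match` is just an illustrative name):

```javascript
// Methods can be called directly on the literal, since it is an object.
const match = /bob/gi.exec(" My name is Bob ");
console.log(match[0]);    // "Bob"
console.log(match.index); // 12

// The same goes for test():
console.log(/\d+/.test("room 101")); // true
```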
I read in Javascript: The Good Parts by Douglas Crockford that javascript regular expression literals share the same object. If so, then how come these two regex literals vary in the lastIndex property?
var a = /a/g;
var b = /a/g;
a.lastIndex = 3;
document.write(b.lastIndex);
JS Fiddle
The output is 0, not 3.
Section 7.8.5 of the ECMAScript specification makes it quite clear they are two different objects:
7.8.5 Regular Expression Literals
A regular expression literal is an input element that is converted to a RegExp object (see 15.10) each time the literal is evaluated. Two regular expression literals in a program evaluate to regular expression objects that never compare as === to each other even if the two literals' contents are identical. A RegExp object may also be created at runtime by new RegExp (see 15.10.4) or calling the RegExp constructor as a function (15.10.3).
Because they are different objects.
document.write(a === b);
Even this outputs false.
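Here is a version of the same check that runs outside a browser, plus the "each time the literal is evaluated" case. This is a sketch assuming an ES5-or-later engine; `makeRegex` is a hypothetical helper.

```javascript
// Two separate literals: two separate objects with independent state.
const a = /a/g;
const b = /a/g;
a.lastIndex = 3;
console.log(b.lastIndex); // 0
console.log(a === b);     // false

// Even a single literal yields a fresh object on every evaluation
// (ES5 and later), so the two results below do not share lastIndex.
function makeRegex() {
  return /a/g;
}
const first = makeRegex();
const second = makeRegex();
first.lastIndex = 1;
console.log(first === second); // false
console.log(second.lastIndex); // 0
```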
Either Crockford was wrong, or he was right at the time but times have changed.
I realize this isn't a particularly helpful or informative answer; I'm just pushing back on what I perceive as your disbelief that something Crockford wrote could be (now) false.
Do you have a reference to that claim, by the way? Would be interesting to read it in context (I don't have the book).
Is there any difference between using new RegExp("regex"); and /same_regex/ to test against a target string? I am asking because I got different validation results when using these two approaches. Here is the snippet I used to validate an email field:
var email="didxga#gmail.comblah#foo.com";
var regex1 = new RegExp("^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$");
var regex2 = /^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$/;
//using RegExp object
if(regex1.test(email)) {
console.log("email matched regex1");
} else {
console.log("email mismatched regex1");
}
//using slash notation
if(regex2.test(email)) {
console.log("email matched regex2");
} else {
console.log("email mismatched regex2");
}
I got two inconsistent results:
email matched regex1
email mismatched regex2
I am wondering if there is any difference here or I omitted something in this specific example?
For an executable example please refer to here
If you use the constructor to create a new RegExp object instead of the literal syntax, you need to escape the \ properly:
new RegExp("^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$")
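This is necessary because, inside a JavaScript string literal, an unrecognized escape sequence such as \. is interpreted as the character itself. So in this case \. becomes a plain ., and the regex engine receives an unescaped dot that matches any character.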
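A shorter pattern makes the effect easier to see. The following is only a sketch with made-up variable names; it compares what each form actually hands to the regex engine.

```javascript
const literal = /a\.b/;
const wrong = new RegExp("a\.b");  // string value is "a.b"
const right = new RegExp("a\\.b"); // string value is "a\.b"

console.log(literal.source); // a\.b
console.log(wrong.source);   // a.b
console.log(right.source);   // a\.b

// The unescaped dot matches any character, so "axb" slips through.
console.log(wrong.test("axb")); // true
console.log(right.test("axb")); // false
console.log(right.test("a.b")); // true
```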
/.../ is called a regular expression literal. new RegExp uses the RegExp constructor function and creates a Regular Expression Object.
From Mozilla's developer pages
Regular expression literals provide compilation of the regular expression when the script is evaluated. When the regular expression will remain constant, use this for better performance.
Using the constructor function provides runtime compilation of the regular expression. Use the constructor function when you know the regular expression pattern will be changing, or you don't know the pattern and are getting it from another source, such as user input.
This may help: http://www.regular-expressions.info/javascript.html (see the 'How to Use The JavaScript RegExp Object' section).
If you use new RegExp(regx), regx should be a string. For example, \w+ can be created as regx = /\w+/ or as regx = new RegExp("\\w+").
The difference, at least in your case, is in the escaping.
With the / / notation you have to escape '/' as '\/'; with the RegExp constructor you have to escape quotes and backslashes inside the string.
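Roughly, the two escaping rules look like this in practice (a sketch; the patterns and names are made up for illustration):

```javascript
// In a literal, "/" would end the pattern, so it must be written as "\/".
const fromLiteral = /^\/path\/to/;

// In a string, "/" needs no escaping...
const fromString = new RegExp("^/path/to");

console.log(fromLiteral.test("/path/to/file")); // true
console.log(fromString.test("/path/to/file"));  // true

// ...but the string's own quote character does.
const quotedString = new RegExp("say \"hi\"");
const quotedLiteral = /say "hi"/;

console.log(quotedString.test('they say "hi" often'));  // true
console.log(quotedLiteral.test('they say "hi" often')); // true
```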