JavaScript Objects: How do Regular Expression objects get passed?

In the pursuit of understanding JavaScript/OOP better, I'm curious how regular expression argument parameters are handled in JavaScript. I already understand a lot about regular expressions, so this isn't about interpreting patterns. This is about identifying how JavaScript handles it.
Example:
newStr = str.replace(/(^\W*|\W*$)/gi,'');
This basically trims any special characters and white-space from the ends of a string. However, /(^\W*|\W*$)/gi is not a quoted string, which baffles me, since the value is neither a string nor a number. Is this object type used for regular expressions alone, or does it serve other purposes?

It's just a special syntax that JavaScript has for regular expressions. It evaluates to an object, and is no different than:
var rex = /(^\W*|\W*$)/gi;
newStr = str.replace(rex, '');
Or:
var rex = new RegExp('^\\W*|\\W*$', 'gi');
The RegExp MDN documentation has plenty of detailed info.

Regexes are first-class citizens in JavaScript, i.e., they are a separate object type.
You can construct a new RegExp object using its standard constructor:
var regex = new RegExp("(^\\W*|\\W*$)", "gi");
or using the special "regex literal" notation that allows you to cut down on backslashes:
var regex = /(^\W*|\W*$)/gi;

/(^\W*|\W*$)/gi is a regular expression literal, which is an object type in JavaScript. This type can be passed as the first parameter to the replace method, which accepts either a regex or a substring.
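A minimal sketch of the points made in the answers above, assuming a current JavaScript engine. The variable names and sample strings are illustrative only. It shows that a regex literal evaluates to an ordinary RegExp object, and that replace accepts either a regex or a plain substring:

```javascript
// A regex literal evaluates to a RegExp object that can be stored and passed around.
var trimPattern = /(^\W*|\W*$)/gi;

var kind = typeof trimPattern;                 // "object" in current engines
var isRegExp = trimPattern instanceof RegExp;  // true

// replace() accepts a RegExp as its first argument...
var trimmed = "  ...hello world!!  ".replace(trimPattern, "");

// ...or a plain substring, which is matched literally (first occurrence only).
var swapped = "a.b.c".replace(".", "-");

console.log(kind, isRegExp, trimmed, swapped);
```

Note that the string form replaces only the first occurrence, whereas the regex with the g flag replaces every match.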

Is this object-type alone (i.e., regex-only)
This is correct. RegExp objects are a special type of value that's built-in to the language. They are one of only a handful of types that have "literal" representations in JavaScript.
This does make them fairly unique; there aren't any other special-purpose literals in the language. The other literals are generic types like:
null
boolean values (true/false)
numbers (1.0, 2e3, -5)
strings ('hello', "goodbye")
Arrays ([1, 2, 3])
Objects ({ name: "Bob", age: 18 })

To add to the people saying largely the same thing:
On top of the fact that it's a literal with its own syntax, you can actually access its methods in literal form:
/bob/gi.exec(" My name is Bob ");
...so long as the browser you're using is young enough to indeed support RegEx literals (it's pretty hard to find one that doesn't, these days, and if you do, does the browser support CSS?).
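As a rough illustration of calling methods directly on a literal, here is what exec returns for the example above (variable names are illustrative):

```javascript
// Methods can be called directly on a regex literal, with no intermediate variable.
var match = /bob/gi.exec(" My name is Bob ");

var matchedText = match[0];   // the matched substring, "Bob"
var matchedAt = match.index;  // zero-based position of the match, 12

// test() works the same way and returns a boolean.
var found = /bob/i.test(" My name is Bob ");

console.log(matchedText, matchedAt, found);
```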

Related

XRegExp to replace Unicode characters in IE

I developed a javascript function to clean a range of Unicode characters. For example, "ñeóñú a1.txt" => "neonu a1.txt". For this, I used a regular expression:
var = new RegExp patternA ("[\\u0300-\\u036F]", "g");
name = name.replace (patternA,'');
But it does not work properly in IE. If my research is correct, IE does not detect Unicode in the same way. I'm trying to make an equivalent function using the library XRegExp (http://xregexp.com/), which is compatible with all browsers, but I don't know how to write the Unicode pattern so XRegExp works in IE.
One of the failed attempts:
XRegExp.replace(name,'\\u0300-\\u036F','');
How can I build this pattern?
The value provided as the XRegExp.replace method's second argument should be a regular expression object, not a string. The regex can be built by the XRegExp or the native RegExp constructor. Thus, the following two lines are equivalent:
name = name.replace(/[\u0300-\u036F]/g, '');
// Is equivalent to:
name = XRegExp.replace(name, /[\u0300-\u036F]/g, '');
The following line you wrote, however, is not valid:
var = new RegExp patternA ("[\\u0300-\\u036F]", "g");
Instead, it should be:
var patternA = new RegExp ("[\\u0300-\\u036F]", "g");
I don't know if that is the source of your problem, but perhaps. For the record, IE's Unicode support is as good as or better than that of other browsers.
XRegExp can let you identify your block by name, rather than using magic numbers. XRegExp('[\\u0300-\\u036F]') and XRegExp('\\p{InCombiningDiacriticalMarks}') are exactly equivalent. However, the marks in that block are a small subset of all combining marks. You might actually want to match something like XRegExp('\\p{M}'). However, note that simply removing marks like you're doing is not a safe way to remove diacritics. Generally, what you're trying to do is a bad idea and should be avoided, since it will often lead to wrong or unintelligible results.
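For comparison, a sketch that uses the native String.prototype.normalize (ES2015) instead of XRegExp. This swaps in a different technique from the one in the answer and assumes an engine that supports normalize. The helper name stripCombiningMarks is hypothetical, and the caveat above still applies: removing marks is not a safe general way to strip diacritics.

```javascript
// Hypothetical helper: decompose accented characters, then drop the combining marks.
function stripCombiningMarks(text) {
  // NFD splits "ñ" into "n" followed by U+0303 (combining tilde).
  var decomposed = text.normalize("NFD");
  // Remove everything in the Combining Diacritical Marks block.
  return decomposed.replace(/[\u0300-\u036F]/g, "");
}

console.log(stripCombiningMarks("\u00F1e\u00F3\u00F1\u00FA a1.txt")); // "neonu a1.txt"
```

Without the normalize step, precomposed characters such as U+00F1 contain no separate combining mark, so the range [\u0300-\u036F] has nothing to match.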

Why do two regex literals in my Javascript vary on a property?

I read in Javascript: The Good Parts by Douglas Crockford that javascript regular expression literals share the same object. If so, then how come these two regex literals vary in the lastIndex property?
var a = /a/g;
var b = /a/g;
a.lastIndex = 3;
document.write(b.lastIndex);
0 is outputted as opposed to 3.
Section 7.8.5 of the ECMAScript Documentation makes it quite clear they are two different objects:
7.8.5 Regular Expression Literals
A regular expression literal is an input element that is converted to a RegExp object (see 15.10) each time the literal is evaluated. Two regular expression literals in a program evaluate to regular expression objects that never compare as === to each other even if the two literals' contents are identical. A RegExp object may also be created at runtime by new RegExp (see 15.10.4) or calling the RegExp constructor as a function (15.10.3).
Because they are different objects.
document.write(a === b);
Even this outputs false.
Either Crockford was wrong, or he was right at the time but times have changed.
I realize this isn't a particularly helpful or informative answer; I'm just pushing back on what I perceive as your disbelief that something Crockford wrote could be (now) false.
Do you have a reference to that claim, by the way? Would be interesting to read it in context (I don't have the book).
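The behaviour described in the quoted section can be checked directly. This is a small sketch assuming an ES5 or later engine; the function name makePattern is made up for the example:

```javascript
var a = /a/g;
var b = /a/g;
a.lastIndex = 3;

var sameObject = (a === b);    // false: two literals, two objects
var bLastIndex = b.lastIndex;  // 0: b is unaffected by the change to a

// Under the ES5 wording quoted above, each evaluation of a literal yields a new object.
function makePattern() {
  return /a/g;
}
var freshEachTime = (makePattern() !== makePattern());

console.log(sameObject, bLastIndex, freshEachTime);
```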

How to add special characters like & > in XML file using JavaScript

I am generating XML using Javascript. It works fine if there are no special characters in the XML. Otherwise, it will generate this message: "invalid xml".
I tried to replace some special characters, like:
xmlData=xmlData.replaceAll(">","&gt;");
xmlData=xmlData.replaceAll("&","&amp;");
//but it doesn't work.
For example:
<category label='ARR Builders & Developers'>
Thanks.
Consider generating the XML using DOM methods. For example:
var c = document.createElement("category");
c.setAttribute("label", "ARR Builders & Developers");
var s = new XMLSerializer().serializeToString(c);
s; // => "<category label=\"ARR Builders &amp; Developers\"></category>"
This strategy should avoid the XML entity escaping problems you mention but might have some cross-browser issues.
This will do the replacement in JavaScript:
xml = xml.replace(/</g, "&lt;");
xml = xml.replace(/>/g, "&gt;");
This uses regular expression literals to replace all less than and greater than symbols with their escaped equivalent.
JavaScript comes with a powerful replace() method for string objects.
In general - and basic - terms, it works this way:
var myString = yourString.replace([regular expression or simple string], [replacement string]);
The first argument to .replace() method is the portion of the original string that you wish to replace. It can be represented by either a plain string object (even literal) or a regular expression.
The regular expression is obviously the most powerful way to select a substring.
The second argument is the string object (even literal) that you want to provide as a replacement.
In your case, the replacement operation should look as follows:
xmlData=xmlData.replace(/&/g,"&amp;");
xmlData=xmlData.replace(/>/g,"&gt;");
//this time it should work.
Notice that the ampersand is replaced first. If you replaced it later, you would corrupt the entities produced by the earlier replacements: "&gt;" would become "&amp;gt;".
In addition, pay attention to the regex 'g' flag, as with it the replacement will take place all throughout your text, not only on the first match.
I used regular expressions, but for simple replacements like these plain strings would also work, provided you remember that replace() with a string pattern replaces only the first occurrence.
You can find a complete reference for String.replace() here.
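Putting the advice above together, a minimal sketch of an escaping helper. The name escapeXml is hypothetical, and the quote replacements are an addition that only matters for attribute values:

```javascript
// Hypothetical helper: escape the five characters that are special in XML.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")   // must run first, so later entities are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

var label = escapeXml("ARR Builders & Developers");
var xmlData = "<category label='" + label + "'/>";
console.log(xmlData); // <category label='ARR Builders &amp; Developers'/>

// Without the g flag, only the first match is replaced.
var firstOnly = "a & b & c".replace(/&/, "&amp;");
console.log(firstOnly); // a &amp; b & c
```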

What is the difference between using "new RegExp" and using forward slash notation to create a regular expression?

Is there any difference between using new RegExp("regex"); and /same_regex/ to test against a target string? I am asking this question because I got different validating result while use these two approaches. Here is the snippet I used to validate an email field:
var email="didxga#gmail.comblah#foo.com";
var regex1 = new RegExp("^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$");
var regex2 = /^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$/;
//using RegExp object
if(regex1.test(email)) {
console.log("email matched regex1");
} else {
console.log("email mismatched regex1");
}
//using slash notation
if(regex2.test(email)) {
console.log("email matched regex2");
} else {
console.log("email mismatched regex2");
}
I got two inconsistent results:
email matched regex1
email mismatched regex2
I am wondering if there is any difference here or I omitted something in this specific example?
For an executable example please refer to here
If you use the constructor to create a new RegExp object instead of the literal syntax, you need to escape the \ properly:
new RegExp("^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$")
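This is necessary because, in a JavaScript string literal, a backslash followed by a character that does not form a recognized escape sequence is dropped and the character is kept. So in this case the \. in the string becomes a plain ., and the resulting pattern matches any character where you intended a literal dot.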
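A short sketch of that difference, using a deliberately small pattern rather than the full email regex (the variable names are illustrative):

```javascript
// In a string literal, "\." is just "." because the backslash is consumed by the string.
var sameString = ("\." === ".");                 // true

var unescaped = new RegExp("^a\.b$");            // pattern is ^a.b$  (dot matches any character)
var escaped = new RegExp("^a\\.b$");             // pattern is ^a\.b$ (dot matches a literal ".")
var literal = /^a\.b$/;                          // same pattern as `escaped`

console.log(unescaped.source, escaped.source, literal.source);
console.log(unescaped.test("axb"), escaped.test("axb"), literal.test("axb"));
```

Only the unescaped version accepts "axb", which is the same kind of mismatch seen between regex1 and regex2 in the question.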
/.../ is called a regular expression literal. new RegExp uses the RegExp constructor function and creates a Regular Expression Object.
From Mozilla's developer pages
Regular expression literals provide compilation of the regular expression when the script is evaluated. When the regular expression will remain constant, use this for better performance.
Using the constructor function provides runtime compilation of the regular expression. Use the constructor function when you know the regular expression pattern will be changing, or you don't know the pattern and are getting it from another source, such as user input.
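To illustrate the runtime case, here is a sketch that builds a pattern from a value known only at runtime. The helper escapeRegExp is not a built-in; it is a hypothetical function written for this example, and userInput stands in for whatever the real source of the pattern would be:

```javascript
// Hypothetical helper: escape characters that have special meaning in a pattern,
// so that runtime input is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// The pattern is only known at runtime, so the constructor is the natural fit.
var userInput = "1+1=2";  // stands in for a value read from a form field
var dynamicPattern = new RegExp(escapeRegExp(userInput), "g");

var result = "1+1=2 and 1+1=2".replace(dynamicPattern, "ok");
console.log(result); // "ok and ok"
```

Escaping is a design choice here: without it, the + in the input would be treated as a quantifier and the pattern would match "11=2" instead.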
This may help:
http://www.regular-expressions.info/javascript.html
See the 'How to Use The JavaScript RegExp Object' section.
If you use RegExp(regx), regx is normally passed as a string. For example,
\w+ can be created as regx = /\w+/ or as regx = new RegExp("\\w+").
The difference is in the escaping, at least in your case.
When you use the / / notation, you have to escape '/' as '\/'. When you use the RegExp constructor, you have to escape quotes and backslashes in the string instead.

Are /regex/ Literals always RegExp Objects?

Basically, my question is about how Javascript handles regex literals.
Contrasting with number, string and boolean where literals are primitive data types and corresponding Number, String and Boolean objects exist with seamless type conversion, are regex literals anonymous instances of the RegExp object or is this a case of regex being treated like primitive data with seamless type conversion to RegExp?
"The complete Reference Javascript, 2nd edition, Powell and Schneider (MH)" contradicts itself - at one place the authors say that /regex/ is automatically typecasted into RegExp when needed and at another place they say that /regex/ is nothing but an instance of RegExp!
EDIT: Please provide a reference to a reliable source
Here's what the spec has to say:
A regular expression literal is an input element that is converted to a RegExp object when it is scanned. The object is created before evaluation of the containing program or function begins. Evaluation of the literal produces a reference to that object; it does not create a new object. Two regular expression literals in a program evaluate to regular expression objects that never compare as === to each other even if the two literals' contents are identical.
There is no primitive regex type that autoboxes to an object in the same way as string or number.
Note, however, that not all browsers implement the "instantiate-once-per-literal" behavior, including Safari and IE6 (and possibly later), so portable code shouldn't depend on it. The abortive ECMAScript 4 draft would have changed the behavior to match those browsers:
In ES3 a regular expression literal like /a*b/mg denotes a single unique RegExp object that is created the first time the literal is encountered during evaluation. In ES4 a new RegExp object is created every time the literal is encountered during evaluation.
Also, some browsers (Firefox <3, Safari) report typeof /regex/ as "function", so portable code should avoid typeof on RegExp instances—stick with instanceof.
Yes, the following two expressions are equivalent:
var r1 = /ab+c/i,
r2 =new RegExp("ab+c", "i");
The constructor property of both points to the RegExp constructor function:
(/ab+c/i).constructor === RegExp // true
r2.constructor === RegExp // true
And a regexp literal is an instance of RegExp:
/ab+c/i instanceof RegExp // true
The basic difference is that defining regular expressions using the constructor function allows you to build and compile an expression from a string. This can be very useful for constructing complex expressions that will be re-used.
Yes, new RegExp("something", "g") is the same as /something/g
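A brief sketch of what "the same" means here, assuming a current engine (the variable names are illustrative). The two forms produce equivalent RegExp instances, but never the same object:

```javascript
var fromLiteral = /ab+c/i;
var fromConstructor = new RegExp("ab+c", "i");

// Both are RegExp instances with the same pattern and flags...
var sameSource = (fromLiteral.source === fromConstructor.source);         // true
var sameFlags = (fromLiteral.ignoreCase === fromConstructor.ignoreCase);  // true

// ...but they are still two distinct objects.
var identical = (fromLiteral === fromConstructor);  // false

// A check that does not rely on typeof:
var tag = Object.prototype.toString.call(fromLiteral);  // "[object RegExp]"

console.log(sameSource, sameFlags, identical, tag, fromLiteral instanceof RegExp);
```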
