I am writing JavaScript to Python translator and "\8" and "\9" are causing me lot's of problems. According to the documentation something like "\8" or "\9" is illegal since they are not valid octal escapes. Esprima parser throws exception on such literal. However JS engines they seem to allow them and they evaluate to "8" and "9" respectively.
Therefore:
/\8/.exec("\8")
RegExp('\\8').exec('\8')
/\8/.exec("8")
RegExp('\\8').exec('8')
Should all return a match since /\8/ should be the same as /8/. However the results are inconsistent across JS engines and some return a match while others don't (for example Safari's).
What's the reason for all these differences? And what is the right way - how to handle other cases involving these literals?
You're right that the spec does not allow for it but no one ever said that JS engines are perfect.
The "right" way to handle those cases is to report them as a syntax error, given that this isn't valid in JS nor Python*.
*As far as I know. I don't write a lot of Python but a quick Googling seems to indicate it isn't.
Related
I'm refactoring a rather large RegExp into a function that returns a RegExp. As a backward-compatibility test, I compared the .source of the returned RegExp with the .source of the old RegExp:
getRegExp(/* in the case requiring backward compatibility there's no arguments */)
.source == oldRegExp.source
However, I've noticed that the old RegExp contains various excessive backslashes like [\.\w] instead of [.\w]. I'd like to refactor such bits, but there's a number of them and it would be nice to have a similar check (backward compability is not broken). The problem is, /[\.\w]/.source != /[.\w]/.source. And identifying which backslashes may be removed automatically is not trivial (\. and . are not the same outside [...] and may be in some other cases).
Are you aware of somewhat simple ways to do so? It seems this can only be done by actual parsing of the .source (compare the example above with /\[\.\w]\/ and /\[.\w]\/), but may be I'm missing some trick of utilizing browser's built-in properties/methods. The point is, '\"' == '"' is true, so strings defined with these different syntaxes are stored as "normalized" values ("), I wonder if such "normalized" pattern is available for a RegExp.
Sadly, comparing two regular expressions to see if they're the same is exactly the same as comparing any other two pieces of code - ie, hard.
The only real way I know of to do this is to create a suite of tests, each one targeting a specific aspect of the regular expression and verifying that it works properly. This is not an easy process-regular expressions are subtle and complex with a lot of potential for unrealized side effects. I recently had to fix some defects in a regex based address parser and it took about a thousand unit tests before I was satisfied with my coverage... but then as soon as I started to change the regex MY TESTS CAUGHT STUFF CONSTANTLY!!
Unit testing sucks and it's just tiring and not fun, but for almost any piece of logic it has real value, and when using powerful tools like regex, I would say it's absolutely crucial.
I have applied the same method to replace "-" with "_" in c++ and it is working properly but in javascript, it is not replacing at all.
function replace(str)
{
for(var i=0;i<str.length;i++)
{
if(str.charAt(i)=="-")
{
str.charAt(i) = "_";
}
}
return str;
}
It's simple enough in javascript, it's not really worth making a new function:
function replace(str){
return str.replace(/-/g, '_') // uses regular expression with 'g' flag for 'global'
}
console.log(replace("hello-this-is-a-string"))
This does not alter the original string, however, because strings in javascript are immutable.
If you are dead set on avoiding the builtin(maybe you want to do more complex processing), reduce() can be useful:
function replace(str){
return [...str].reduce((s, letter) => s += letter == '-' ? '_' : letter , '')
}
console.log(replace("hello-this-is-a-string"))
This is yet another case of "make sure you read the error message". In the case of
str.charAt(i) = "_";
the correct description of what happens is not "it is not replacing at all", as you would have it; it is "it generates a run-time JavaScript error", which in Chrome is worded as
Uncaught ReferenceError: Invalid left-hand side in assignment
In Firefox, it is
ReferenceError: invalid assignment left-hand side
That should have given you the clue you needed to track down your problem.
I repeat: read error messages closely. Then read them again, and again. In the great majority of cases, they tell you exactly what the problem is (if you only take the time to try to understand them), or at least point you in right direction.
Of course, reading the error message assumes you know how to view the error message. Do you? In most browsers, a development tools window is available--often via F12--which contains a "console", displaying error messages. If you don't know about devtools, or the console, then make sure you understand them before you write a single line of code.
Or, you could have simply searched on Stack Overflow, since (almost) all questions have already been asked there, especially those from newcomers. For example, I searched in Google for
can i use charat in javascript to replace a character in a string?
and the first result was How do I replace a character at a particular index in JavaScript?, which has over 400 upvotes, as does the first and accepted answer, which reads:
In JavaScript, strings are immutable, which means the best you can do is create a new string with the changed content, and assign the variable to point to it.
As you learn to program, or learn a new languages, you will inevitably run into things you don't know, or things that confuse you. What to do? Posting to Stack Overflow is almost always the worst alternative. After all, as you know, it's not a tutorial site, and it's not a help site, and it's not a forum. It's a Q&A site for interesting programming questions.
In the best case, you'll get snarky comments and answers which will ruin your day; in the worst case, you'll get down-voted, and close-voted, which is not just embarrassing, but may actually prevent you from asking questions in the future. Since you want to make sure you are able to ask questions when you really need to, you are best off taking much more time doing research, including Google and SO searches, on simple beginner questions before posting. Or find a forum which is designed to help new folks. Or ask the person next to you if there is one. Or run through one or more tutorials.
But why write it yourself at all?
However, unless you are working on this problem as a way of teaching yourself JavaScript, as a kind of training exercise, there is no reason to write it at all. It has already been written hundreds, or probably thousands, of times in the history of computing. And the overwhelming majority of those implementations are going to be better, faster, cleaner, less buggy, and more featureful than whatever you will write. So your job as a "programmer" is not to write something that converts dashes to underscores; it's to find and use something that does.
As the wise man said, today we don't write algorithms any more; we string together API calls. Our job is to find, and understand, the APIs to call.
Finding the API is not at all hard with Google. In this case, it could be helpful if you knew that strings with underscores are sometimes called "snake-cased", but even without knowing that you can find something on the first page of Google results with a query such as "javascript convert string to use underscores library".
The very first result, and the one you should take a look at, is underscore.string, a collection of string utilities written in the spirit of the versatile "underscore" library, and designed to be used with it. It provides an API called underscored. In addition to dealing with "dasherized" input (your case), it also handles other string/identifier formats such as "camelCase".
Even if you don't want to import this particular library and use it (which you probably should), you would be much better off stealing or cloning its code, which in simplified form is:
str
.replace(/([a-z\d])([A-Z]+)/g, '$1_$2')
.replace(/[-\s]+/g, '_')
This is not as complicated as it looks. The first part is to deal with camelCase input, so that "catsAndDogs" becomes "cats-and-dogs". the second line is what handles the dashes and spaces). Look closely--it replaces runs of multiple dashes with a single underscore. That could easily be something that you want to do too, depending on who is doing what with the transformed string downstream. That's a perfect example of something that someone else writing a professional-level library for this task has thought of that you might not when writing your home-grown solution.
Note that this well-regarded, finely-turned library does not use the split/join trick to do global replaces in strings, as another answer suggests. That approach went out of use almost a decade ago.
Besides saving yourself the trouble of writing it yourself, and ending up with better code, if you take time time to understand what it's doing, you will also end up knowing more about regular expressions.
You can easily replace complete string using .split() and .join().
function replace(str){
return str.split("-").join("_");
}
console.log(replace("hello-this-is-a-string"))
In javascript, I need to take a string and HTML un-escape it.
This question over here asks the same question, and the most popular answer involves populating a temporary div.
I've used this as well, but I think I've found a bug.
Simple example, correct behavior
If you have this string: Cats>Dogs
Unescaped, it should be: Cats>Dogs
Malformed example, wrong behavior
If you remove the semicolon and use this instead:Cats>Dogs
You will get this as a result: Cats>Dogs
Isn't that wrong?
This struck me as odd. From what I understand, an escaped string requires the presence of a terminating semicolon, otherwise it's not escaped. After all, what if I had a store called guitars&s? For all we know, this company exists but gets no business because it causes null reference exceptions everywhere it has records.
Any ideas on how I could perform escaping while knowingly avoiding escaping when the semicolon is missing? Currently, all I can think to do is perform the unescaping myself.
(The WYSIWYG preview in StackOverflow exhibits a similar unusual behavior, by the way. Try entering &gt;, this renders as >!)
Isn't that wrong?
Successful HTML parsers are tolerant. This is one of the things distinguishing them from, say, XML parsers. They don't necessarily stick to strict rules about markup, for the simple reason that there's a lot of incorrect markup out there. So they try to figure out what the markup is meant to represent. >Dogs is more likely to mean >Dogs than >Dogs, so that's what the parser goes with.
I was wondering about the working of parentheses in Javascript, so I wrote this code to test:
((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((
4+4
))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
Which consists in:
( x1174
4+4
) x1174
I tested the code above on Google Chrome 20 (Win64), and it gives me the right answer (8).
But if I try the same code, but with 1175 parentheses (on both sides), I get a stackoverflow error.
You can check this code in JSFiddle (Note: in JSFiddle it stops working with 1178 parentheses)
So, my questions are:
Why does it happen?
Why does it stop working with 1178 parentheses on JSFiddle but with only 1175 on my blank page?
Does this error depend of the page/browser/os?
Often languages are parsed by code designed along a pattern called recursive descent. I don't know for sure that that's the case here, but certainly the "stack overflow" error is a big piece of evidence.
The idea is that to parse an expression, you approach the syntax by looking at what an expression can be. A parenthesized expression is like an "expression within an expression". Thus, for a parser to systematically parse some expression in code it's seeing for the first time (which, for a parser, is its eternal fate), a left parenthesis means "ok - hold on to what you're doing (on the stack), and go start from the beginning of what an expression can possibly be and parse a fresh, complete expression, and come back when you see the matching close paren".
Thus, a string of a thousand or more parenthesis triggers an equivalent cascade of that same activity: put what we've got on a shelf; dive in and get a sub-expression, and then resume when we know what it looks like.
Now this is not the only way to parse something, it should be noted. There are many ways. I personally am a huge fan of recursive descent parsing, but there's nothing particularly special about it (except that I think it'll someday result in me seeing a real unicorn).
The behavior is different in different browsers because they have different implementations of Javascript. The language doesn't specify how something like this should fail, so each implementation fails in a different way.
The difference between JSFiddle and your blank page is because JSFiddle itself uses a few stack frames to establish the environment in which to run your code.
I'm trying to understand how JS is actually parsed. But my searches either return some ones very vaguely documented project of a "parser/generator" (i don't even know what that means), or how to parse JS using a JS Engine using the magical "parse" method. I don't want to scan through a bunch of code and try all my life to understand (although i can, it would take too long).
i want to know how an arbitrary string of JS code is actually turned into objects, functions, variables etc. I also want to know the procedures, and techniques that turns that string into stuff, gets stored, referenced, executed.
Are there any documentation/references for this?
Parsers probably work in all sorts of ways, but fundamentally they first go through a stage of tokenisation, then give the result to the compiler, which turns it into a program if it can. For example, given:
function foo(a) {
alert(a);
}
the parser will remove any leading whitespace to the first character, the letter "f". It will collect characters until it gets something that doesn't belong, the whitespace, that indicates the end of the token. It starts again with the "f" of "foo" until it gets to the "(", so it now has the tokens "function" and "foo". It knows "(" is a token on its own, so that's 3 tokens. It then gets the "a" followed by ")" which are two more tokens to make 5, and so on.
The only need for whitespace is between tokens that are otherwise ambiguous (e.g. there must be either whitespace or another token between "function" and "foo").
Once tokenisation is complete, it goes to the compiler, which sees "function" as an identifier, and interprets it as the keyword "function". It then gets "foo", an identifier that the language grammar tells it is the function name. Then the "(" indicates an opening grouping operator and hence the start of a formal parameter list, and so on.
Compilers may deal with tokens one at a time, or may grab them in chunks, or do all sorts of weird things to make them run faster.
You can also read How do C/C++ parsers work?, which gives a few more clues. Or just use Google.
While it doesn't correspond closely to the way real JS engines work, you might be interested in reading Douglas Crockford's article on Top Down Operator Precedence, which includes code for a small working lexer and parser written in the Javascript subset it parses. It's very readable and concise code (with good accompanying explanations) which at least gives you an outline of how a real implementation might work.
A more common technique than Crockford's "Top Down Operator Precedence" is recursive descent parsing, which is used in Narcissus, a complete implementation of JS in JS.
maybe esprima will help you to understand how JS parses the grammar. it's online