Finding beginning and end quotations - javascript

I'm starting to write a code syntax highlighter in JavaScript, and I want to highlight text that is in quotes (both "s and 's) in a certain color. It must not be thrown off when one type of quote appears inside a pair of the other type, but I'm really not sure where to even start. I'm not sure how I should go about finding the quotes and then finding the correct end quote.

Unless you're doing this for the challenge, have a look at Google Code Prettify.
For your problem, you could read up on parsing (and lexers) at Wikipedia. It's a huge topic and you'll find that you'll come upon bigger problems than parsing strings.
To start, you could use regular expressions (although they rarely have the accuracy of a true lexer.) A typical regular expression for matching a string is:
/"(?:[^"\\]+|\\.)*"/
And then the same for ' instead of ".
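As a rough sketch of the regex route, the two patterns can be joined with | so that whichever quote opens first decides which closing quote is looked for. The helper name findStrings is made up for this example, and the inner + has been dropped from the pattern above as a precaution against heavy backtracking on unterminated strings (an assumption on my part, not something the original pattern requires).

```javascript
// One alternative per quote type; scanning left to right, the first
// opening quote encountered decides which alternative is used.
const STRING_LITERAL = /"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'/g;

// Hypothetical helper: returns every quoted string with its position.
function findStrings(source) {
  const found = [];
  for (const match of source.matchAll(STRING_LITERAL)) {
    found.push({
      text: match[0],
      start: match.index,
      end: match.index + match[0].length, // exclusive
    });
  }
  return found;
}

const sample = 'var a = "it\'s here", b = \'say "hi"\';';
console.log(findStrings(sample).map((s) => s.text));
// prints [ '"it\'s here"', '\'say "hi"\'' ]
```

The apostrophe inside the first string and the double quotes inside the second are left alone because each is consumed as part of a string that has already been opened.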
Otherwise, for a character-by-character parser, you would set some kind of state marking that you're in a string once you hit ". Then, when you hit a " that is not preceded by an odd number of backslashes (an even number of backslashes would escape each other), you exit the string.
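A minimal sketch of that state machine might look like the following. The function name scanStrings is hypothetical. Instead of counting backslashes backwards, it skips the character after each backslash while inside a string, which should give the same result: a quote only closes the string if it was not consumed as an escaped character.

```javascript
// Returns [start, end) index pairs for every quoted string in `source`.
function scanStrings(source) {
  const spans = [];
  let quote = null; // the quote character that opened the current string, or null
  let start = -1;   // index of that opening quote

  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (quote === null) {
      // Outside a string: either kind of quote opens one.
      if (ch === '"' || ch === "'") {
        quote = ch;
        start = i;
      }
    } else if (ch === "\\") {
      i++; // inside a string: skip the escaped character, whatever it is
    } else if (ch === quote) {
      spans.push([start, i + 1]); // only the matching quote type closes it
      quote = null;
    }
  }
  return spans;
}

const code = 'x = "a\\"b" + \'c"d\';';
console.log(scanStrings(code).map(([s, e]) => code.slice(s, e)));
// prints [ '"a\\"b"', '\'c"d\'' ]
```

An unterminated string simply produces no span here; a real highlighter would probably want to color it up to the end of the line instead.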

You can find quotes using regular expressions but if you're writing a syntax highlighter then the only reliable way is to step through the code, character by character, and decide what to do from there.
E.g. of a Regex
/("|')((?:\\\1|.)+?)\1/g
(matches "this" and 'this' and "thi\"s")
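To try that pattern out, something like this should do. The helper name quotedParts is invented for the example. Note that because of the +? the pattern needs at least one character between the quotes, so an empty string such as "" is not matched as written; changing +? to *? is one way around that.

```javascript
// \1 refers back to whichever quote character opened the string.
const QUOTED = /("|')((?:\\\1|.)+?)\1/g;

// Hypothetical helper: returns each quoted string, quotes included.
// match[1] is the opening quote, match[2] the content between the quotes.
function quotedParts(text) {
  return Array.from(text.matchAll(QUOTED), (match) => match[0]);
}

const text = 'say "this" and \'this\' and "thi\\"s"';
console.log(quotedParts(text));
// prints [ '"this"', "'this'", '"thi\\"s"' ]
```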

Use a stack: when you find an unmatched quote, push it; when you find its match, pop it.

I did it with a single regular expression in PHP using backwards references. JS does not support it, and I think that's what you need if you really want to detect undefined backslashes.

Related

Removing javascript reserved words from string

So, I have a string which is actually a javascript script. I need to remove first reserved javascript word from it, but only if it actually has the meaning of the reserved word. That means:
it can't be inside string literals ("" or '', like "return that thing to me");
it has to be preceded and followed by whitespace, linebreak and such;
any other cases where it's not a reserved word.
I have the hard time trying to write RegExp for this, as there always seems to be at least one case it doesn't work as intended.
Any help, please?
You have to use a more powerful method than regex - such as syntax analyzer to break your string into an abstract syntax tree. Then look for any keyword you want.
Try using the parser API of SpiderMonkey.
Or a library like UglifyJS or Esprima

Escape characters in jQuery variables not working

I'm having issues with escaping characters (namely period) found in variables when using selectors in jQuery. I was going to type this all out, but it was just easier taking a screenshot of my console window in Chrome.
It looks like the variables and the clear text versions match up. I expect $('#'+escName) to return the div, just like $('#jeffrey\\.lamb') returns a div. It does not. Why?
You have to think in terms of the individual parsers that will be examining your string values. The very first one, of course, is the JavaScript parser itself. Backslash characters have a meaning in the string grammar, so if you want a single backslash in a string it needs to be doubled.
After the string is parsed from the source code into an internal string value, the next thing that'll pay attention to its contents (in this case) is the CSS selector evaluator (either Sizzle or the native querySelector code; not sure which in the case of strings with escapes like this). That code only needs one backslash to quote the . in order that it not be interpreted as introducing a class name match.
Thus, escName = "jeffrey\\.lamb"; is all you need in this case.
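To see the two parsing layers separately, here is a small sketch that runs without jQuery. The helper escapeDots is a made-up name; it only handles periods, which is an assumption based on the question.

```javascript
// Layer 1: the JavaScript parser turns "\\." into one backslash plus a period.
const escName = "jeffrey\\.lamb";
console.log(escName);        // prints jeffrey\.lamb
console.log(escName.length); // prints 13

// Hypothetical helper: put one real backslash in front of every period,
// which is all the selector engine (layer 2) needs.
function escapeDots(id) {
  return id.replace(/\./g, "\\.");
}

const selector = "#" + escapeDots("jeffrey.lamb");
console.log(selector === "#" + escName); // prints true
```

The resulting selector string can then be passed to $(selector). If I remember correctly, jQuery 3.0 and later also ship $.escapeSelector, and browsers provide CSS.escape, for the general case.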

How To Create This RegExp

I am looking to find this in a string: XXXX-XXX-XXX Where the X is any number.
I need to find this in a string using JavaScript so bonus points to those who can provide me the JavaScript too. I tried to create a regex and came out with this: ^[0-9]{4}\-[0-9]{3}\-[0-9]{3}$
Also, I would love to know of any cheat sheets or programs you guys use to create your regular expressions.
I suppose this is what you want:
\d{4}-\d{3}-\d{3}
In doubt? Google for "RegEx Testers".
With your attempt:
^[0-9]{4}\-[0-9]{3}\-[0-9]{3}$
Since the - is not a metacharacter outside a character class, there is no need to escape it. In a JavaScript regex literal the escaped \- still matches a plain hyphen, so the backslashes are merely redundant.
Also, you've anchored the match at the beginning and end of the string -- this will match only strings that consist only of your number. (Well, assuming the rest were correct.)
I know most people like the {3} style of counting, but when the thing being matched is a single digit, I find this more legible:
\d\d\d\d-\d\d\d-\d\d\d
Obviously if you wanted to extend this to matching hexadecimal digits, extending this one would be horrible, but I think this is far more legible than alternatives:
\d{4}-\d{3}-\d{3}
[[:digit:]]{4}-[[:digit:]]{3}-[[:digit:]]{3}
[0-9]{4}-[0-9]{3}-[0-9]{3}
Go with whatever is easiest for you to read.
I tend to use the perlre(1) manpage as my main reference, knowing full well that it is far more featureful than many regexp engines. I'm prepared to handle the differences considering how conveniently available the perlre manpage is on most systems.
var result = (/\d{4}\-\d{3}\-\d{3}/).exec(myString);
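For completeness, a slightly fuller sketch of both variants: one that finds the number anywhere in a string and one that requires the whole string to be the number. The names PATTERN, EXACT and findNumber are just for illustration.

```javascript
const PATTERN = /\d{4}-\d{3}-\d{3}/;   // matches anywhere in the string
const EXACT = /^\d{4}-\d{3}-\d{3}$/;   // the whole string must be the number

// Hypothetical helper: returns the first matching number, or null.
function findNumber(text) {
  const match = PATTERN.exec(text);
  return match ? match[0] : null;
}

console.log(findNumber("Order 1234-567-890 shipped")); // prints 1234-567-890
console.log(findNumber("no number here"));             // prints null
console.log(EXACT.test("1234-567-890"));               // prints true
console.log(EXACT.test("Order 1234-567-890"));         // prints false
```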

Extracting data from JavaScript (Python Scraper)

I'm currently using a fusion of urllib2, pyquery, and json to scrape a site, and now I find that I need to extract some data from JavaScript. One thought would be to use a JavaScript engine (like V8), but that seems like overkill for what I need. I would use regular expressions, but the expression for this seems way too complex.
JavaScript:
(function(){DOM.appendContent(this, HTML("<html>"));;})
I need to extract the <html>, but I'm not entirely sure how to do so. The <html> itself can contain basically every character under the sun, so [^"] won't work.
Any thoughts?
Why regex? Can't you just use two substrings as you know how many characters you want to trim off the beginning and end?
string[42:-7]
As well as being quicker than a regex, it then doesn't matter if quotes inside <html> are escaped or not.
If every occurrence of " inside the HTML code is escaped as \" (it is a JavaScript string after all), you could use
HTML\("((?:\\"|.)*?)"\)
to get the parameter to HTML into the first capturing group.
Note that this regex is not yet escaped for use inside a JavaScript string.
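Here is a quick sketch comparing the two suggestions. It is written in JavaScript to stay consistent with the rest of this page, although the question itself is about Python; the pattern should carry over to Python's re module largely unchanged. The sample script and the variable names are made up for the example.

```javascript
// A sample wrapper whose HTML payload contains escaped double quotes.
const script =
  '(function(){DOM.appendContent(this, HTML("<p class=\\"x\\">hi</p>"));;})';

// Option 1: fixed offsets, 42 characters of prefix and 7 of suffix.
const bySlice = script.slice(42, -7);

// Option 2: the regex, written as a literal so no extra string escaping is needed.
const HTML_ARG = /HTML\("((?:\\"|.)*?)"\)/;
const match = HTML_ARG.exec(script);
const byRegex = match ? match[1] : null;

console.log(bySlice);             // prints <p class=\"x\">hi</p>
console.log(bySlice === byRegex); // prints true
```

Both results still contain the backslashes from the JavaScript source, so the payload needs unescaping afterwards. Also note that . does not match a newline by default, which matters if the payload spans several lines.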

Detecting regular expression in content during parse

I am writing a simple parser for C. I was just running it with some other language files (for fun - to see the extent of C-likeness and laziness - don't wanna really write separate parsers for each language if I can avoid it).
However the parser seems to break down for JavaScript if the code being parsed contains regular expressions...
Case 1:
For example, while parsing the JavaScript code snippet,
var phone="(304)434-5454"
phone=phone.replace(/[\(\)-]/g, "")
//Returns "3044345454" (removes "(", ")", and "-")
The '(', '[' etc get matched as starters of new scopes, which may never be closed.
Case 2:
And, for the Perl code snippet,
# Replace backslashes with two forward slashes
# Any character can be used to delimit the regex
$FILE_PATH =~ s#\\#//#g;
The // gets matched as a comment...
How can I detect a regular expression within the content text of a "C-like" program-file?
It is impossible.
Take this, for example:
m =~ s/a/b/g;
Could be both C or Perl.
One minute's thinking reveals that the number of Perl-style regular expressions that are also syntactically valid C expressions is infinite.
Another example:
m+foo *bar[index]+i
The best you can get is some extremely vague guesswork. The difficulty stems from the fact that a regular expression is a sequence of characters that can be virtually anything.
You'd better clean up your error handling. A parser should not "break down" if some parentheses are missing or superfluous ones are seen.
Well, your token grammar has to take regex syntax into consideration. Classic parsers consist of two layers: something to tokenize the input, and then something to parse the grammar. The syntax of the language is generally expressed in terms of tokens, so the job of the tokenizer is to feed a stream of those to the parser. Generally the tokens themselves are regular expressions, or more properly a great big regex of things ORed together. At each character position on the input, one of the token regexes must match or else the character is invalid.
Now, there are other parsing techniques that sort-of squish together the tokenization with the parsing. ("PEG" parsers for example)
edit — another note: you can't parse languages like Javascript or Perl with just a regular expression.
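To make the tokenizer idea a bit more concrete, here is a rough sketch for a JavaScript-like input. Everything in it (the token kinds, the names TOKEN, REGEX_LITERAL and tokenize) is made up for illustration. The rule for deciding when a / starts a regex literal is only a heuristic: it looks at the previous significant token and assumes a regex can follow punctuation other than a closing bracket, so it will get cases like return /x/ wrong.

```javascript
// One big regex of token kinds ORed together, tried at the current position (sticky flag).
const TOKEN = /(?<space>\s+)|(?<comment>\/\/[^\n]*)|(?<string>"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')|(?<number>\d+)|(?<name>[A-Za-z_$][\w$]*)|(?<punct>[^\s\w])/y;

// A regex literal: a body of plain chars, escapes, or [...] classes, then flags.
const REGEX_LITERAL = /\/(?:[^\/\\\n\[]|\\.|\[(?:[^\]\\\n]|\\.)*\])+\/[a-z]*/y;

function tokenize(source) {
  const tokens = [];
  let pos = 0;
  let previous = null; // last token that was not whitespace or a comment

  while (pos < source.length) {
    // Heuristic: a value (and so possibly a regex) is expected at the start
    // of input or after punctuation that is not a closing bracket.
    const valueExpected =
      previous === null ||
      (previous.kind === "punct" && !")]}".includes(previous.text));

    let kind = null;
    let text = null;

    if (valueExpected && source[pos] === "/" && source[pos + 1] !== "/") {
      REGEX_LITERAL.lastIndex = pos;
      const re = REGEX_LITERAL.exec(source);
      if (re) {
        kind = "regex";
        text = re[0];
      }
    }

    if (kind === null) {
      TOKEN.lastIndex = pos;
      const m = TOKEN.exec(source);
      // Defensive: the alternatives above are meant to cover every character.
      if (!m) throw new SyntaxError("Unexpected character at " + pos);
      text = m[0];
      kind = Object.keys(m.groups).find((k) => m.groups[k] !== undefined);
    }

    pos += text.length;
    if (kind !== "space" && kind !== "comment") {
      previous = { kind, text };
      tokens.push(previous);
    }
  }
  return tokens;
}

const line = 'phone = phone.replace(/[\\(\\)-]/g, "")';
console.log(tokenize(line).map((t) => t.kind + ":" + t.text).join("  "));
```

With this, the ( and [ inside the regex literal never reach the parser as scope openers, because the whole literal arrives as a single token. For input like a / b / c the slashes come out as plain punctuation, since each follows a name rather than punctuation.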
