So, I have a string that is actually a JavaScript script. I need to remove the first reserved JavaScript word from it, but only if it actually functions as a reserved word. That means:
it can't be inside string literals ("" or '', like "return that thing to me");
it has to be preceded and followed by whitespace, a line break, or similar;
it must not fall under any other case where it isn't a reserved word.
I'm having a hard time writing a RegExp for this, as there always seems to be at least one case where it doesn't work as intended.
Any help, please?
You have to use a more powerful method than regex, such as a syntax analyzer that breaks your string into an abstract syntax tree. Then look for any keyword you want.
Try using the parser API of SpiderMonkey.
Or a library like UglifyJS or Esprima.
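If pulling in a full parser is not an option, the idea can be roughed out by hand. The following is only a sketch, not a replacement for a real parser: it skips string literals and comments and removes the first word from a small sample list. The function name removeFirstReservedWord and the RESERVED list are made up for illustration, and template literals and regex literals are deliberately not handled.

```javascript
// Assumption: only a small sample of reserved words, enough for illustration.
const RESERVED = new Set(["return", "var", "function", "if", "else", "for", "while", "new", "typeof"]);

// Hypothetical helper: removes the first reserved word that appears as real code.
function removeFirstReservedWord(source) {
  let i = 0;
  while (i < source.length) {
    const ch = source[i];
    if (ch === '"' || ch === "'") {
      // Skip a string literal, stepping over backslash escapes.
      i++;
      while (i < source.length && source[i] !== ch) {
        i += source[i] === "\\" ? 2 : 1;
      }
      i++;
    } else if (source.startsWith("//", i)) {
      // Skip a line comment up to (not including) the line break.
      const end = source.indexOf("\n", i);
      i = end === -1 ? source.length : end;
    } else if (source.startsWith("/*", i)) {
      // Skip a block comment.
      const end = source.indexOf("*/", i + 2);
      i = end === -1 ? source.length : end + 2;
    } else if (/[\w$]/.test(ch)) {
      // Read a whole identifier-like word (or number).
      let j = i + 1;
      while (j < source.length && /[\w$]/.test(source[j])) j++;
      const word = source.slice(i, j);
      // A word right after a dot is a property name (obj.return), not a keyword.
      const isProperty = /\.\s*$/.test(source.slice(0, i));
      if (RESERVED.has(word) && !isProperty) {
        return source.slice(0, i) + source.slice(j);
      }
      i = j;
    } else {
      i++;
    }
  }
  return source;
}

console.log(removeFirstReservedWord('s = "return it"; // return\nreturn s;'));
```

Only the return on the second line is removed here; the ones inside the string and the comment are left alone.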
I'm trying to figure out a regex pattern to parse a CSS file, looking for any instances of $, unless it's part of an attribute selector ($=, like in [attr$=foo]).
In other words, I'm looking for a way to find a string unless it's followed by another string. Not sure how to do that.
The script will run on Node.js v8.9.1 without flags, so I don't think I have lookbehind.
Thanks.
You can try this:
str.match(/\$(?!=)/g);
The negative lookahead (?!=) makes the pattern match every "$" that is not immediately followed by "=". Unlike lookbehind, lookahead is available in Node.js 8.
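If the positions of the matches are needed rather than the matched text, a loop over exec with a negative lookahead is one way to get them. This is a minimal sketch; findBareDollars and the sample CSS are invented for illustration, and it makes no attempt to ignore "$" inside strings or comments.

```javascript
// Hypothetical helper: returns the index of every "$" that is not part of "$=".
function findBareDollars(css) {
  const pattern = /\$(?!=)/g; // negative lookahead: "$" not followed by "="
  const positions = [];
  let match;
  while ((match = pattern.exec(css)) !== null) {
    positions.push(match.index);
  }
  return positions;
}

const css = 'a[href$=".pdf"] { color: $brand; width: $w }';
console.log(findBareDollars(css)); // [ 25, 40 ]
```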
I'm having issues with escaping characters (namely period) found in variables when using selectors in jQuery. I was going to type this all out, but it was just easier taking a screenshot of my console window in Chrome.
It looks like the variables and the clear text versions match up. I expect $('#'+escName) to return the div, just like $('#jeffrey\\.lamb') returns a div. It does not. Why?
You have to think in terms of the individual parsers that will be examining your string values. The very first one, of course, is the JavaScript parser itself. Backslash characters have a meaning in the string grammar, so if you want a single backslash in a string it needs to be doubled.
After the string is parsed from the source code into an internal string value, the next thing that'll pay attention to its contents (in this case) is the CSS selector evaluator (either Sizzle or the native querySelector code; not sure which in the case of strings with escapes like this). That code only needs one backslash to quote the . in order that it not be interpreted as introducing a class name match.
Thus, escName = "jeffrey\\.lamb"; is all you need in this case.
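To avoid writing the escapes by hand, the escaping can be done in code. The helper below is a sketch with a made-up name (escapeForSelector); jQuery 3.0 and later also ship $.escapeSelector, and browsers provide CSS.escape, either of which is probably the better choice where available.

```javascript
// Hypothetical helper: put one backslash in front of each character that is special in selectors.
function escapeForSelector(id) {
  return id.replace(/[!"#$%&'()*+,.\/:;<=>?@[\\\]^`{|}~]/g, "\\$&");
}

const escName = escapeForSelector("jeffrey.lamb");
console.log(escName);        // jeffrey\.lamb  (a single backslash in the actual value)
console.log(escName.length); // 13
// With jQuery loaded, $("#" + escName) would then look up the element with id "jeffrey.lamb".
```

Printing the value (rather than looking at its quoted source form) shows what the selector engine will actually receive.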
In a page of text, I would like to examine each word. What is the best way to read one word at a time? It is easy to find words that are surrounded by spaces, but once you get into parsing out words in text it can get complicated.
Instead of defining my own way of parsing the words from text, is there something already built that parses out the words, using regular expressions or other methods?
Some examples of words in text:
word word. word(word) word's word word' "word" .word. 'word' sub-word
You can use:
text = "word word. word(word) word's word word' \"word\" .word. 'word' sub-word";
words = text.match(/[-\w]+/g);
This will give you an array with all your words.
In regular expressions, \w means any character that is either a-z, A-Z, 0-9 or _. [-\w] means any character that is a \w or a -. [-\w]+ means any of these characters appearing 1 or more times.
If you would like to define a word as being something more than the above expression, add the other characters that compose your words inside the [-\w] character class. For example, if you'd like words to also contain ( and ), make the character class be [-\w()].
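As an illustration of adjusting the definition of a word, here is the original pattern next to a variant that keeps apostrophes and hyphens only when they sit between word characters. The second pattern is just one possible choice, not the only sensible one.

```javascript
const text = "word word. word(word) word's word word' \"word\" .word. 'word' sub-word";

// Original pattern: "word's" is split into "word" and "s".
const simple = text.match(/[-\w]+/g);

// Variant: an apostrophe or hyphen counts only when word characters follow it.
const withApostrophes = text.match(/\w+(?:[-']\w+)*/g);

console.log(simple.length, simple.includes("s"));                // 12 true
console.log(withApostrophes.length, withApostrophes.includes("word's")); // 11 true
```

Both results still contain "sub-word" as a single entry; quotes and parentheses around a word are dropped in both cases.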
For an introduction to regular expressions, check out the great tutorial at regular-expressions.info.
What you're talking about is tokenisation. It's non-trivial to say the least, and a subject of intense research at the major search engines. There are a number of open source tokenisation libraries in various server-side languages (e.g. see the Stanford NLP and Lucene projects), but as far as I am aware there's nothing that would even come close to these in JavaScript. You may have to roll your own :) or perhaps do the processing server-side, and load the results via AJAX?
I support Richard's answer here, but to add to it: one of the easiest ways of building a tokeniser (imho) is ANTLR, and some maniac has built a JavaScript target for it, thus allowing you to run and execute a grammar in the web browser (look under the 'runtime libraries' section here).
I won't pretend that there's not a learning curve there though.
Take a look at regular expressions - you can define almost any parsing algorithm you want.
I am writing a simple parser for C. I was just running it with some other language files (for fun - to see the extent of C-likeness and laziness - don't wanna really write separate parsers for each language if I can avoid it).
However the parser seems to break down for JavaScript if the code being parsed contains regular expressions...
Case 1:
For example, while parsing the JavaScript code snippet,
var phone="(304)434-5454"
phone=phone.replace(/[\(\)-]/g, "")
//Returns "3044345454" (removes "(", ")", and "-")
The '(', '[' etc get matched as starters of new scopes, which may never be closed.
Case 2:
And, for the Perl code snippet,
# Replace backslashes with two forward slashes
# Any character can be used to delimit the regex
$FILE_PATH =~ s#\\#//#g;
The // gets matched as a comment...
How can I detect a regular expression within the content text of a "C-like" program-file?
It is impossible.
Take this, for example:
m =~ s/a/b/g;
This could be either C or Perl.
One minute's thinking reveals that the number of Perl-style regular expressions that are also syntactically valid C expressions is infinite.
Another example:
m+foo *bar[index]+i
The best you can get is some extremely vague guesswork. The difficulty stems from the fact that a regular expression is a sequence of characters that can be virtually anything.
You'd better clean up your error handling. A parser should not "break down" if some parentheses are missing or superfluous ones are seen.
Well, your token grammar has to take regex syntax into consideration. Classic parsers consist of two layers: something to tokenize the input, and then something to parse the grammar. The syntax of the language is generally expressed in terms of tokens, so the job of the tokenizer is to feed a stream of those to the parser. Generally the tokens themselves are defined by regular expressions, or more properly by one great big regex of alternatives ORed together. At each character position in the input, one of the token regexes must match or else the character is invalid.
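A rough sketch of such a tokenizer layer for JavaScript-like input follows. Everything named here (TOKEN_TYPES, tokenize, the token type names) is invented for illustration. It tries each token regex at the current position using the sticky flag, and it uses a common heuristic for the regex problem: a "/" may start a regex literal only when the previous significant token could not be the end of a value. That heuristic is known to be imperfect (for example, it misreads a regex directly after the keyword return).

```javascript
// Token patterns, tried in order at each position.
const TOKEN_TYPES = [
  ["whitespace", /\s+/y],
  ["comment", /\/\/[^\n]*|\/\*[\s\S]*?\*\//y],
  ["string", /"(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*'/y],
  ["number", /\d+(?:\.\d+)?/y],
  ["identifier", /[A-Za-z_$][\w$]*/y],
  ["regex", /\/(?:[^\/\\\n\[]|\\.|\[(?:[^\]\\\n]|\\.)*\])+\/[a-z]*/y],
  ["punctuator", /[{}()\[\];,.<>+\-*\/%=!&|^~?:]/y],
];

function tokenize(source) {
  const tokens = [];
  let pos = 0;
  let previous = null; // last token that is not whitespace or a comment
  while (pos < source.length) {
    let matched = false;
    for (const [type, pattern] of TOKEN_TYPES) {
      // Heuristic: after something that ends a value, "/" is division, not a regex.
      if (type === "regex" && previous !== null &&
          (["identifier", "number", "string", "regex"].includes(previous.type) ||
           previous.text === ")" || previous.text === "]")) {
        continue;
      }
      pattern.lastIndex = pos; // sticky flag: match exactly at pos or fail
      const m = pattern.exec(source);
      if (m === null) continue;
      const token = { type, text: m[0] };
      tokens.push(token);
      if (type !== "whitespace" && type !== "comment") previous = token;
      pos += m[0].length;
      matched = true;
      break;
    }
    // No token pattern matched here, so the character is invalid.
    if (!matched) throw new SyntaxError("Unexpected character at " + pos);
  }
  return tokens;
}

const sample = String.raw`phone=phone.replace(/[\(\)-]/g, "")`;
console.log(tokenize(sample).filter(t => t.type === "regex").map(t => t.text));
```

With the regex literal consumed as a single token, the "(" and "[" inside it never reach the layer that tracks scopes.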
Now, there are other parsing techniques that sort-of squish together the tokenization with the parsing. ("PEG" parsers for example)
edit — another note: you can't parse languages like Javascript or Perl with just a regular expression.
I'm starting to write a code syntax highlighter in JavaScript, and I want to highlight text that is in quotes (both "s and 's) in a certain color. It also needs to cope with one type of quote appearing in the middle of a pair of the other type, but I'm really not sure where to even start. I'm not sure how I should go about finding the quotes and then finding the correct end quote.
Unless you're doing this for the challenge, have a look at Google Code Prettify.
For your problem, you could read up on parsing (and lexers) at Wikipedia. It's a huge topic and you'll find that you'll come upon bigger problems than parsing strings.
To start, you could use regular expressions (although they rarely have the accuracy of a true lexer.) A typical regular expression for matching a string is:
/"(?:[^"\\]+|\\.)*"/
And then the same for ' instead of ".
Otherwise, for a character-by-character parser, you would set some kind of state saying that you're in a string once you hit ", then when you hit a " that is not preceded by an odd number of backslashes (an even number of backslashes would escape each other), you exit the string.
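The state-based approach might look roughly like this. It is a sketch with a made-up function name (findStringRanges); instead of counting backslashes backwards, it skips the character after each backslash, which has the same effect. Comments and regex literals in the scanned code are not considered.

```javascript
// Hypothetical helper: returns [start, end) index pairs of quoted strings.
function findStringRanges(code) {
  const ranges = [];
  let quote = null; // the quote character that opened the current string, or null
  let start = -1;
  for (let i = 0; i < code.length; i++) {
    const ch = code[i];
    if (quote === null) {
      // Outside a string: either kind of quote opens one.
      if (ch === '"' || ch === "'") {
        quote = ch;
        start = i;
      }
    } else if (ch === "\\") {
      i++; // inside a string: skip the escaped character, whatever it is
    } else if (ch === quote) {
      // Only the same kind of quote that opened the string can close it.
      ranges.push([start, i + 1]);
      quote = null;
    }
  }
  return ranges;
}

const code = String.raw`a = "it's" + 'say \'hi\' "x"';`;
console.log(findStringRanges(code).map(([s, e]) => code.slice(s, e)));
```

A highlighter could wrap each returned range in a colored element; an unterminated string simply produces no range in this sketch.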
You can find quotes using regular expressions but if you're writing a syntax highlighter then the only reliable way is to step through the code, character by character, and decide what to do from there.
E.g. of a Regex
/("|')((?:\\\1|.)+?)\1/g
(matches "this" and 'this' and "thi\"s")
Use a stack: if an unmatched quote is found, push it; if its match is found, pop it.
I did it with a single regular expression in PHP using lookbehind (backward-looking assertions). JavaScript has historically not supported that, and I think that's what you need if you really want to detect unescaped backslashes.