I'm trying to parse apart the different filter strings for a SQL Server WHERE clause, and I thought that regular expressions would be the way to go. Fortunately, the filter will always be flat; there will be no "sub-filters". Each filter statement will always be surrounded by parentheses. Here's an example of a string I'd like to parse:
([IsActive]=(1)) AND ([NoticeDate] >= GETDATE()) AND (NoticeDate <= DATEADD(day, 90, GETDATE()))
The result I'd like is an array with the following items:
[0] = ([IsActive]=(1)) AND
[1] = ([NoticeDate] >= GETDATE()) AND
[2] = (NoticeDate <= DATEADD(day, 90, GETDATE()))
The closest I've come is the following regex:
/\(.+?\) (and|or)/i
but this only returns
[0] = ([IsActive]=(1)) AND
[1] = AND
So basically I'd like to return anything surrounded by parentheses, followed by a space, followed by the string "and" or "or". The last statement will not be followed by and/or, though I could concatenate an " and" string if that would make it easier. This is being done in classic ASP with JScript. I've pretty much exhausted my (weak) regular expression abilities, so any help would be greatly appreciated. Thanks in advance.
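Try this regex (JScript does not accept the inline (?i) modifier, so case-insensitivity is set with the i flag, and g returns every match):
/\(.+?\)(?:\s+(?:and|or)|$)/gi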
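A minimal sketch of how this pattern could be applied, written as a JavaScript literal with the i and g flags. The variable names are only for illustration:

```javascript
var filter = "([IsActive]=(1)) AND ([NoticeDate] >= GETDATE()) AND " +
             "(NoticeDate <= DATEADD(day, 90, GETDATE()))";

// Lazily match from "(" up to a ")" that is followed by AND/OR or by the end of the string.
var clausePattern = /\(.+?\)(?:\s+(?:and|or)|$)/gi;

// With the g flag, match() returns every match (or null when there is none).
var clauses = filter.match(clausePattern) || [];

for (var i = 0; i < clauses.length; i++) {
  console.log("[" + i + "] = " + clauses[i]);
}
```

This relies on the filter being flat, as described in the question. A string literal that itself contains ") and" would still confuse it.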
If your WHERE clauses are free-form, /\(.+?\) (and|or)/i will spuriously match inside tokens like strings and comments.
Consider
SELECT resting_place FROM Nation
WHERE date_founded LIKE "Four score and seven years ago"
AND conceived IN (liberty)
ORDER BY the_people
The and in "Four score and seven years" is not a SQL keyword, but a single regular expression approach is not going to be able to distinguish keyword uses from non-keyword uses.
A more robust way would probably be to write a proper parser. This should be doable with a classic LL(1) parser. There are plenty of parser generators around. In fact, one can easily consider regular expressions to be a kind of parser generator, too.
Anyway, for this kind of task, a proper LL(1) parser is likely the better idea. It would also be able to support nested terms, if you want.
https://en.wikipedia.org/wiki/LL_parser
For example, given the string "2009/11/12" I want to get the regex ("\d{4}/\d{2}/\d{2}"), so I'll be able to match "2001/01/02" too.
Is there something that does that? Something similar? Any ideas as to how to do it?
There is text2re, a free web-based "regex by example" generator.
I don't think this is available in source code, though. I dare to say there is no automatic regex generator that gets it right without user intervention, since this would require the machine knowing what you want.
Note that text2re uses a template-based, modularized and very generalized approach to regular expression generation. The expressions it generates work, but they are much more complex than the equivalent hand-crafted expression. It is not a good tool to learn regular expressions because it does a pretty lousy job at setting examples.
For instance, the string "2009/11/12" would be recognized as a yyyymmdd pattern, which is helpful. The tool transforms it into this 125 character monster:
((?:(?:[1]{1}\d{1}\d{1}\d{1})|(?:[2]{1}\d{3}))[-:\/.](?:[0]?[1-9]|[1][012])[-:\/.](?:(?:[0-2]?\d{1})|(?:[3][01]{1})))(?![\d])
The hand-made equivalent would take up merely two fifths of that (50 characters):
([12]\d{3})[-:/.](0?\d|1[0-2])[-:/.]([0-2]?\d|3[01])\b
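As a quick sanity check, the hand-written pattern can be tried against a few inputs. This is only a sketch in JavaScript; the slash inside the character classes is escaped so the pattern can be written as a regex literal:

```javascript
// Hand-written date pattern from above: year, separator, month, separator, day.
var datePattern = /([12]\d{3})[-:\/.](0?\d|1[0-2])[-:\/.]([0-2]?\d|3[01])\b/;

var samples = ["2009/11/12", "2001/01/02", "2009.11.12", "2312/45/67"];

for (var i = 0; i < samples.length; i++) {
  // test() reports whether the pattern matches anywhere in the sample.
  console.log(samples[i] + " -> " + datePattern.test(samples[i]));
}
```

Note that the pattern accepts any of the four separators, so "2009.11.12" is accepted alongside "2009/11/12".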
It's not possible to write a general solution for your problem. The trouble is that any generator probably wouldn't know what you want to check for, e.g. should "2312/45/67" be allowed too? What about "2009.11.12"?
What you could do is write such a generator yourself that is suited for your exact problem, but a general solution won't be possible.
I've tried a very naive approach:
import java.util.regex.Pattern;

class RegexpGenerator {

    public static Pattern generateRegexp(String prototype) {
        return Pattern.compile(generateRegexpFrom(prototype));
    }

    private static String generateRegexpFrom(String prototype) {
        StringBuilder stringBuilder = new StringBuilder();
        for (int i = 0; i < prototype.length(); i++) {
            char c = prototype.charAt(i);
            if (Character.isDigit(c)) {
                stringBuilder.append("\\d");
            } else if (Character.isLetter(c)) {
                stringBuilder.append("\\w");
            } else { // fall-through: copied as a literal (not escaped, so "." still matches any character)
                stringBuilder.append(c);
            }
        }
        return stringBuilder.toString();
    }

    private static void test(String prototype) {
        Pattern pattern = generateRegexp(prototype);
        System.out.println(String.format("%s -> %s", prototype, pattern));
        // The generated pattern must at least match the prototype it came from.
        if (!pattern.matcher(prototype).matches()) {
            throw new AssertionError();
        }
    }

    public static void main(String[] args) {
        String[] prototypes = {
            "2009/11/12",
            "I'm a test",
            "me too!!!",
            "124.323.232.112",
            "ISBN 332212"
        };
        for (String prototype : prototypes) {
            test(prototype);
        }
    }
}
output:
2009/11/12 -> \d\d\d\d/\d\d/\d\d
I'm a test -> \w'\w \w \w\w\w\w
me too!!! -> \w\w \w\w\w!!!
124.323.232.112 -> \d\d\d.\d\d\d.\d\d\d.\d\d\d
ISBN 332212 -> \w\w\w\w \d\d\d\d\d\d
As already outlined by others, a general solution to this problem is impossible. This class is applicable only in a few contexts.
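One possible refinement of this naive approach, sketched here in JavaScript rather than Java: collapse runs of the same character class into a counted quantifier and escape literal characters, so that "2009/11/12" yields \d{4}/\d{2}/\d{2}. The function name generateRegexSource and the choice of [A-Za-z] for letters are assumptions made for this sketch:

```javascript
// Build a regex source string from a single example string.
function generateRegexSource(prototype) {
  var runs = []; // consecutive characters of the same class, e.g. { token: "\\d", count: 4 }
  for (var i = 0; i < prototype.length; i++) {
    var c = prototype.charAt(i);
    var token;
    if (/\d/.test(c)) {
      token = "\\d";
    } else if (/[A-Za-z]/.test(c)) {
      token = "[A-Za-z]";
    } else {
      // Literal character: escape regex metacharacters such as "." or "(".
      token = c.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    }
    var last = runs[runs.length - 1];
    if (last && last.token === token) {
      last.count++;
    } else {
      runs.push({ token: token, count: 1 });
    }
  }
  return runs.map(function (run) {
    return run.count > 1 ? run.token + "{" + run.count + "}" : run.token;
  }).join("");
}

var source = generateRegexSource("2009/11/12");
console.log(source);
console.log(new RegExp("^" + source + "$").test("2001/01/02"));
```

The same limitation applies: the generator cannot know whether "2312/45/67" should be accepted, so the result is only as specific as the character classes chosen here.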
Excuse me, but what you all call impossible is clearly an achievable task. It will not be able to give results for ALL examples, and maybe not the best results, but you can give it various hints, and it will make life easy. A few examples will follow.
Also a readable output translating the result would be very useful.
Something like:
"Search for: a word starting with a non-numeric character and ending with the string "ing"."
or: "Search for: text that has bbb in it, followed somewhere by zzz."
or: "Search for: a pattern that looks like "aa/bbbb/cccc", where "/" is a separator, "aa" is two digits, "bbbb" is a word of any length and "cccc" is four digits between 1900 and 2020."
Maybe we could make a "back translator" with an SQL type of language to create regex, instead of creating it in geekish.
Here are a few examples that are doable:
class Hint:
    Properties: HintType, HintString
enum HintType { Separator, ParamDescription, NumberOfParameters }
enum SampleType { FreeText, DateOrTime, Formatted, ... }
public string RegexBySamples( List<string> samples,
                              List<SampleType> sampleTypes,
                              List<Hint> hints,
                              out string GeneralRegExp, out string description,
                              out string generalDescription)...
regex = RegexBySamples( {"11/November/1999", "2/January/2003"},
                        SampleType.DateOrTime,
                        new HintList( HintType.NumberOfParameters, 3 ));
regex = RegexBySamples( "123-aaaaJ-1444",
                        SampleType.Formatted, HintType.Separator, "-" );
A GUI where you mark sample text or enter it, adding to the regex would be possible too.
First you mark a date (the "sample"), and choose if this text is already formatted, or if you are building a format, also what the format type is: free text, formatted text, date, GUID or Choose... from existing formats (which you can store in library).
Let's design a spec for this, and make it open source... Does anybody want to join?
Loreto pretty much does this. It's an open source implementation that uses the longest common substring(s) to generate the regular expressions. It needs multiple examples, of course.
No, you cannot get a regex that matches what you want reliably, since the regex would not contain semantic information about the input (i.e. it would need to know it's generating a regex for dates). If the issue is with dates only I would recommend trying multiple regular expressions and see if one of them matches all.
I'm not sure if this is possible, at least not without many sample strings and some learning algorithm.
There are many regexes that would match, and it's not possible for a simple algorithm to pick the 'right' one. You'd need to give it some delimiters or other things to look for, so you might as well just write the regex yourself.
Sounds like a machine learning problem. You'll need more than one example on hand (many more) and an indication of whether or not each example is considered a match.
I don't remember the name, but if my theory of computation cells serve me right, it's impossible in theory :)
I haven't found anything that does it, but since the problem domain is relatively small (you'd be surprised how many people use the weirdest date formats), I've been able to write some kind of a "date regular expression generator".
Once I'm satisfied with the unit tests, I'll publish it, just in case someone ever needs something of the kind.
Thanks to everyone who answered (the guy with the (.*) excluded - jokes are great, but this one was sssssssssoooo lame :) )
In addition to feeding the learning algorithm examples of "good" input, you could feed it "bad" input so it would know what not to look for. No letters in a phone number, for example.
I have a large valid JavaScript file (utf-8), from which I need to extract all text strings automatically.
For simplicity, the file doesn't contain any comment blocks in it, only valid ES6 JavaScript code.
Once I find an occurrence of ' or " or `, I'm supposed to scan for the end of the text block, which is where I got stuck, given all the possible variations, like "'", '"', "\'", '\"', '", `\``, etc.
Is there a known and/or reusable algorithm for detecting the end of a valid ES6 JavaScript text block?
UPDATE-1: My JavaScript file isn't just large; I also have to process it as a stream, in chunks, so Regex is absolutely not usable. I didn't want to complicate my question by mentioning joined chunks of code. I will figure that out myself if I have an algorithm that can work for a single piece of code that's in memory.
UPDATE-2: I got this working initially, thanks to the advice given here, but then I got stuck again because of Regular Expressions.
Examples of Regular Expressions that break any of the text detection techniques suggested so far:
/'/
/"/
/\`/
Having studied the matter more closely, by reading this: How does JavaScript detect regular expressions?, I'm afraid that detecting regular expressions in JavaScript is a whole new ball game, worth a separate question, or else it gets too complicated. But I would appreciate it very much if somebody could point me in the right direction with this issue...
UPDATE-3: After much research I found, with regret, that I cannot come up with an algorithm that would work in my case, because the presence of Regular Expressions makes the task far more complicated than initially thought. According to the following: When parsing Javascript, what determines the meaning of a slash?, determining the beginning and end of regular expressions in JavaScript is one of the most complex and convoluted tasks. And without it we cannot figure out whether the symbols ', " and ` are opening a text block or are inside a regular expression.
The only way to parse JavaScript is with a JavaScript parser. Even if you were able to use regular expressions, at the end of the day they are not powerful enough to do what you are trying to do here.
You could either use one of several existing parsers, that are very easy to use, or you could write your own, simplified to focus on the string extraction problem. I hardly imagine you want to write your own parser, even a simplified one. You will spend much more time writing it and maintaining it than you might think.
For instance, an existing parser will handle something like the following without breaking a sweat.
`foo${"bar"+`baz`}`
The obvious candidates for parsers to use are esprima and babel.
By the way, what are you planning to do with these strings once you extract them?
If you only need an approximate answer, or if you want to get the string literals exactly as they appear in the source code, then a regular expression can do the job.
Given the string literal "\n", do you expect a single-character string containing a newline or the two characters backslash and n?
In the former case you need to interpret escape sequences exactly like a JavaScript interpreter does. What you need is a lexer for JavaScript, and many people have already programmed this piece of code.
In the latter case the regular expression has to recognize escape sequences like \x40 and \u2026, so even in that case you should copy the code from an existing JavaScript lexer.
See https://github.com/douglascrockford/JSLint/blob/master/jslint.js, function tokenize.
Try the code below:
var txt = "var z,b \n;z=10;\n b='321`1123`321321';\n c='321`321`312`3123`';";
function fetchStrings(txt) {
    var result = [];
    for (var i = 0; i < txt.length; i++) {
        var quote = txt[i];
        // Possible string delimiters
        if (quote === "'" || quote === '"' || quote === "`") {
            // Find the matching closing delimiter (escape sequences are not handled)
            var end = txt.indexOf(quote, i + 1);
            if (end === -1) break; // unterminated string: stop scanning
            result.push(txt.slice(i + 1, end));
            // Jump to the end of the fetched string
            i = end;
        }
    }
    return result;
}
console.log(fetchStrings(txt));
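A slightly more careful variant, as a sketch only: it skips over backslash-escaped characters so that an escaped quote does not end the literal. It still does not understand `${...}` nesting inside template literals, comments, or regex literals such as /'/, which is exactly the hard part described in the question's updates. The name extractStringLiterals is hypothetical:

```javascript
// Return string literals exactly as they appear in the source, delimiters included.
function extractStringLiterals(source) {
  var literals = [];
  var i = 0;
  while (i < source.length) {
    var quote = source[i];
    if (quote === "'" || quote === '"' || quote === "`") {
      var j = i + 1;
      // Advance to the matching delimiter, skipping any backslash-escaped character.
      while (j < source.length && source[j] !== quote) {
        if (source[j] === "\\") {
          j++; // skip the escaped character
        }
        j++;
      }
      if (j >= source.length) {
        break; // unterminated literal: stop scanning
      }
      literals.push(source.slice(i, j + 1));
      i = j + 1;
    } else {
      i++;
    }
  }
  return literals;
}

console.log(extractStringLiterals('var a = "say \\"hi\\""; var b = \'`\';'));
```

Because the scanner only keeps a position and the current delimiter as state, the same idea could be adapted to chunked input by carrying that state across chunks.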
I'm a novice programmer making a simple calculator in JavaScript for a school project, and instead of using eval() to evaluate a string, I made my own function calculate(exp).
Essentially, my program uses order of operations (PEMDAS, or Parentheses, Exponents, Multiplication/Division, Addition/Subtraction) to evaluate a string expression. One of my regex patterns is like so ("mdi" for multiplication/division):
mdi = /(-?\d+(\.\d+)?)([\*\/])(-?\d+(\.\d+)?)/g; // line 36 on JSFiddle
What this does is:
-?\d+ finds an integer number
(\.\d+)? matches the decimal if there is one
[\*\/] matches the operator used (* or / for multiplication or division)
/g matches every occurrence in the string expression.
I loop through this regular expression's matches with the following code:
while((res = mdi.exec(exp)) !== null) { // line 69 on JSFiddle
exp = exp.replace(mdi,
function(match,$1,$3,$4,$5) {
if($4 == "*")
return parseFloat($1) * parseFloat($5);
else
return parseFloat($1) / parseFloat($5);
});
exp = exp.replace(doN,""); // this gets rid of double negatives
}
However, this does not work all the time. It only works with numbers with an absolute value less than 10. I cannot do any operations on numbers like 24 and -5232000321, even though the regex should match it with the + quantifier. It works with small numbers, but crashes and uses up most of my CPU when the numbers are larger than 10.
For example, when the expression 5*.5 is inputted, 2.5 is outputted, but when you input 75*.5 and press enter, the program stops.
I'm not really sure what's happening here, because I can't locate the source of the error for some reason - nothing is showing up even though I have console.log() all over my code for debugging, but I think it is something wrong with this regex. What is happening?
The full code (so far) is here at JSFiddle.net, but please be aware that it may crash. If you have any other suggestions, please tell me as well.
Thanks for any help.
The problem is
bzp = /^.\d/;
while((res = bzp.exec(result)) !== null) {
result = result.replace(bzp,
function($match) {
console.log($match + " -> 0 + " + $match);
return "0" + $match;
});
}
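The dot in /^.\d/ is not escaped, so it matches any character followed by a digit. Any result that starts with two digits (such as 37.5) matches, and it still matches after a zero is prepended, so the loop keeps prepending zeros with no limit.
Removing that code makes it work well.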
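If the leading-zero step is still wanted, one possible fix is to escape the dot and do a single replacement instead of looping. A sketch, assuming the intent was to turn results such as .5 into 0.5 (addLeadingZero is a hypothetical helper name):

```javascript
// Prepend "0" only when the result starts with a literal "." followed by a digit.
function addLeadingZero(result) {
  // \. matches a real dot; (?=\d) checks for a digit without consuming it.
  return String(result).replace(/^\.(?=\d)/, "0.");
}

console.log(addLeadingZero(".5"));
console.log(addLeadingZero("37.5"));
```

A single replace() call cannot loop forever, and "37.5" is left untouched because it does not start with a dot.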
I have also cleaned your code, declared variables, and made it more maintainable: Demo
If you have any other suggestions, please tell me as well.
As pointed out in the comments, parsing your input by iteratively applying regular expressions is very ad-hoc. A better approach would be to actually construct a grammar for your input language and parse based on that. Here's an example grammar that basically matches your input language:
expr ::= term ( additiveOperator term )*
term ::= factor ( multiplicativeOperator factor )*
factor ::= number | '(' expr ')'
additiveOperator ::= '+' | '-'
multiplicativeOperator ::= '*' | '/'
The syntax here is pretty similar to regular expressions, where parentheses denote groups, * denotes zero-or-more repetitions, and | denotes alternatives. The symbols enclosed in single quotes are literals, whereas everything else is symbolic. Note that this grammar doesn't handle unary operators (based on your post it sounds like you assume a single negative sign for negative numbers, which can be parsed by the number parser).
There are several parser-generator libraries for JavaScript, but I prefer combinator-style parsers where the parser is built functionally at runtime rather than having to run a separate tool to generate the code for your parser. Parsimmon is a nice combinator parser for JavaScript, and the API is pretty easy to wrap your head around.
A parser usually returns some sort of a tree data structure corresponding to the parsed syntax (i.e. an abstract syntax tree). You then traverse this data structure in order to calculate the value of the arithmetic expression.
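For comparison, the grammar above can also be turned into a small hand-written recursive descent evaluator, with one function per rule (the number-or-parenthesized-expression rule is called factor here). This is a minimal sketch, not the Parsimmon-based approach: it computes the value directly instead of building a tree, handles unsigned numbers only, and silently ignores characters the tokenizer does not recognize. The name evaluate is hypothetical:

```javascript
function evaluate(input) {
  // Tokenize into numbers, operators and parentheses.
  var tokens = input.match(/\d*\.\d+|\d+|[()+\-*\/]/g) || [];
  var pos = 0;

  function peek() { return tokens[pos]; }
  function next() { return tokens[pos++]; }

  // expr ::= term ( additiveOperator term )*
  function expr() {
    var value = term();
    while (peek() === "+" || peek() === "-") {
      value = next() === "+" ? value + term() : value - term();
    }
    return value;
  }

  // term ::= factor ( multiplicativeOperator factor )*
  function term() {
    var value = factor();
    while (peek() === "*" || peek() === "/") {
      value = next() === "*" ? value * factor() : value / factor();
    }
    return value;
  }

  // factor ::= number | '(' expr ')'
  function factor() {
    var token = next();
    if (token === "(") {
      var value = expr();
      if (next() !== ")") throw new SyntaxError("expected ')'");
      return value;
    }
    if (token === undefined || isNaN(token)) {
      throw new SyntaxError("unexpected token: " + token);
    }
    return parseFloat(token);
  }

  var result = expr();
  if (pos < tokens.length) throw new SyntaxError("unexpected token: " + peek());
  return result;
}

console.log(evaluate("75*.5"));
console.log(evaluate("(2+3)*4"));
```

Operator precedence falls out of the structure: term is called from expr, so multiplication and division bind more tightly than addition and subtraction.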
I created a fiddle demonstrating parsing and evaluating of arithmetic expressions. I didn't integrate any of this into your existing calculator interface, but if you can understand how to use the parser
Mathematical expressions are not parsed and calculated with regular expressions, because of the number of permutations and combinations involved. The fastest approach so far is postfix notation (reverse Polish notation); other notations are not as fast. As they mention on Wikipedia:
In comparison testing of reverse Polish notation with algebraic
notation, reverse Polish has been found to lead to faster
calculations, for two reasons. Because reverse Polish calculators do
not need expressions to be parenthesized, fewer operations need to be
entered to perform typical calculations. Additionally, users of
reverse Polish calculators made fewer mistakes than for other types of
calculator. Later research clarified that the increased speed
from reverse Polish notation may be attributed to the smaller number
of keystrokes needed to enter this notation, rather than to a smaller
cognitive load on its users. However, anecdotal evidence suggests
that reverse Polish notation is more difficult for users to learn than
algebraic notation.
Full article: Reverse Polish Notation
Here you can also see other notations that are still far better than regex:
Calculator Input Methods
I would therefore suggest you change your algorithm to a more efficient one; personally, I would prefer postfix.
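To illustrate why postfix is convenient to evaluate, here is a minimal stack-based sketch. It assumes the expression has already been converted into an array of postfix tokens (that conversion step is not shown), and evalPostfix is a hypothetical name:

```javascript
// Evaluate an array of postfix (reverse Polish) tokens, e.g. ["2", "3", "4", "*", "+"].
function evalPostfix(tokens) {
  var stack = [];
  tokens.forEach(function (token) {
    if (/^[+\-*\/]$/.test(token)) {
      // Operator: pop the two most recent operands and push the result.
      var right = stack.pop();
      var left = stack.pop();
      switch (token) {
        case "+": stack.push(left + right); break;
        case "-": stack.push(left - right); break;
        case "*": stack.push(left * right); break;
        case "/": stack.push(left / right); break;
      }
    } else {
      // Operand: push its numeric value (negative literals like "-2" land here too).
      stack.push(parseFloat(token));
    }
  });
  return stack.pop();
}

console.log(evalPostfix(["2", "3", "4", "*", "+"])); // 2 + 3 * 4
```

No parentheses or precedence rules are needed at evaluation time, because the token order already encodes them.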
I use match to split a mathematics expression into separated strings and save them in an array.
var STRING = ST.match(/\d*\.\d+|\d+|[()/*+-]/g);
but this method separates everything, including negative numbers that are inside parentheses.
For example, (-2+4) does not give me -2; instead it saves - in one index of the STRING array and 2 in the next index.
Is there any way to use match and keep negative numbers that are in the parentheses together?
This is what I want:
(-2+4):
STRING[0] give me (
STRING[1] give me -2
STRING[2] give me +
STRING[3] give me 4
STRING[4] give me )
and if there is no negative number work as normal:
(2+4):
STRING[0] give me (
STRING[1] give me 2
STRING[2] give me +
STRING[3] give me 4
STRING[4] give me )
I don't think it's possible to parse complex cases like "(-2+4*-(3.5--8))" with just a regex, especially given that we don't have negative lookbehind in JavaScript.
A solution would be to postprocess your match array by merging signs when they're between a separator and an unsigned expression.
In my opinion a regex is useful here, but only for the primary tokenization. Most of the work will be ahead of you as you'll build the binary expression tree (or any other formal representation you choose).
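The post-processing idea can be sketched as follows: tokenize with the original regex, then merge a "-" into the number that follows it whenever the "-" sits at the start of the input or right after an operator or an opening parenthesis. The function name tokenize is hypothetical, and a minus in front of a parenthesis, as in -(3), is deliberately left as a separate token:

```javascript
function tokenize(expression) {
  // Primary tokenization: unsigned numbers, operators and parentheses.
  var raw = expression.match(/\d*\.\d+|\d+|[()\/*+-]/g) || [];
  var tokens = [];
  for (var i = 0; i < raw.length; i++) {
    var prev = tokens[tokens.length - 1];
    // A "-" is a sign (not an operator) at the start or after "(" or another operator.
    var inSignPosition = prev === undefined || /^[(\/*+-]$/.test(prev);
    var nextIsNumber = i + 1 < raw.length && /^[\d.]/.test(raw[i + 1]);
    if (raw[i] === "-" && inSignPosition && nextIsNumber) {
      tokens.push("-" + raw[i + 1]); // merge sign and number into one token
      i++;                           // skip the number that was just consumed
    } else {
      tokens.push(raw[i]);
    }
  }
  return tokens;
}

console.log(tokenize("(-2+4)"));
console.log(tokenize("(4-2)"));
```

Unlike adding -? to the number pattern, this keeps the minus in (4-2) as an operator, because the previous token there is a number.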
Unfortunately, if what you're trying to do is parse a mathematical expression, regexps alone cannot be used.
RegExps can recognize languages that are describable by regular grammars, and arithmetic expressions are not; they are described by a context-free grammar (CFG). If you want to parse, and perhaps interpret the result, you'll certainly need some stack-based state machine.
You can look at something like this well known algorithm.
Hope this helps.
You can add an optional sign to the numbers, that would work with your example:
var STRING = ST.match(/-?\d*\.\d+|-?\d+|[()/*+-]/g);
However, that will also turn a minus operator into a sign. The expression (4-2) would give you { "(", "4", "-2", ")" }.
Also, it will happily "parse" an expression like +---((((*** without complaining. If you want a result that makes sense, you should parse it for real, not just split it with a regular expression.
I think you have a mistake in your RegExp. Try this; it works for me:
var STRING = ST.match(/(\d*)(\.)(\d+)|(\d+)|[()\/*+-]/g);
I am trying to write some JavaScript RegEx to replace user-entered tags with real HTML tags, so [b] will become <b> and so forth. The RegEx I am using looks like this:
var exptags = /\[(b|u|i|s|center|code){1}]((.){1,}?)\[\/(\1){1}]/ig;
with the following JavaScript
s.replace(exptags,"<$1>$2</$1>");
This works fine for tags that are not nested, for example:
[b]hello[/b] [u]world[/u]
But if the tags are nested inside each other, it will only match the outer tags, for example:
[b]foo [u]to the[/u] bar[/b]
This will only match the b tags. How can I fix this? Should I just loop until the starting string is the same as the outcome? I have a feeling that the ((.){1,}?) pattern is wrong also?
Thanks
The easiest solution would be to replace all the tags, whether they are closed or not, and let .innerHTML work out whether they are matched. It will be much more resilient that way.
var tagreg = /\[(\/?)(b|u|i|s|center|code)]/ig
div.innerHTML="[b][i]helloworld[/b]".replace(tagreg, "<$1$2>") //no closing i
//div.innerHTML=="<b><i>helloworld</i></b>"
AFAIK you can't express recursion with regular expressions.
You can however do that with .NET's System.Text.RegularExpressions using balanced matching. See more here: http://blogs.msdn.com/bclteam/archive/2005/03/15/396452.aspx
If you're using .NET you can probably implement what you need with a callback.
If not, you may have to roll your own little javascript parser.
Then again, if you can afford to hit the server you can use the full parser. :)
What do you need this for, anyway? If it is for anything other than a preview I highly recommend doing the processing server-side.
You could just repeatedly apply the regexp until it no longer matches. That would do odd things like "[b][b]foo[/b][/b]" => "<b>[b]foo</b>[/b]" => "<b><b>foo</b></b>", but as far as I can see the end result will still be a sensible string with matching (though not necessarily properly nested) tags.
Or if you want to do it 'right', just write a simple recursive descent parser. Though people might expect "[b]foo[u]bar[/b]baz[/u]" to work, which is tricky to recognise with a parser.
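As a middle ground between repeated regex replacement and a full recursive descent parser, a single pass with a stack of open tags is one option. This is a sketch under assumptions: bbcodeToHtml is a hypothetical name, a closing tag that does not match the innermost open tag is left as plain text, tags still open at the end are closed automatically, and the user's text is not HTML-escaped here:

```javascript
function bbcodeToHtml(input) {
  var tagPattern = /\[(\/?)(b|u|i|s|center|code)\]/gi;
  var openTags = [];   // names of the tags that are currently open, innermost last
  var output = "";
  var lastIndex = 0;
  var match;
  while ((match = tagPattern.exec(input)) !== null) {
    // Copy the plain text between the previous tag and this one.
    output += input.slice(lastIndex, match.index);
    lastIndex = tagPattern.lastIndex;
    var name = match[2].toLowerCase();
    if (match[1] === "") {
      openTags.push(name);
      output += "<" + name + ">";
    } else if (openTags[openTags.length - 1] === name) {
      openTags.pop();
      output += "</" + name + ">";
    } else {
      output += match[0]; // mismatched closing tag: keep it as text
    }
  }
  output += input.slice(lastIndex);
  // Close anything the user forgot to close.
  while (openTags.length > 0) {
    output += "</" + openTags.pop() + ">";
  }
  return output;
}

console.log(bbcodeToHtml("[b]foo [u]to the[/u] bar[/b]"));
```

Nested tags are handled in one pass, and the output always has balanced HTML tags; overlapping input such as "[b]foo[u]bar[/b]baz[/u]" is not rearranged, only kept well-formed.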
The reason the nested block doesn't get replaced is because the match, for [b], places the position after [/b]. Thus, everything that ((.){1,}?) matches is then ignored.
It is possible to write a recursive parser on the server side; Perl uses qr// and Ruby probably has something similar.
You don't necessarily need true recursion, though. You can use a relatively simple loop to handle the string equivalently:
var s = '[b]hello[/b] [u]world[/u] [b]foo [u]to the[/u] bar[/b]';
var exptags = /\[(b|u|i|s|center|code){1}]((.){1,}?)\[\/(\1){1}]/ig;
while (s.match(exptags)) {
s = s.replace(exptags, "<$1>$2</$1>");
}
document.writeln('<div>' + s + '</div>'); // after
In this case, it'll make 2 passes:
0: [b]hello[/b] [u]world[/u] [b]foo [u]to the[/u] bar[/b]
1: <b>hello</b> <u>world</u> <b>foo [u]to the[/u] bar</b>
2: <b>hello</b> <u>world</u> <b>foo <u>to the</u> bar</b>
Also, a few suggestions for cleaning up the RegEx:
var exptags = /\[(b|u|i|s|center|code)\](.+?)\[\/(\1)\]/ig;
{1} is assumed when no other count specifiers exist
{1,} can be shortened to +
Agree with Richard Szalay, but his regex didn't get quoted right:
var exptags = /\[(b|u|i|s|center|code)](.*)\[\/\1]/ig;
is cleaner. Note that I also changed .+? to .*. There are two problems with .+?:
you won't match [u][/u], since there isn't at least one character between them (+)
a non-greedy match won't deal as nicely with the same tag nested inside itself (?)
Yes, you will have to loop. Alternatively, since your tags look so much like HTML ones, you could replace [b] with <b> and [/b] with </b> separately. (.){1,}? is the same as (.*?) - that is, any symbols, least possible sequence length.
Updated: Thanks to MrP, (.){1,}? is (.)+?, my bad.
How about:
tagreg=/\[(\/?)(b|u|i|s|center|code)\]/gi;
"[b][i]helloworld[/i][/b]".replace(tagreg, "<$1$2>");
"[b]helloworld[/b]".replace(tagreg, "<$1$2>");
For me the above produces:
<b><i>helloworld</i></b>
<b>helloworld</b>
This appears to do what you want, and has the advantage of needing only a single pass.
Disclaimer: I don't code often in JS, so if I made any mistakes please feel free to point them out :-)
You are right about the inner pattern being troublesome.
((.){1,}?)
That captures a single character, repeated at least once, and then captures the whole run again. Every character inside your tag will be captured as a group.
You are also capturing your closing element name when you don't need it, and you are using {1} when that is implied. Below is a cleaned-up version:
/\[(b|u|i|s|center|code)](.+?)\[\/\1]/ig
Not sure about the other problem.