How to match unicode by writing system (SCRIPT)?

Sorry for my English; I am using Google Translate.
What regular expression can match characters by script? Example: \p{Script: Latin} | \p{Script: Zyyy} | \p{Script: Greek}
The main goal is to use the expression to check strings in PHP and JavaScript, but ideally it would work in any language.
It must also support negated matches, for example:
[^a] | [^\p{Script: Latin}]
The file below lists the code point ranges for each script. However, when I combine them into a single character class, the result is wrong: in particular, the class for the Zyyy script matches everything, which it should not.
http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt
Can someone please help me?

If you just need a single Unicode-aware regular expression, you may not want to pull in the entire XRegExp library + its Unicode plugin for just that. An alternative solution would be to use a build script that compiles the regular expression using Regenerate and the Unicode data packages.
Here’s what that would look like in Node.js:
var regenerate = require('regenerate');
// Latin script
var Latin = require('unicode-7.0.0/scripts/Latin/code-points');
// Greek script
var Greek = require('unicode-7.0.0/scripts/Greek/code-points');
var set = regenerate() // Start with an empty set.
  .add(Latin) // Add Latin script code points.
  .add(Greek); // Add Greek script code points.
// Print the result.
console.log(set.toString());
Run npm install regenerate unicode-7.0.0, and then run this script as follows:
node generate-regular-expression.js
It prints the following output:
[A-Za-z\xAA\xBA\xC0-\xD6\xD8-\xF6\xF8-\u02B8\u02E0-\u02E4\u0370-\u0373\u0375-\u0377\u037A-\u037D\u037F\u0384\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03E1\u03F0-\u03FF\u1D00-\u1D2A\u1D2C-\u1D77\u1D79-\u1DBF\u1E00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FC4\u1FC6-\u1FD3\u1FD6-\u1FDB\u1FDD-\u1FEF\u1FF2-\u1FF4\u1FF6-\u1FFE\u2071\u207F\u2090-\u209C\u2126\u212A\u212B\u2132\u214E\u2160-\u2188\u2C60-\u2C7F\uA722-\uA787\uA78B-\uA78E\uA790-\uA7AD\uA7B0\uA7B1\uA7F7-\uA7FF\uAB30-\uAB5A\uAB5C-\uAB5F\uAB64\uAB65\uFB00-\uFB06\uFF21-\uFF3A\uFF41-\uFF5A]|\uD800[\uDD40-\uDD8C\uDDA0]|\uD834[\uDE00-\uDE45]
This can be used directly as part of a regular expression literal.
The main advantage of this approach is that you’ll never have to tweak the regular expression manually. Instead, you can just change the script that generates it by adding or removing some symbols, then running it again. The code of the script is much more readable and maintainable than any regular expression, IMHO. Also, the output is as compact as possible: rather than introducing an entire library as a run-time dependency, you just insert a single regular expression literal.
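As a side note for readers on newer engines: ES2018 added Unicode property escapes, so where the runtime supports them (an assumption; the generated pattern above works without them) the script classes can be written directly with the u flag. A minimal sketch, with isLatinOrGreek and hasNonLatin as made-up helper names:

```javascript
// Assumes an engine with ES2018 Unicode property escapes (requires the "u" flag).

// Matches strings made up only of Latin or Greek script characters.
const latinOrGreek = /^[\p{Script=Latin}\p{Script=Greek}]+$/u;

// Negative case: matches any single character that is NOT Latin script.
// \P{Script=Latin} (capital P) is an equivalent spelling.
const notLatin = /[^\p{Script=Latin}]/u;

// Hypothetical helper names, chosen for illustration only.
function isLatinOrGreek(text) {
  return latinOrGreek.test(text);
}

function hasNonLatin(text) {
  return notLatin.test(text);
}
```

Digits and punctuation belong to the Common script, so isLatinOrGreek("abc1") is false unless \p{Script=Common} is added to the class. PCRE, which PHP uses, offers similar classes such as \p{Latin} and \p{Greek} when the /u modifier is set.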

Related

Parsing Javascript code using Flex parser

My goal is to mangle variable and function names and also encrypt strings in a JavaScript file.
For this I only need to separate strings, comments, and variable/function names.
I've tried UglifyJS2, but I need more control, so I tried to write a lexer myself using Flex.
I'm able to handle comments and quoted strings.
However, I'm stuck on regular expression literals, for example /"/: a regular expression containing a quote causes the lexer to fail.
It looks like correctly identifying a regular expression would require a Bison parser with grammar rules; otherwise comments, strings, and regular expressions get mixed up.
I don't want to go that far and use Bison.
One workaround is to move all regular expression code into functions in another file.
Is there any other alternative so that I can handle this in Flex itself?

If you can run JavaScript, you can use Esprima, a JavaScript parser written in JavaScript. It runs in the browser or in any runtime such as Node.js.
It can output just tokens or an abstract syntax tree. I believe this should be enough for you.

Highlighting HTML parts matching a perl regex - do it in server side perl or client side javascript?

A Perl CGI application provides a search function. The application writes matching snippets to the HTML page. Now I would like to highlight the matches inside the snippets. I could use something like
s/($searchregex)/<span class="highlight">$1<\/span>/gi
to highlight the matches. This works fine for text-only snippets, but sometimes breaks when a snippet itself contains HTML tags, e.g. links or images with references. In the failing cases the replacement destroys the HTML links by inserting the span tag inside the href value.
At the moment I see three possible solutions:
1. Write a regex that does not replace matches inside HTML tags, i.e. inside <>. I don't know how to write a replacement regex for this case. Is there a Perl regex that allows this replacement, and what does it look like?
2. Write a second regex that undoes the wrong replacements made by the one above. This would fix the wrong span tags inside the href.
3. Use JavaScript to highlight the matches inside the resulting DOM tree. Possible ways using jQuery are outlined in "highlight html with matching text". Even plain JavaScript may be enough (see "JavaScript's Regular Expression Flavor"). There are also special jQuery plugins for highlighting ("highlight regular expressions"). I am new to JavaScript, so more advice is appreciated.
What is the preferable solution? The best way would be option 1, but that does not seem possible. So the remaining question is: do the work in an ugly way on the server side, or introduce JavaScript to solve the problem in a cleaner way on the client side?

In Perl, add a lookahead after the pattern:
s/($searchregex)(?=[^>]*<)/<span class="highlight">$1<\/span>/gi
or shorter
s/$searchregex(?=[^>]*<)/<span class="highlight">$&<\/span>/gi
The lookahead only lets the pattern match when it is followed by any run of characters other than '>' and then a '<', which is never the case inside a tag. You may need to read the whole file into one string, or change the input record separator ($/) to '<': with $/ = "\n", the match fails whenever a newline separates the pattern from the next '<'.
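The same lookahead carries over to JavaScript, the other option raised in the question. A rough sketch, where highlight is a hypothetical helper and pattern is assumed to be a trusted regex source string:

```javascript
// Hypothetical helper: wraps every match of `pattern` that lies outside a tag.
// `pattern` is assumed to be a trusted regular-expression source string.
function highlight(html, pattern) {
  // The lookahead requires that the next angle bracket after the match is "<",
  // which is not true for text inside a tag such as an href value.
  const regex = new RegExp(`(${pattern})(?=[^>]*<)`, "gi");
  return html.replace(regex, '<span class="highlight">$1</span>');
}
```

In JavaScript [^>] also matches newlines, so running this over the whole snippet as one string sidesteps the record-separator issue. Text after the last tag, with no '<' following it, is left unhighlighted; that limitation comes from the lookahead itself.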

You could use an HTML parser on the server side, which is the correct tool for the job you are doing.
Or you could do it with javascript as you say, which I prefer myself as it is more versatile, and could lead to more interactivity, although you would probably be facing a similar issue to what you are facing now (just that you have moved it to the client side).
It is actually a more complex question than it first appears. Without more information, it is impossible to try to come up with a better solution.
One good solution would be to traverse the DOM tree and match against each text node, but you have a problem then that you would not match text that spans several text nodes - for example "John the Con Johnson" would not match the search for "John the Con" as they would be in separate nodes. This might or might not be a problem for you, depending on your use case.

How to implement Lexical Analysis in Javascript

Hey folks, thanks for reading
I am currently attempting to build a Google-style calculator. You input a string, it determines whether it can be calculated, and returns the result.
I began slowly with the basics: + - / * and parenthesis handling.
I want to improve the calculator over time, and having learned a bit about lexical analysis a while ago, I built a list of tokens and associated regular expression patterns.
This kind of work is easy with tools such as Lex and Yacc, except that I am developing a JavaScript-only application.
I tried to translate the idea into JavaScript, but I can't figure out how to handle everything in a clean and elegant way, especially nested parentheses.
Analysis
Let's define what a calculator query is:
// NON TERMINAL EXPRESSIONS //
query -> statement
query -> ε // means end of query
statement -> statement operator statement
statement -> ( statement )
statement -> prefix statement
statement -> number
number -> integer
number -> float
// TERMINAL EXPRESSIONS //
operator -> [+*/%^-]
prefix -> -
integer -> [0-9]+
float -> [0-9]+[.,][0-9]+
Javascript
Lexical analysis consists of verifying that the input contains nothing other than the terminal expressions: operator, prefix, integer and float (plus parentheses). This can be condensed into one regular expression
(I added spaces to make it more readable):
var calcPat =
/^ (\s*
( ([+/*%^-]) | ([0-9]+) | ([0-9]+[.,][0-9]+) | (\() | (\)) )
)+ \s* $/;
If this test passes, the query is lexically correct and needs to be grammar-checked to determine whether it can be calculated. This is the tricky part.
I am not going to paste code because it is neither clean nor easy to understand, but I will explain the process I followed and why I'm stuck:
I created a method called isStatement(string) that is supposed to call itself recursively. The main idea is to split the string into 'potential' statements and check whether each really is a statement and whether together they form one.
Process is the following:
-If the first two tokens are a number followed by an operator:
-Then,
-- If the remaining is just one token and it is a number:
--- Then this is a statement.
--- Else, check if the remaining tokens form a statement (recursive call)
-Else, If the first token is a parenthesis
-Then, Find matching closing parenthesis and check if what's inside is a statement (recursion)
-- Also check if there is something after closing parenthesis and if it forms a statement when associated with the parenthesis structure.
What's the problem?
My problem is that I cannot find the matching parenthesis when there are nested structures. How can I do that? Also, as you can see, this is not a particularly generic or clean grammar-checking algorithm. Do you have any ideas for improving it?
Thank you so much for having taken the time to read everything.
Gael
(PS: As you probably noticed, I am not a native English speaker! Sorry for any mistakes!)

You've got the right idea about what lexical analysis is, but you seem to have gotten confused about the distinction between the token grammar and the language grammar. Those are two different things.
The token grammar is the set of patterns (usually regular expressions) that describe the tokens for the language to be parsed. The regular expressions are expressions over a character set.
The language grammar (or target grammar, I suppose) is the grammar for the language you want to parse. This grammar is expressed in terms of tokens.
You cannot write a regular expression to parse algebraic notation. You just can't. You can write a grammar for it, but it's not a regular grammar. What you want to do is recognize separate tokens, which in your case could be done with a regular expression somewhat like what you've got. The trick is that you're not really applying that expression to the overall sentence to be parsed. Instead, you want to match a token at the current point in the sentence.
Now, because you've got Javascript regular expressions to work with, you could come up with a regular expression designed to match a string of tokens. The trick with that will be coming up with a way to identify which token was matched out of the list of possibilities. The Javascript regex engine can give you back arrays of groups, so maybe you could build something on top of that.
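One way to build on those groups, sketched here under the assumption that the engine supports named capture groups and the sticky flag (ES2018): give each token pattern its own named group, join them into one expression, and read back which group matched. TOKEN_RULES and tokenize are illustrative names, not anything from the question.

```javascript
// Hypothetical token table. Order matters: "float" must come before "integer",
// otherwise "3,5" would be split into integer "3" and an unexpected ",".
const TOKEN_RULES = [
  ["float", /[0-9]+[.,][0-9]+/],
  ["integer", /[0-9]+/],
  ["operator", /[+*\/%^-]/],
  ["lparen", /\(/],
  ["rparen", /\)/],
  ["space", /\s+/],
];

// One combined pattern with a named group per token type. The sticky flag "y"
// makes each exec() match exactly at lastIndex, i.e. at the current point.
const TOKEN_REGEX = new RegExp(
  TOKEN_RULES.map(([name, re]) => `(?<${name}>${re.source})`).join("|"),
  "y"
);

function tokenize(input) {
  const tokens = [];
  TOKEN_REGEX.lastIndex = 0;
  while (TOKEN_REGEX.lastIndex < input.length) {
    const start = TOKEN_REGEX.lastIndex;
    const match = TOKEN_REGEX.exec(input);
    if (!match) {
      throw new SyntaxError(`Unexpected character at position ${start}`);
    }
    // Exactly one named group is defined: that is the token type.
    const type = Object.keys(match.groups).find(
      (name) => match.groups[name] !== undefined
    );
    if (type !== "space") tokens.push({ type, value: match[0] });
  }
  return tokens;
}
```

For "2 * (3,5 + 1)" this yields integer, operator, lparen, float, operator, integer, rparen, which is the token stream the language grammar would then be checked against.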
edit — I'm trying to work out how you could put together a (somewhat) general-purpose tokenizer builder, starting from a list of separate regular expressions (one for each token). It's possibly not very complicated, and it'd be pretty fun to have around.
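To make the token grammar / language grammar split concrete, and to show that nested parentheses need no separate "find the matching parenthesis" step, here is a rough recursive-descent sketch. It is only one possible approach; evaluate is a made-up name, and %, ^ and the comma decimal separator from the question are left out to keep it short.

```javascript
// A sketch, not a full calculator: handles + - * /, unary minus and parentheses.
function evaluate(input) {
  // Token grammar: numbers, operators, parentheses. Characters the pattern
  // does not recognize are silently skipped here; a real lexer should reject them.
  const tokens = input.match(/[0-9]+(?:\.[0-9]+)?|[-+*\/()]/g) || [];
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];

  // expression -> term (("+" | "-") term)*
  function expression() {
    let value = term();
    while (peek() === "+" || peek() === "-") {
      const op = next();
      const rhs = term();
      value = op === "+" ? value + rhs : value - rhs;
    }
    return value;
  }

  // term -> factor (("*" | "/") factor)*
  function term() {
    let value = factor();
    while (peek() === "*" || peek() === "/") {
      const op = next();
      const rhs = factor();
      value = op === "*" ? value * rhs : value / rhs;
    }
    return value;
  }

  // factor -> "-" factor | "(" expression ")" | number
  function factor() {
    const token = next();
    if (token === "-") return -factor();
    if (token === "(") {
      // Nesting is handled by the recursion itself.
      const value = expression();
      if (next() !== ")") throw new SyntaxError('Expected ")"');
      return value;
    }
    if (token !== undefined && /^[0-9]/.test(token)) return parseFloat(token);
    throw new SyntaxError(`Unexpected token: ${token}`);
  }

  const result = expression();
  if (pos < tokens.length) throw new SyntaxError(`Unexpected token: ${peek()}`);
  return result;
}
```

Each grammar rule becomes one function, and a parenthesized sub-expression is just a recursive call to expression(), so an unbalanced input such as "(1 + 2" surfaces as a SyntaxError rather than a failed search for the closing parenthesis.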

Javascript lexer / tokenizer (in Python?)

Does anyone know of a Javascript lexical analyzer or tokenizer (preferably in Python?)
Basically, given an arbitrary Javascript file, I want to grab the tokens.
e.g.
foo = 1
becomes something like:
variable name : "foo"
whitespace
operator : equals
whitespace
integer : 1

http://code.google.com/p/pynarcissus/ has one.
I also wrote one, but it doesn't support automatic semicolon insertion, so it is pretty useless for JavaScript that you have no control over (almost all real-life JavaScript programs lack at least one semicolon) :) Here is mine:
http://bitbucket.org/santagada/jaspyon/src/tip/jaspyon/
The grammar is in jsgrammar.txt. It is parsed by the PyPy parsing lib (which you will have to download and extract from the PyPy source) and it builds a parse tree, which I walk in astbuilder.py.
But if you don't have licensing problems, I would go with pynarcissus. Here's a direct link to the code (ported from Narcissus):
http://code.google.com/p/pynarcissus/source/browse/trunk/jsparser.py

Parse JavaScript to instrument code

I need to split a JavaScript file into single instructions. For example
a = 2;
foo()
function bar() {
    b = 5;
    print("spam");
}
has to be separated into three instructions (assignment, function call and function definition).
Basically I need to instrument the code, injecting code between these instructions to perform checks. Splitting on ";" obviously wouldn't work, because instructions can also end with a newline, and I may not want to instrument code inside function and class definitions (I don't know yet). I took a course on grammars with Flex/Bison, but in this case the semantic action for the rule would be "print all the descendants in the parse tree and put my code at the end", which I don't think can be done with basic Bison. How do I do this? I also need to split the code because I need to interface with Python through python-spidermonkey.
Or... is there a library out there already which saves me from reinventing the wheel? It doesn't have to be in Python.

Why not use a JavaScript parser? There are lots, including a Python API for ANTLR and a Python wrapper around SpiderMonkey.

JavaScript is tricky to parse; you need a full JavaScript parser.
The DMS Software Reengineering Toolkit can parse full JavaScript and build a corresponding AST.
AST operators can then be used to walk over the tree to "split it". Even easier, however, is to apply source-to-source transformations that look for one surface syntax (JavaScript) pattern and replace it with another. You can use such transformations to insert the instrumentation into the code, rather than splitting the code to make holes in which to do the insertions. After the transformations are complete, DMS can regenerate valid JavaScript code (complete with the original comments if unaffected).

Why not use an existing JavaScript interpreter like Rhino (Java) or python-spidermonkey (not sure whether this one is still alive)? It will parse the JS and then you can examine the resulting parse tree. I'm not sure how easy it will be to recreate the original code but that mostly depends on how readable the instrumented code must be. If no one ever looks at it, just generate a really compact form.
pyjamas might also be of interest; this is a Python to JavaScript transpiler.
[EDIT] While this doesn't solve your problem at first glance, you might use it for a different approach: Instead of instrumenting JavaScript, write your code in Python instead (which can be easily instrumented; all the tools are already there) and then convert the result to JavaScript.
Lastly, if you want to solve your problem in Python but can't find a parser: Use a Java engine to add comments to the code which you can then search for in Python to instrument the code.

Why not try a javascript beautifier?
For example http://jsbeautifier.org/
Or see "Command line JavaScript code beautifier that works on Windows and Linux".

Forget my parser. https://bitbucket.org/mvantellingen/pyjsparser is a great and complete parser. I've fixed a couple of its bugs here: https://bitbucket.org/nullie/pyjsparser
