I'm writing a JavaScript plugin, run on every page load, that replaces every matching structure with a link... That link points to a web application/database, a resource for coders of the Mount&Blade game.
In theory it's easy, but I've hit a huge obstacle on my way to success: regular expressions.
Even with the help of a program named QuickRegex I can't get the structure to match, or, if I don't condition it properly, it outputs wrong results. The structure to match is as follows:
(item_set_slot, "itm_heavy_crossbow", slot_item_multiplayer_item_class),
I want to pick out item_set_slot and turn it into a link to http://mbcommands.ollclan.eu/#$1
This is the code I'm using, which works, more or less. ;)
/* Mount&Blade Command Database Linking by Swyter */
function swymbcommandshooker(){
/* Regular HTML Expressions */
document.getElementsByTagName("body")[0].innerHTML=document.getElementsByTagName("body")[0].innerHTML.replace(/\(([a-zA-Z_]+),/gi, "(<a href='http://mbcommands.ollclan.eu/#$1' title='[?] Take a look in the Command Database' target='_blank'>$1</a>,");
/* Python highlighter Support...*/
document.getElementsByTagName("body")[0].innerHTML=document.getElementsByTagName("body")[0].innerHTML.replace(/\(<\/span>([_a-z]+),/gi, "(</span><a href='http://mbcommands.ollclan.eu/#$1' title='[?] Take a look in the Command Database' target='_blank'>$1</a>,");
}
addOnloadHook( swymbcommandshooker );
Thanks in advance.
Hm, I'm not sure if I have understood you correctly, but if you really just want to match "item_set_slot" in "(item_set_slot, "itm_heavy_crossbow", slot_item_multiplayer_item_class)," the following regex should do:
/^\(([a-z_]+),/i
The JavaScript to generate the URL could look like this:
var tuple = '(item_set_slot, "itm_heavy_crossbow", slot_item_multiplayer_item_class),';
var url = tuple.replace(/^\(([a-z_]+),.*/i, 'http://mbcommands.ollclan.eu/#$1');
Note the appended .* in the regex, which is needed to match the rest of the tuple.
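To apply the same idea across the whole page the way the plugin does, here is a sketch, assuming each command tuple starts at the beginning of a line in the page markup (the m flag anchors ^ at every line start, g links every tuple):
// Sketch: link the first identifier of every tuple in the page body.
var body = document.getElementsByTagName("body")[0];
body.innerHTML = body.innerHTML.replace(/^\(([a-z_]+),/gim,
    "(<a href='http://mbcommands.ollclan.eu/#$1' target='_blank'>$1</a>,");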
I'm trying to implement an asymmetrical search for a dictionary web app, so searching for ü, for example, will return only tokens that actually contain ü, but searching for u will return both u and ü. (This is so users who don't know how to type special characters can still search for them, but users who do know how to type them won't be inundated with the plain character forms unnecessarily.)
It has to all be client-side JavaScript without any external libraries.
I've managed to make the second search type work by running both the search term and the text I'm searching through the following function, effectively merging special characters with their plain counterparts:
function cleanUp(dirty) {
var cleaned = dirty.replace(/[áàâãäāă]/ig,"a");
cleaned = cleaned.replace(/đ/ig,"d");
cleaned = cleaned.replace(/[éèêẽëēĕ]/ig,"e");
cleaned = cleaned.replace(/[íìîĩïīĭ]/ig,"i");
cleaned = cleaned.replace(/ñ/ig,"n");
cleaned = cleaned.replace(/[óòôõöōŏ]/ig,"o");
cleaned = cleaned.replace(/[úùûũüūŭ]/ig,"u");
return cleaned;
}
I then compare the strings to get my results with something like:
var search_term = cleanUp(search_input.value);
var text_to_search = cleanUp(main_text);
if (text_to_search.indexOf(search_term) > -1) ... //do something
It's not elegant, but it works. After cleaning up both strings, the user can search for e.g. uber and get über even if they don't know how to type ü. But if they do know how, searching for über directly also returns things like uber, which is what I don't want.
I've already thought of things like checking for each special character separately for each search term or duplicating every dictionary entry that has a special character to produce a special-character and a plain-character version, but all of my ideas would seriously slow down the processing time for the search.
Any ideas are greatly appreciated.
The answer you posted sounds quite reasonable.
I would just like to suggest a cleaner way (pun intended) to code your cleanUp() function and similar functions that do a series of string operations:
function cleanUp(dirty) {
return dirty
.replace(/[áàâãäāă]/ig,"a")
.replace(/đ/ig,"d")
.replace(/[éèêẽëēĕ]/ig,"e")
.replace(/[íìîĩïīĭ]/ig,"i")
.replace(/ñ/ig,"n")
.replace(/[óòôõöōŏ]/ig,"o")
.replace(/[úùûũüūŭ]/ig,"u");
}
I ended up checking whether the search term contained any special characters; if it did, I didn't run it through cleanUp(), and compared it to the original dictionary entry instead of the cleaned one. Thanks for the comments, everyone.
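For reference, a minimal sketch of that check (matches is a helper name I made up; the character class just mirrors the characters cleanUp() handles):
var specialChars = /[áàâãäāăđéèêẽëēĕíìîĩïīĭñóòôõöōŏúùûũüūŭ]/i;

function matches(term, entry) {
    if (specialChars.test(term)) {
        // The term contains a special character: match the raw entry only.
        return entry.indexOf(term) > -1;
    }
    // Plain term: match against the cleaned-up entry.
    return cleanUp(entry).indexOf(term) > -1;
}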
We write JS programs for clients that allow them to craft the display text. Here is what we did:
We have a raw JS file in which those strings are replaced with tokens, for example:
month = [_MonthToken_];
name = '_NameToken_';
and an XML file that lets the user specify the text, like:
<xml>
<token name="MonthToken">'Jan','Feb','March'</token>
<token name="NameToken">Alice</token>
</xml>
and a generator that replaces the tokens with the text and produces the final JS file:
month = ['Jan','Feb','March'];
name = 'Alice';
However, I found a bug in this scenario. When somebody specifies the name to be "D'Angelo" (for example), the JS will run into an error, because the name variable will become
name='D'Angelo'
We have thought of several ways to fix the problem, but none of them are perfect.
We could ask our clients to escape the characters, but that seems inappropriate given that they may not know JS, and there are more characters to escape (" among them), which could make them unhappy :|
We also thought of changing the generator to escape ', but sometimes the text replaces an array, where the single quotes should not be escaped. (There are other cases; we could detect them case by case, but it is tedious.)
We may have done something wrong in the whole scenario/architecture, but we don't want to change that unless we've confirmed it is definitely necessary.
So, is there any solution? I will look into every idea. Thanks in advance!
(I may also need a better title :P)
I think your XML schema is poorly designed, and this is the root cause of your problems.
Basically, you are forcing the author of the XML to put JavaScript code inside the name="MonthToken" element, while expecting that she can do this without any knowledge of JavaScript syntax. I guess you are planning to use eval on the parsed element content to build the month and name variables.
The problem you discovered is not the only one: you are also subject to JavaScript code injection. What if a user forges an element such as:
<token name="MonthToken">alert('put some evil instruction here')</token>
I would suggest changing the XML schema in this way:
<xml>
<token name="MonthToken">Jan</token>
<token name="MonthToken">Feb</token>
<token name="MonthToken">March</token>
<token name="NameToken">Alice</token>
</xml>
Then in your generator, you'll have to parse each MonthToken element's content and add it to the month array. Do the same for the name variable.
In this way:
You don't use eval, so there is no possibility of code injection
Your user no longer has to know how to quote month names
You automatically handle quotes and apostrophes in names, because you are not using them as JS code
If you want the month variable to become a string when the user enters just one month, then simply transform the variable with something like this:
if (month.length == 1) {
month = month[0];
}
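To make the quoting bullet-proof, the generator can serialize whatever it reads from the XML with JSON.stringify, which escapes quotes and apostrophes for you. A minimal sketch, assuming the generator itself runs in JavaScript (emitJs and its parameters are names I made up):
// months: text contents of the <token name="MonthToken"> elements
// name: content of the <token name="NameToken"> element
function emitJs(months, name) {
    // JSON.stringify produces valid JS string/array literals,
    // so a name like D'Angelo comes out correctly escaped.
    var monthValue = months.length === 1 ? months[0] : months;
    return "month = " + JSON.stringify(monthValue) + ";\n" +
           "name = " + JSON.stringify(name) + ";";
}

emitJs(["Jan", "Feb", "March"], "D'Angelo");
// -> month = ["Jan","Feb","March"];
//    name = "D'Angelo";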
I'm the author of doctest, quick and dirty doctests for JavaScript and CoffeeScript. I'd like to make the library less dirty by using a JavaScript parser rather than regular expressions to locate comments.
I'd like to use Esprima or Acorn to do the following:
Create an AST
Walk the tree, and for each comment node:
Create an AST from the comment node's text
Replace the comment node in the main tree with this subtree
Input:
!function() {
// > toUsername("Jesper Nøhr")
// "jespernhr"
var toUsername = function(text) {
return ('' + text).replace(/\W/g, '').toLowerCase()
}
}()
Output:
!function() {
doctest.input(function() {
return toUsername("Jesper Nøhr")
});
doctest.output(4, function() {
return "jespernhr"
});
var toUsername = function(text) {
return ('' + text).replace(/\W/g, '').toLowerCase()
}
}()
I don't know how to do this. Acorn provides a walker which takes a node type and a function, and walks the tree invoking the function each time a node of the specified type is encountered. This seems promising, but doesn't apply to comments.
With Esprima I can use esprima.parse(input, {comment: true, loc: true}).comments to get the comments, but I'm not sure how to update the tree.
Most AST-producing parsers throw away comments. I don't know what Esprima or Acorn do, but that might be the issue.
.... in fact, Esprima lists comment capture as a current bug:
http://code.google.com/p/esprima/issues/detail?id=197
... Acorn's code is right there in GitHub. It appears to throw comments away, too.
So, it looks like you either get to fix one of the parsers to capture the comments first, at which point your task should be straightforward, or you're stuck.
Our DMS Software Reengineering Toolkit has JavaScript parsers that capture comments in the tree. It also has language substring parsers, which could be used to parse the comment text into JavaScript ASTs of whatever type the comment represents (e.g., function declaration, expression, variable declaration, ...), and the support machinery to graft such new ASTs into the main tree. If you are going to manipulate ASTs, this substring capability is likely important: most parsers won't parse arbitrary language fragments; they are wired only to parse "whole programs". For DMS, there are no comment nodes to replace; there are comments associated with AST nodes, so the grafting process is a little trickier than just "replace comment nodes". Still pretty easy.
I'll observe that most parsers (including these) read the source and break it into tokens by using or applying the equivalent of regular expressions. So, if you are already using regexes to locate comments (which means using them to locate *non*comments to throw away as well, e.g., you need to recognize string literals that contain comment-like text and ignore them), you are doing as well as the parsers would anyway in terms of finding the comments. And if all you want to do is replace them exactly with their content, echoing the source stream with the comment prefix/suffix /* */ stripped will apparently do exactly what you want, so all this parsing machinery seems like overkill.
You can already use Esprima to achieve what you want:
Parse the code, get the comments (as an array).
Iterate over the comments, see if each is what you are interested in.
If you need to transform the comment, note its range. Collect all transformations.
Apply the transformations from last to first so that the ranges are not shifted.
The trick here is not to change the AST. Simply apply the text change as if you were doing a typical search-and-replace on the source string directly. Because the positions of the replacements might shift, you need to collect everything and then apply them starting from the last one. For an example of how to carry out such a transformation, take a look at my blog post "From double-quotes to single-quotes" (it deals with string quotes, but the principle remains the same).
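A minimal sketch of those four steps (rewriteComments and its callback are names I made up; the classic esprima API with {comment: true, range: true} is assumed):
var esprima = require('esprima');

function rewriteComments(source, rewrite) {
    // Step 1: parse and collect the comments, with their source ranges.
    var comments = esprima.parse(source, { comment: true, range: true }).comments;
    // Steps 2-3: decide which comments to transform and collect the edits.
    var edits = [];
    comments.forEach(function (comment) {
        var replacement = rewrite(comment); // return null to keep the comment
        if (replacement != null) {
            edits.push({ range: comment.range, text: replacement });
        }
    });
    // Step 4: apply the edits back-to-first so earlier ranges stay valid.
    edits.sort(function (a, b) { return b.range[0] - a.range[0]; });
    edits.forEach(function (edit) {
        source = source.slice(0, edit.range[0]) + edit.text + source.slice(edit.range[1]);
    });
    return source;
}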
Last but not least, you might want to use a slightly higher-level utility such as Rocambole.
I have a bbcode -> html converter that responds to the change event in a textarea. Currently, this is done using a series of regular expressions, and there are a number of pathological cases. I've always wanted to sharpen the pencil on this grammar, but didn't want to get into yak shaving. But... recently I became aware of pegjs, which seems a pretty complete implementation of PEG parser generation. I have most of the grammar specified, but am now left wondering whether this is an appropriate use of a full-blown parser.
My specific questions are:
As my application relies on translating what I can to HTML and leaving the rest as raw text, does implementing bbcode using a parser that can fail on a syntax error make sense? For example: [url=/foo/bar]click me![/url] would certainly be expected to succeed once the closing bracket on the close tag is entered. But what would the user see in the meantime? With regex, I can just ignore non-matching stuff and treat it as normal text for preview purposes. With a formal grammar, I don't know whether this is possible because I am relying on creating the HTML from a parse tree and what fails a parse is ... what?
I am unclear where the transformations should be done. In a formal lex/yacc-based parser, I would have header files and symbols that denoted the node type. In pegjs, I get nested arrays with the node text. I can emit the translated code as an action of the pegjs generated parser, but it seems like a code smell to combine a parser and an emitter. However, if I call PEG.parse.parse(), I get back something like this:
[
[
"[",
"img",
"",
[
"/",
"f",
"o",
"o",
"/",
"b",
"a",
"r"
],
"",
"]"
],
[
"[/",
"img",
"]"
]
]
given a grammar like:
document
= (open_tag / close_tag / new_line / text)*
open_tag
= ("[" tag_name "="? tag_data? tag_attributes? "]")
close_tag
= ("[/" tag_name "]")
text
= non_tag+
non_tag
= [^\n\[\]]
new_line
= ("\r\n" / "\n")
I'm abbreviating the grammar, of course, but you get the idea. So, as you can see, there is no contextual information in the array of arrays that tells me what kind of node I have, and I'm left to do the string comparisons again even though the parser has already done this. I expect it's possible to define callbacks and use actions to run them during a parse, but there is scant information available on the Web about how one might do that.
Am I barking up the wrong tree? Should I fall back to regex scanning and forget about parsing?
Thanks
First question (grammar for incomplete texts):
You can add
incomplete_tag = ("[" tag_name "="? tag_data? tag_attributes?)
// the closing bracket is omitted ---^
after open_tag and change document to include an incomplete tag at the end, as sketched below. The trick is that you provide the parser with all the productions it needs to always succeed, but the valid ones come first. You can then ignore incomplete_tag during the live preview.
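For example, the document rule could become (a sketch; incomplete_tag is only tried at the very end of the input, once the valid alternatives have failed):
document
  = (open_tag / close_tag / new_line / text)* incomplete_tag?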
Second question (how to include actions):
You write so-called actions after expressions. An action is JavaScript code enclosed by braces, and it is allowed after any pegjs expression, i.e. also in the middle of a production!
In practice, actions like { return result.join("") } are almost always necessary because pegjs splits the input into single characters. Complicated nested arrays can also be returned. Therefore I usually write helper functions in the pegjs initializer at the head of the grammar to keep the actions small. If you choose the function names carefully, the actions are self-documenting.
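A small illustration (a hypothetical fragment; flatten is a helper name I made up):
{
  // Initializer: helpers defined here are visible to all actions below.
  function flatten(chars) { return chars.join(""); }
}

text
  = chars:non_tag+ { return flatten(chars); }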
For an example, see PEG for Python style indentation. Disclaimer: this is an answer of mine.
Regarding your first question, I have to say that a live preview is going to be difficult. The problems you pointed out, namely that the parser won't understand that the input is "work in progress", are real. Peg.js tells you at which point the error is, so maybe you could take that info, go a few words back and parse again, or, if an end tag is missing, try adding it at the end.
The second part of your question is easier, but your grammar won't look as nice afterwards. Basically, what you do is put callbacks on every rule, for example:
text
= text:non_tag+ {
// we captured the text in an array and can manipulate it now
return text.join("");
}
At the moment you have to write these callbacks inline in your grammar. I'm doing a lot of this stuff at work right now, so I might make a pull request to peg.js to fix that, but I'm not sure when I'll find the time.
Try something like this replacement rule. You're on the right track; you just have to tell it to assemble the results.
text
= result:non_tag+ { return result.join(''); }
We have a JavaScript function we use to track page stats internally. However, the URLs it reports often include the page numbers of search results pages, which we would rather not have reported. The reported pages are of the form:
http://www.test.com/directory1/2
http://www.test.com/directory1/subdirectory1/15
http://www.test.com/directory3/1113
Instead we'd like the above reported as:
http://www.test.com/directory1
http://www.test.com/directory1/subdirectory1
http://www.test.com/directory3
Please note that the numbered 'directory' and 'subdirectory' names above are just for example purposes and that the actual subdirectory names are all different, don't necessarily include numbers at the end of the directory name, and can be many levels deep.
Currently our JavaScript function produces these URLs using the code:
var page = location.hostname+document.location.pathname;
I believe we need to use the JavaScript replace function in combination with some regex but I'm at a complete loss as to what that would look like. Any help would be much appreciated!
Thanks in advance!
I think you want this:
var page = location.href.substring(0,location.href.lastIndexOf("/"));
You can use a regex for this:
document.location.pathname.replace(/\/\d+$/, "");
Unlike substring and lastIndexOf solutions, this will strip off the end of the path if it consists of digits only.
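Combined with the hostname, as in the original tracking code:
var page = location.hostname + document.location.pathname.replace(/\/\d+$/, "");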
What you can do is find the last index of "/" and then use the substring function.
Not sure you need a regex if you're just pulling off the last slash + content.
http://www.w3schools.com/jsref/jsref_lastIndexOf.asp
I'd probably use that to search for the last "/" character, then do a substring from the start of the string to that index.
How about this:
var page = location.href.split("/");
page.pop();
page = page.join("/");
I would think you would need to use .htaccess with rewrite rules to change the look of the URL; however, I am still looking to see whether this is available to JavaScript. Will repost when I find out more.
EDIT:
lastIndexOf only gives you the position, therefore you would still need to replace. E.g.:
var temp = page.substring(page.lastIndexOf("/"));
page = page.replace(temp, "");
Unfortunately I'm not that advanced in my coding, so there is probably more efficient code in the other answers. Sorry for any inconvenience with my initial answer.