Registering a new stemmer function in lunr for Greek words doesn't work as expected. Here is my code on CodePen. I am not receiving any errors, and the stemWord() function works fine when used separately, but it fails to stem the words in lunr.
Below is a sample of the code:
function stemWord(w) {
  // code that returns the stemmed word
}
// create the new function
var greekStemmer = function (token) {
  return stemWord(token);
};
// register it with lunr.Pipeline, this allows you to still serialise the index
lunr.Pipeline.registerFunction(greekStemmer, 'greekStemmer')
var index = lunr(function () {
  this.field('title', {boost: 10})
  this.field('body')
  this.ref('id')
  this.pipeline.remove(lunr.trimmer) // it doesn't work well with non-Latin characters
  this.pipeline.add(greekStemmer)
})
index.add({
  id: 1,
  title: 'ΚΑΠΟΙΟΣ',
  body: 'Foo foo foo!'
})

index.add({
  id: 2,
  title: 'ΚΑΠΟΙΕΣ',
  body: 'Bar bar bar!'
})

index.add({
  id: 3,
  title: 'ΤΙΠΟΤΑ',
  body: 'Bar bar bar!'
})
In lunr a stemmer is implemented as a pipeline function. A pipeline function is executed against each word in a document when indexing the document, and each word in a search query when searching.
For a function to work in a pipeline it has to implement a very simple interface. It needs to accept a single string as input, and it must respond with a string as its output.
So a very simple (and useless) pipeline function would look like the following:
var simplePipelineFunction = function (word) {
  return word
}
To actually make use of this pipeline function we need to do two things:
Register it as a pipeline function; this allows lunr to correctly serialise and deserialise your pipeline.
Add it to your index's pipeline.
That would look something like this:
// registering our pipeline function with the name 'simplePipelineFunction'
lunr.Pipeline.registerFunction(simplePipelineFunction, 'simplePipelineFunction')

var idx = lunr(function () {
  // adding the pipeline function to our index's pipeline
  // when defining the pipeline
  this.pipeline.add(simplePipelineFunction)
})
Now you can take the above and swap out the implementation of the pipeline function. Instead of just returning the word unchanged, it could use the Greek stemmer you have found to stem the word, maybe like this:
var myGreekStemmer = function (word) {
  // I don't know how to use the Greek stemmer, but I think
  // it's safe to assume it won't be that different from this
  return greekStem(word)
}
Adapting lunr to work with a language other than English requires more than just adding your stemmer though. The default language of lunr is English, and so, by default, it includes pipeline functions that are specialised for English. English and Greek are different enough that you will probably run into issues trying to index Greek words with the English defaults, so we need to do the following:
Replace the default stemmer with our language specific stemmer
Remove the default trimmer which doesn't play so nice with non-latin characters
Replace/remove the default stop word filter; it's unlikely to be much use on a language other than English.
The trimmer and stop word filter are implemented as pipeline functions, so implementing language-specific versions works the same way as for the stemmer.
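For illustration, here is a minimal sketch of what those two could look like, assuming the same lunr 0.x pipeline interface used above (a token is a plain string, and returning undefined drops it from the index); the Greek character range and the stop word list are placeholders rather than a complete implementation:

var greekTrimmer = function (token) {
  // strip leading/trailing characters that are not Greek or Latin letters
  // (\u0370-\u03FF is the Unicode Greek and Coptic block)
  return token
    .replace(/^[^\u0370-\u03FFa-zA-Z]+/, '')
    .replace(/[^\u0370-\u03FFa-zA-Z]+$/, '')
}
lunr.Pipeline.registerFunction(greekTrimmer, 'greekTrimmer')

// a tiny illustrative subset, not a real Greek stop word list
var greekStopWords = ['και', 'το', 'να', 'σε']
var greekStopWordFilter = function (token) {
  // returning undefined excludes the token from the index
  if (greekStopWords.indexOf(token) === -1) return token
}
lunr.Pipeline.registerFunction(greekStopWordFilter, 'greekStopWordFilter')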
So, to set up lunr for Greek you would have this:
var idx = lunr(function () {
  this.pipeline.after(lunr.stemmer, greekStemmer)
  this.pipeline.remove(lunr.stemmer)

  this.pipeline.after(lunr.trimmer, greekTrimmer)
  this.pipeline.remove(lunr.trimmer)

  this.pipeline.after(lunr.stopWordFilter, greekStopWordFilter)
  this.pipeline.remove(lunr.stopWordFilter)

  // define the index as normal
  this.ref('id')
  this.field('title')
  this.field('body')
})
For some more inspiration you can take a look at the excellent lunr-languages project; it has many examples of creating language extensions for lunr. You could even submit one for Greek!
EDIT: It looks like I don't know the lunr.Pipeline API as well as I thought; there is no replace function. Instead we insert the replacement after the function to be removed, and then remove it.
EDIT: Adding this to help others in the future... It turns out the problem was down to the casing of the tokens within lunr. lunr wants to treat all tokens as lowercase; this is done, without any configurability, in the tokenizer. For most language processing functions this is not a problem; indeed, most assume words are lowercased. In this case, the Greek stemmer only stems uppercase words due to the complexity of stemming in Greek (I'm not a Greek speaker so can't comment on how much more complex that stemming is). The solution is to convert to uppercase before calling the Greek stemmer, then convert back to lowercase before passing the tokens on to the rest of the pipeline.
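A minimal sketch of that wrapper, assuming stemWord() is the Greek stemmer from the question and only handles uppercase input:

var greekStemmer = function (token) {
  // lunr's tokenizer lowercases every token; the Greek stemmer expects
  // uppercase input, so convert up, stem, then convert back down
  return stemWord(token.toUpperCase()).toLowerCase()
}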
Related
I'm sort of building an AI for a Telegram Bot, and currently I'm trying to process the text and respond to the user almost like a human does.
For example;
"I want to register"
As a human we understand that the user wants to register.
So I'd process this text using JavaScript's indexOf to look for want and register:
var user_text = message.text;
if (user_text.indexOf('want') >= 0) {
  if (user_text.indexOf('register') >= 0) {
    console.log('He wants to register?')
  }
}
But what if the text contains "not" somewhere in the string? Of course I'd have a zillion conditions for a zillion cases. It'd be tiring to write this kind of logic.
My question is: is there any other elegant way to do this? I don't really know the keyword to Google this...
The concept you're looking for is natural language processing and is a very broad field. Full NLP is very intricate and complicated, with all kinds of issues.
I would suggest starting with a much simpler solution, by splitting your input into words. You can do that using the String.prototype.split method with some tweaks. Filter out tokens you don't care about and that don't contribute to the command, like "the", "a", "an". Take the remaining tokens, look for negation ("not", "don't") and keywords. You may need to combine adjacent tokens if you have some two-word commands.
That could look something like:
var user_text = message.text;
var tokens = user_text.split(' '); // split on spaces, very simple "word boundary"
tokens = tokens.map(function (token) {
  return token.toLowerCase();
});

var remove = ['the', 'a', 'an'];
tokens = tokens.filter(function (token) {
  return remove.indexOf(token) === -1; // if remove array does *not* contain token
});

if (tokens.indexOf('register') !== -1) {
  // User wants to register
} else if (tokens.indexOf('enable') !== -1) {
  if (tokens.indexOf('not') !== -1) {
    // User does not want to enable
  } else {
    // User does want to enable
  }
}
This is not a full solution: you will eventually want to run the string through a real tokenizer and potentially even a full parser, and may want to employ a rule engine to simplify the logic.
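A rule engine can start out as nothing more than a table of keyword rules checked in order. A hypothetical sketch (the intents and keywords are made up for the example):

// hypothetical rule table: each rule maps known keywords to an intent
var rules = [
  { intent: 'register', keywords: ['register', 'signup'] },
  { intent: 'enable',   keywords: ['enable', 'activate'] }
];

function matchIntent(tokens) {
  var negated = tokens.indexOf('not') !== -1;
  for (var i = 0; i < rules.length; i++) {
    var hit = rules[i].keywords.some(function (kw) {
      return tokens.indexOf(kw) !== -1;
    });
    if (hit) return { intent: rules[i].intent, negated: negated };
  }
  return null; // no rule matched; ask the user to rephrase
}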
If you can restrict the inputs you need to understand (a limited number of sentence forms and nouns/verbs), you can probably just use a simple parser with a few rules to handle most commands. Enforcing a predictable sentence structure with articles removed will make your life much easier.
You could also take the example above and replace the filter with a whitelist (only include words that are known). That would leave you with a small set of known tokens, but introduces the potential to strip useful words and misinterpret the command, so you should confirm with the user before running anything.
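That whitelist variant only changes the filter step from the example above; a sketch:

// keep only words we know about, instead of removing known noise words
var known = ['register', 'enable', 'not', 'stop'];
tokens = tokens.filter(function (token) {
  return known.indexOf(token) !== -1;
});
// tokens now contains only recognized words; confirm the interpreted
// command with the user before acting on it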
If you really want to parse and understand sentences expressed in natural language, you should look into the topic of natural language processing. This is usually done with some kind of neural network trained to "understand" different variations of sentences (aka machine learning), because specifying all of the different syntactic and semantic rules of the language appears to be an overwhelming task.
If, however, the number of variations of these sentences is limited, then you could specify some rules in the form of commonly used word combinations; in the simplest case even regular expressions would do.
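In that simplest case the rules could literally be a list of regular expressions tried in order; a sketch with made-up patterns:

// illustrative patterns only; roughly one regex per sentence form you expect
var patterns = [
  { intent: 'register', re: /\b(?:want|need|like) to register\b/i },
  { intent: 'stop',     re: /\b(?:stop|unsubscribe|cancel)\b/i }
];

function detect(text) {
  for (var i = 0; i < patterns.length; i++) {
    if (patterns[i].re.test(text)) return patterns[i].intent;
  }
  return null;
}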
I'm trying to implement an asymmetrical search for a dictionary web app, so searching for ü, for example, will return only tokens that actually contain ü, but searching for u will return both u and ü. (This is so users who don't know how to type special characters can still search for them, but users who do know how to type them won't be inundated with the plain character forms unnecessarily.)
It has to all be client-side JavaScript without any external libraries.
I've managed to make the second search type work by running both the search term and the text I'm searching through the following function, effectively merging special characters with their plain counterparts:
function cleanUp(dirty) {
  var cleaned = dirty.replace(/[áàâãäāă]/ig, "a");
  cleaned = cleaned.replace(/đ/ig, "d");
  cleaned = cleaned.replace(/[éèêẽëēĕ]/ig, "e");
  cleaned = cleaned.replace(/[íìîĩïīĭ]/ig, "i");
  cleaned = cleaned.replace(/ñ/ig, "n");
  cleaned = cleaned.replace(/[óòôõöōŏ]/ig, "o");
  cleaned = cleaned.replace(/[úùûũüūŭ]/ig, "u");
  return cleaned;
}
I then compare the strings to get my results with something like:
var search_term = cleanUp(search_input.value);
var text_to_search = cleanUp(main_text);

if (text_to_search.indexOf(search_term) > -1) ... // do something
It's not elegant, but it works. After cleaning up both strings the user can search for e.g. uber and get über even if they don't know how to type ü. But if they do know how, searching for über directly also returns things like uber, which is what I don't want.
I've already thought of things like checking for each special character separately for each search term or duplicating every dictionary entry that has a special character to produce a special-character and a plain-character version, but all of my ideas would seriously slow down the processing time for the search.
Any ideas are greatly appreciated.
The answer you posted sounds quite reasonable.
I would just like to suggest a cleaner way (pun intended) to code your cleanUp() function and similar functions that do a series of string operations:
function cleanUp(dirty) {
  return dirty
    .replace(/[áàâãäāă]/ig, "a")
    .replace(/đ/ig, "d")
    .replace(/[éèêẽëēĕ]/ig, "e")
    .replace(/[íìîĩïīĭ]/ig, "i")
    .replace(/ñ/ig, "n")
    .replace(/[óòôõöōŏ]/ig, "o")
    .replace(/[úùûũüūŭ]/ig, "u");
}
I ended up checking whether the search term contained any special characters; if it did, I didn't run it through cleanUp(), and compared it against the original dictionary entry instead of the cleaned one. Thanks for the comments, everyone.
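For reference, a sketch of that check; the character class should mirror whatever characters your cleanUp() folds:

// does the search term contain any of the characters cleanUp() folds?
var special = /[áàâãäāăđéèêẽëēĕíìîĩïīĭñóòôõöōŏúùûũüūŭ]/i;

var term = search_input.value;
var found;
if (special.test(term)) {
  // exact search: compare against the original text, diacritics intact
  found = main_text.indexOf(term) > -1;
} else {
  // loose search: fold both sides so 'u' also finds 'ü'
  found = cleanUp(main_text).indexOf(cleanUp(term)) > -1;
}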
I'm the author of doctest, quick and dirty doctests for JavaScript and CoffeeScript. I'd like to make the library less dirty by using a JavaScript parser rather than regular expressions to locate comments.
I'd like to use Esprima or Acorn to do the following:
Create an AST
Walk the tree, and for each comment node:
Create an AST from the comment node's text
Replace the comment node in the main tree with this subtree
Input:
!function() {
  // > toUsername("Jesper Nøhr")
  // "jespernhr"
  var toUsername = function(text) {
    return ('' + text).replace(/\W/g, '').toLowerCase()
  }
}()
Output:
!function() {
  doctest.input(function() {
    return toUsername("Jesper Nøhr")
  });
  doctest.output(4, function() {
    return "jespernhr"
  });
  var toUsername = function(text) {
    return ('' + text).replace(/\W/g, '').toLowerCase()
  }
}()
I don't know how to do this. Acorn provides a walker which takes a node type and a function, and walks the tree invoking the function each time a node of the specified type is encountered. This seems promising, but doesn't apply to comments.
With Esprima I can use esprima.parse(input, {comment: true, loc: true}).comments to get the comments, but I'm not sure how to update the tree.
Most AST-producing parsers throw away comments. I don't know what Esprima or Acorn do, but that might be the issue.
.... in fact, Esprima lists comment capture as a current bug:
http://code.google.com/p/esprima/issues/detail?id=197
... Acorn's code is right there in GitHub. It appears to throw comments away, too.
So, it looks like you get to fix either parser to capture the comments first, at which point your task should be straightforward, or you're stuck.
Our DMS Software Reengineering Toolkit has JavaScript parsers that capture comments in the tree. It also has language substring parsers that could be used to parse the comment text into JavaScript ASTs of whatever type the comment represents (e.g., function declaration, expression, variable declaration, ...), and the support machinery to graft such new ASTs into the main tree. If you are going to manipulate ASTs, this substring capability is likely important: most parsers won't parse arbitrary language fragments; they are wired only to parse "whole programs". For DMS, there are no comment nodes to replace; there are comments associated with AST nodes, so the grafting process is a little trickier than just "replace comment nodes". Still pretty easy.
I'll observe that most parsers (including these) read the source and break it into tokens by applying the equivalent of regular expressions. So, if you are already using these to locate comments (that means using them to locate *non*comments to throw away as well, e.g., you need to recognize string literals that contain comment-like text and ignore them), you are doing as well as the parsers would do anyway in terms of finding the comments. And if all you want to do is replace them exactly with their content, echoing the source stream with the comment prefix/suffix /* */ stripped will do apparently exactly what you want, so all this parsing machinery seems like overkill.
You can already use Esprima to achieve what you want:
Parse the code, get the comments (as an array).
Iterate over the comments, see if each is what you are interested in.
If you need to transform the comment, note its range. Collect all transformations.
Apply the transformations last-to-first so that the ranges are not shifted.
The trick here is not to change the AST. Simply apply the text change as if you were doing a typical search-and-replace on the source string directly. Because the position of the replacement might shift, you need to collect everything and then apply it starting from the last one. For an example of how to carry out such a transformation, take a look at my blog post "From double-quotes to single-quotes" (it deals with string quotes but the principle remains the same).
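A sketch of those steps, assuming Esprima's comment and range options; isDoctest() and transformComment() are hypothetical helpers that recognize a doctest comment and produce its replacement source:

var esprima = require('esprima');
var source = '...'; // the program text to rewrite

// parse with ranges so each comment knows its position in the source
var comments = esprima.parse(source, { comment: true, range: true }).comments;

// collect the edits we want to make
var edits = comments
  .filter(function (c) { return isDoctest(c.value); })
  .map(function (c) {
    return { range: c.range, text: transformComment(c) };
  });

// apply last-to-first so earlier ranges are not shifted by the rewrites
edits.reverse().forEach(function (e) {
  source = source.slice(0, e.range[0]) + e.text + source.slice(e.range[1]);
});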
Last but not least, you might want to use a slightly higher-level utility such as Rocambole.
I have a bbcode -> html converter that responds to the change event in a textarea. Currently, this is done using a series of regular expressions, and there are a number of pathological cases. I've always wanted to sharpen the pencil on this grammar, but didn't want to get into yak shaving. But... recently I became aware of pegjs, which seems a pretty complete implementation of PEG parser generation. I have most of the grammar specified, but am now left wondering whether this is an appropriate use of a full-blown parser.
My specific questions are:
As my application relies on translating what I can to HTML and leaving the rest as raw text, does implementing bbcode using a parser that can fail on a syntax error make sense? For example: [url=/foo/bar]click me![/url] would certainly be expected to succeed once the closing bracket on the close tag is entered. But what would the user see in the meantime? With regex, I can just ignore non-matching stuff and treat it as normal text for preview purposes. With a formal grammar, I don't know whether this is possible because I am relying on creating the HTML from a parse tree and what fails a parse is ... what?
I am unclear where the transformations should be done. In a formal lex/yacc-based parser, I would have header files and symbols that denoted the node type. In pegjs, I get nested arrays with the node text. I can emit the translated code as an action of the pegjs generated parser, but it seems like a code smell to combine a parser and an emitter. However, if I call PEG.parse.parse(), I get back something like this:
[
  [
    "[",
    "img",
    "",
    [
      "/",
      "f",
      "o",
      "o",
      "/",
      "b",
      "a",
      "r"
    ],
    "",
    "]"
  ],
  [
    "[/",
    "img",
    "]"
  ]
]
given a grammar like:
document
  = (open_tag / close_tag / new_line / text)*

open_tag
  = ("[" tag_name "="? tag_data? tag_attributes? "]")

close_tag
  = ("[/" tag_name "]")

text
  = non_tag+

non_tag
  = [^\n\[\]]

new_line
  = ("\r\n" / "\n")
I'm abbreviating the grammar, of course, but you get the idea. So, if you notice, there is no contextual information in the array of arrays that tells me what kind of node I have, and I'm left to do the string comparisons again even though the parser has already done this. I expect it's possible to define callbacks and use actions to run them during a parse, but there is scant information available on the Web about how one might do that.
Am I barking up the wrong tree? Should I fall back to regex scanning and forget about parsing?
Thanks
First question (grammar for incomplete texts):
You can add
incomplete_tag = ("[" tag_name "="? tag_data? tag_attributes?)
// the closing bracket is omitted ---^
after open_tag and change document to include an incomplete tag at the end. The trick is that you provide the parser with all the productions it needs to always succeed, but the valid ones come first. You can then ignore incomplete_tag during the live preview.
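So the document rule might end up looking something like this (a sketch using the names from your abbreviated grammar):

document
  = (open_tag / close_tag / new_line / text)* incomplete_tag?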
Second question (how to include actions):
You write so-called actions after expressions. An action is JavaScript code enclosed by braces and is allowed after a pegjs expression, i.e. also in the middle of a production!
In practice, actions like { return result.join("") } are almost always necessary because pegjs splits the input into single characters. Complicated nested arrays can also be returned. Therefore I usually write helper functions in the pegjs initializer at the head of the grammar to keep the actions small. If you choose the function names carefully, the actions are self-documenting.
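A small sketch of that pattern: a helper defined once in the initializer block at the top of the grammar, then reused by the actions:

{
  // initializer: plain JavaScript, visible to all actions below
  function joined(chars) { return chars.join(""); }
}

text
  = chars:non_tag+ { return joined(chars); }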
For an example, see PEG for Python style indentation. Disclaimer: this is an answer of mine.
Regarding your first question, I have to say that a live preview is going to be difficult. The problems you pointed out, namely that the parser won't understand that the input is "work in progress", are correct. Peg.js tells you at which point the error is, so maybe you could take that info and go a few words back and parse again, or if an end tag is missing, try adding it at the end.
The second part of your question is easier, but your grammar won't look so nice afterwards. Basically, what you do is put callbacks on every rule, for example:
text
  = text:non_tag+ {
      // we captured the text in an array and can manipulate it now
      return text.join("");
    }
At the moment you have to write these callbacks inline in your grammar. I'm doing a lot of this stuff at work right now, so I might make a pull request to peg.js to fix that. But I'm not sure when I'll find the time to do this.
Try something like this replacement rule. You're on the right track; you just have to tell it to assemble the results.
text
  = result:non_tag+ { return result.join(''); }
I'm writing a JavaScript plugin, launched on every page load, that replaces every matching structure with a link. That link redirects to a web application/database, a resource for coders of the Mount&Blade game.
In theory it's easy, but I've found a huge obstacle on my way to success: regular expressions.
Even helped by a program named QuickRegex I can't get the structure to match, or if I don't do proper conditioning it outputs wrong results. The structure to match is as follows:
(item_set_slot, "itm_heavy_crossbow", slot_item_multiplayer_item_class),
I want to pick item_set_slot and turn it into a link to http://mbcommands.ollclan.eu/#$1
This is the code I'm using, which works, more or less. ;)
/* Mount&Blade Command Database Linking by Swyter */
function swymbcommandshooker() {
  /* Regular HTML expressions */
  document.getElementsByTagName("body")[0].innerHTML = document.getElementsByTagName("body")[0].innerHTML.replace(
    /\(([a-zA-Z_]+),/gi,
    "(<a href='http://mbcommands.ollclan.eu/#$1' title='[?] Take a look in the Command Database' target='_blank'>$1</a>,"
  );

  /* Python highlighter support... */
  document.getElementsByTagName("body")[0].innerHTML = document.getElementsByTagName("body")[0].innerHTML.replace(
    /\(<\/span>([_a-z]+),/gi,
    "(</span><a href='http://mbcommands.ollclan.eu/#$1' title='[?] Take a look in the Command Database' target='_blank'>$1</a>,"
  );
}

addOnloadHook(swymbcommandshooker);
Thanks in advance.
Hm, I'm not sure if I understand you correctly, but if you really just want to match "item_set_slot" in "(item_set_slot, "itm_heavy_crossbow", slot_item_multiplayer_item_class)," the following regex should do:
/^\(([a-z_]+),/i
The JavaScript to generate the URL could look like this:
var tuple = '(item_set_slot, "itm_heavy_crossbow", slot_item_multiplayer_item_class),';
var url = tuple.replace(/^\(([a-z_]+),.*/i, 'http://mbcommands.ollclan.eu/#$1');
Note the appended .* in the regex, which is needed to match the rest of the tuple.