I am developing a local web application with jQuery/JavaScript.
My goal is to create a search engine that searches content in a JSON file. I already built one with regex, but it is slow.
What is the best way? Is there a JavaScript search engine?
Try lunr.js, which supports full-text search in JavaScript.
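A minimal sketch of how an index might be built and queried (the fields and documents here are placeholders, not taken from your JSON):
// Minimal lunr.js (v2-style) sketch: index the JSON records, then query them.
var documents = [
    { id: 1, title: "First entry", body: "Some searchable text" },
    { id: 2, title: "Second entry", body: "More searchable text" }
];

var idx = lunr(function () {
    this.ref("id");       // unique identifier for each record
    this.field("title");  // fields to index
    this.field("body");
    documents.forEach(function (doc) {
        this.add(doc);    // add each JSON record to the index
    }, this);
});

var results = idx.search("searchable"); // returns matching refs with relevance scores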
The term "search engine" normally means that a large set of data is indexed (a resource intensive task). Searching the data set after indexing is then quick. If the data set is very large, it is more likely that indexing and search will be performed on the server (and only the search results are then returned to the browser).
If you just need to search fields in a JSON file that is small or medium in size, then consider JavaScript "search algorithms" rather than search engines.
Fuse.js is a lightweight fuzzy-search library in JavaScript with zero dependencies. It has more GitHub stars than Lunr, and has ongoing (and recent) commits as of August 2022.
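A small sketch of typical usage (the list and keys are placeholders):
// Small Fuse.js sketch: fuzzy-search an array of records; the keys are placeholders.
var list = [
    { title: "Old Man's War", author: "John Scalzi" },
    { title: "The Lock Artist", author: "Steve Hamilton" }
];

var fuse = new Fuse(list, {
    keys: ["title", "author"], // properties to search
    threshold: 0.4             // lower = stricter matching
});

var results = fuse.search("lock"); // matching entries (result shape varies by Fuse version)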
I don't think such a thing exists; you'll have to develop it yourself.
Put your application content in a String object and develop a search function with indexOf (to find the target string) and substring (to extract the surrounding fragment).
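A rough sketch of that idea (the content variable stands in for whatever text you load from the JSON file):
// Rough sketch: find a term with indexOf and extract a fragment with substring.
function findSnippet(content, term, contextLength) {
    var pos = content.indexOf(term);      // locate the target string
    if (pos === -1) { return null; }      // no match
    var start = Math.max(0, pos - contextLength);
    var end = Math.min(content.length, pos + term.length + contextLength);
    return content.substring(start, end); // extract the surrounding fragment
}

// e.g. findSnippet(JSON.stringify(data), "keyword", 30);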
Have a look at fullproof, a basic search engine for use in the browser: http://reyesr.github.com/fullproof/
There may be others though.
I think you have two options: Lunr, which was mentioned earlier, and search-index.
Elasticlunr is a lightweight full-text search engine in JavaScript for browser and offline search. Elasticlunr.js is developed on top of Lunr.js, but is more flexible.
Elasticlunr.js provides query-time boosting and field search. It is a bit like Solr, but much smaller and less feature-rich, while still providing flexible configuration and query-time boosting.
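A small sketch of field search with query-time boosting (the field names and documents are placeholders):
// elasticlunr sketch: field search with query-time boosting.
var index = elasticlunr(function () {
    this.setRef("id");
    this.addField("title");
    this.addField("body");
});

index.addDoc({ id: 1, title: "First record", body: "some text" });
index.addDoc({ id: 2, title: "Second record", body: "more text" });

var results = index.search("record", {
    fields: {
        title: { boost: 2 }, // matches in the title count double
        body:  { boost: 1 }
    }
});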
I am building an add-in for Office Word 2016 using the Word JavaScript API. As it does not provide the level of control over the document that I require, I am trying to accomplish this by directly changing the OOXML of the document. Since the user can have a document with any number of pages, I am not sure if this is the right way to do it. I want to know if there is any way to simplify this, like extracting only parts of the document and inserting them back.
Great question. For starters, I am curious to know what level of control you are expecting in the API; I wonder if you can share more details on potential gaps (thanks in advance!).
Now, to answer your question: absolutely! We open the door via OOXML to interact with the document. This is a very powerful tool, albeit potentially complicated (but it seems you are knowledgeable about WordML), and it can be slow, especially on platforms other than Win32 or Mac (OOXML injection in Word Online is very slow).
The key to accomplishing what you need is getting a range (I would need more detailed criteria on the "extracting parts of the document" you mentioned, but at the end of the day it's about getting a range). Once you have it, you can do a range.insertOoxml("your OOXML", "replace") to replace that range with whatever OOXML you have.
You can get a range in many different ways in the API. For instance, the search method returns a collection of ranges, and all objects have a .getRange() method you can use for this.
The following example replaces the first word of the first paragraph in the document with a given OOXML.
Word.run(function (ctx) {
    var myTempOOXML = "get some valid OOXML!"; // placeholder: use a valid OOXML string
    // Split the first paragraph on spaces, take the first word's range, and replace it.
    ctx.document.body.paragraphs.getFirst()
        .split([" "], false, false, false)
        .getFirst()
        .insertOoxml(myTempOOXML, "replace");
    return ctx.sync();
}).catch(function (e) { app.showNotification(e.message); });
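A similar sketch, not tested here, that gets a range via the search method instead (the search text and OOXML string are placeholders):
Word.run(function (ctx) {
    var myOoxml = "some valid OOXML"; // placeholder
    // search() returns a collection of ranges; take the first hit and replace it.
    var firstHit = ctx.document.body.search("target text", { matchCase: false }).getFirst();
    firstHit.insertOoxml(myOoxml, "replace");
    return ctx.sync();
}).catch(function (e) { app.showNotification(e.message); });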
Hope this sets you up in the right direction.
Btw, here is a useful article about OOXML and Word.js
I need to develop a natural language querying tool for a structured database. I tried two approaches:
using Python NLTK (Natural Language Toolkit for Python)
using JavaScript and JSON (for the data source)
In the first case I did some NLP steps to normalize the natural-language query: removing stop words, stemming, and finally mapping keywords using feature-based grammar mapping. This methodology works for simple scenarios.
Then I moved to the second approach: finding the data in the JSON, getting the corresponding column and table names, and then building an SQL query. For this one I also implemented stop-word removal and stemming in JavaScript.
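A simplified sketch of that second idea (the mapping, table, and column names are only illustrative):
// Simplified sketch of the keyword-to-SQL idea; the mapping and names are illustrative only.
var mapping = {
    city:    { table: "city_table", column: "City" },
    country: { table: "city_table", column: "Country" }
};

function buildQuery(keywords) {
    for (var i = 0; i < keywords.length; i++) {
        var entry = mapping[keywords[i]]; // stemmed keyword -> known column
        if (entry) {
            return "SELECT " + entry.column + " FROM " + entry.table;
        }
    }
    return null; // no known column matched
}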
Both of these techniques have limitations. I want to implement a semantic search approach.
Can anyone suggest a better approach to do this?
Semantic parsing for NLIDB (natural language interface to databases) is a well-developed domain with many techniques: rule-based methods (involving grammars) and machine-learning techniques. They cover a large range of query inputs and offer far better results than pure NL preprocessing or regex methods.
The technique I favor is based on feature-based context-free grammars (FCFG). For starters, in the NLTK book available online, look for the string "sql0.fcfg". The code example shows how to map the NL phrase-structure query "What cities are located in China" into the SQL query SELECT City FROM city_table WHERE Country="china" via the "SEM" feature (the semantics) of the FCFG.
I recommend Covington's books:
NLP for Prolog Programmers (1994)
Prolog Programming in Depth (1997)
They will help you go a long way. These PDFs are downloadable from his site.
As I commented, I think you should add some code, since not everyone has read the book.
Anyway, my conclusion is that yes, as you said, it has a lot of limitations, and the only way to achieve more complex queries is to write very extensive and complete grammar productions, which is pretty hard work.
I’m trying to show a list of lunch venues around the office with today’s menus. But the problem is that the websites offering the lunch menus don’t always offer the same kind of content.
For instance, some of the websites offer a nice JSON output. Look at this one: it offers the English/Finnish course names separately, and everything I need is available. There are a couple of others like this.
But others don’t always have nice output, like this one. The content is laid out in plain HTML, the English and Finnish food names are not clearly ordered, and food properties like (L, VL, VS, G, etc.) are just normal text, like the food name.
What, in your opinion, is the best way to scrape all this available data in different formats and turn it into usable data? I tried to make a scraper with Node.js (plus PhantomJS, etc.), but it only works with one website, and it’s not that accurate when it comes to the food names.
Thanks in advance.
You could use something like kimonolabs.com; it is much easier to use and gives you APIs to update your side.
Remember that it works best with tabular data content.
There may be simple algorithmic solutions to the problem. If there is a list of all available food names, that can be really helpful: you just find the occurrences of each food name inside the document (for today).
If there is no food list, you could use TF/IDF. TF/IDF calculates the score of a word in a document relative to the current document and the other documents in the collection. But this solution needs enough data to work.
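A rough sketch of TF/IDF scoring (placeholder documents, no tuning):
// Rough TF/IDF sketch: score a word within one document relative to a collection.
function termFrequency(term, doc) {
    var words = doc.toLowerCase().split(/\s+/);
    var count = words.filter(function (w) { return w === term; }).length;
    return count / words.length;
}

function inverseDocumentFrequency(term, docs) {
    var containing = docs.filter(function (d) {
        return d.toLowerCase().split(/\s+/).indexOf(term) !== -1;
    }).length;
    return Math.log(docs.length / (1 + containing));
}

function tfIdf(term, doc, docs) {
    return termFrequency(term, doc) * inverseDocumentFrequency(term, docs);
}

// Words with a high TF/IDF score in today's menu page are candidate dish names.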
I think the best solution is something like this (a rough sketch follows the list):
Creating a list of all available websites that should be scraped.
Writing a driver class for each website's data.
Each driver is responsible for creating the general domain entity from its site's document.
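A rough sketch of those driver classes (the site names and fields are placeholders):
// Rough sketch of the driver idea; site names and fields are placeholders.
// Each driver turns one site's raw output into the same normalized menu entity.
var drivers = {
    "json-site": function (rawJson) {
        var data = JSON.parse(rawJson);
        return data.courses.map(function (c) {
            return { nameEn: c.title_en, nameFi: c.title_fi, properties: c.properties };
        });
    },
    "html-site": function (rawHtml) {
        // parse the HTML here (e.g. with a DOM parser) and build the same shape
        return [];
    }
};

function getMenu(site, rawContent) {
    return drivers[site](rawContent); // every driver returns the common entity
}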
If you can use PHP, Simple HTML DOM Parser along with Guzzle would be a great choice. These two provide a jQuery-like way of finding elements and a nice wrapper around HTTP.
You are touching on a really difficult problem. Unfortunately, there are no easy solutions.
Actually there are two different parts to solve:
data scraping from different sources
data integration
Let's start with the first problem: data scraping from different sources. In my projects I usually process data in several steps. I have dedicated scrapers for all the specific sites I want, and process them in the following order:
fetch raw page (unstructured data)
extract data from page (unstructured data)
extract, convert and map data into page-specific model (fully structured data)
map data from fully structured model to common/normalized model
Steps 1-2 are scraping oriented and steps 3-4 are strictly data-extraction / data-integration oriented.
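For illustration, steps 1-2 for a single site might look like this in Node.js (the URL, the selectors, and the choice of cheerio are assumptions, not something from the question):
// Illustrative Node.js sketch of steps 1-2 for one site; URL and selectors are placeholders.
var request = require("request");
var cheerio = require("cheerio");

request("http://example.com/todays-menu", function (err, res, body) {
    if (err) { return console.error(err); }

    // Step 1: the raw page (unstructured data) is now in `body`.
    var $ = cheerio.load(body);

    // Step 2: extract the interesting fragments from the page.
    var rawCourses = [];
    $(".menu-item").each(function () {
        rawCourses.push($(this).text().trim());
    });

    // Steps 3-4 (not shown): convert rawCourses into a page-specific model,
    // then map that model to the common/normalized model.
    console.log(rawCourses);
});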
While you can implement steps 1-2 relatively easily using your own web scrapers or by utilizing existing web services, data integration is the most difficult part in your case. You will probably require some machine-learning techniques (shallow, domain-specific Natural Language Processing) along with custom heuristics.
In the case of messy input like this one, I would process lines separately and use a dictionary to filter out the Finnish/English words and analyse what is left. But in this case it will never be 100% accurate, due to the possibility of human-input errors.
I am also worried that your stack is not very well suited to such tasks. For such processing I use Java/Groovy along with integration frameworks (Mule ESB / Spring Integration) to coordinate the data processing.
In summary: it is a really difficult and complex problem. I would rather accept less input-data coverage than aim to be 100% accurate (unless it is really worth it).
I have a problem I'd like to solve so that I don't have to spend a lot of manual work analyzing it as the alternative.
I have two JSON objects (returned from different web service API/HTTP responses). There is intersecting data between the two JSON objects, and they share a similar but not identical JSON structure. One JSON object (the smaller one) is like a subset of the bigger one.
I want to find all the intersecting data between the two objects. Actually, I'm more interested in the shared parameters/properties within the objects than in the actual values of each object's parameters/properties, because I eventually want to use data from one JSON output to construct the other JSON as input to an API call. Unfortunately, I don't have the documentation that defines the JSON for each API. :(
What makes this tougher is that the JSON objects are huge. One spans a page if you print it out via Windows Notepad; the other spans 37 pages. The APIs return the JSON output compressed onto a single line. A normal text compare doesn't do much; I'd have to reformat manually or with a script to break up the objects with newlines, etc., for a text compare to work well. I tried that with the Beyond Compare tool.
I could do a manual search/grep, but it's a pain to cycle through all the parameters inside the smaller JSON. I could write code to do it, but I'd also have to spend time writing it and testing that it works. Or maybe there's some ready-made code for that already...
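Something like this purely illustrative sketch is roughly what I'd have to write (collect property paths recursively and intersect them; smallJsonString and bigJsonString are placeholders):
// Illustrative sketch: collect every property path in each object, then intersect them.
function collectPaths(obj, prefix, out) {
    prefix = prefix || "";
    out = out || [];
    for (var key in obj) {
        if (!obj.hasOwnProperty(key)) { continue; }
        var path = prefix ? prefix + "." + key : key;
        out.push(path);
        if (obj[key] !== null && typeof obj[key] === "object") {
            collectPaths(obj[key], path, out); // recurse into nested objects/arrays
        }
    }
    return out;
}

var smallPaths = collectPaths(JSON.parse(smallJsonString)); // placeholder input
var bigPaths = collectPaths(JSON.parse(bigJsonString));     // placeholder input

// Shared parameters/properties: paths present in both objects.
var shared = smallPaths.filter(function (p) { return bigPaths.indexOf(p) !== -1; });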
Or I can look for JSON diff-type tools. I searched for some and came across these:
https://github.com/samsonjs/json-diff or https://tlrobinson.net/projects/javascript-fun/jsondiff
https://github.com/andreyvit/json-diff
Both failed to do what I wanted. Presumably the JSON is either too complex or too large for them to process.
Any thoughts on the best solution? Or might the best solution for now be manual analysis with grep for each parameter/property?
In terms of a code solution, any language will do. I just need a parser or diff tool that will do what I want.
Sorry, I can't share the JSON data structure with you either; it may be considered confidential.
Beyond Compare works well if you set up a JSON file format in it that uses Python to pretty-print the JSON. Sample setup for Windows:
Install Python 2.7.
In Beyond Compare, go under Tools, under File Formats.
Click New. Choose Text Format. Enter "JSON" as a name.
Under the General tab:
Mask: *.json
Under the Conversion tab:
Conversion: External program (Unicode filenames)
Loading: c:\Python27\python.exe -m json.tool %s %t
Note that the second parameter in the command line must be %t; if you enter %s twice, you will suffer data loss.
Click Save.
Jeremy Simmons has created a better file-format package for Beyond Compare, "JsonFileFormat.bcpkg", posted on the forum, which does not require Python or anything else to be installed.
Just download the file and open it with BC and you are good to go. So it's much simpler.
JSON File Format
I needed a file format for JSON files.
I wanted to pretty-print & sort my JSON to make comparison easy.
I have attached my bcpackage with my completed JSON File Format.
The formatting is done via jq - http://stedolan.github.io/jq/
Props to Stephen Dolan for the utility: https://github.com/stedolan.
I have sent a message to the folks at Scooter Software asking them to include it in the page with additional formats.
If you're interested in seeing it on there, I'm sure a quick reply to the thread with an up-vote would help them see the value of posting it.
I have a small GPL project that would do the trick for simple JSON. I have not added support for nested entities, as it is more of a simple ObjectDB solution and not actually JSON (despite the fact that it was clearly inspired by it).
Long story short, the API is pretty simple: make a new group, populate it, and then pull a subset via whatever logical parameters you need.
https://github.com/danielbchapman/groups
The API is used basically like this:
SubGroup items = group
    .notEqual("field", "value")
    .lessThan("field2", 50); // ...etc...
There's actually support for basic unions and joins which would do pretty much what you want.
Long story short, you probably want a Set as your data type. Considering your comparisons are probably complex, you need a more complex set of methods.
My only caution is that it is GPL. If your data is confidential, odds are you may not be interested in that license.
I have a grammar for a domain-specific language, and I need to create a JavaScript code editor for that language. Are there any tools that would allow me to generate
a) a JavaScript incremental parser
b) a JavaScript auto-complete / auto-suggest engine?
Thanks!
An example of implementing content assist (auto-complete) using the Chevrotain JavaScript parsing DSL:
https://github.com/SAP/chevrotain/tree/master/examples/parser/content_assist
Chevrotain was designed specifically for building parsers used (as part of) language-service tools in editors/IDEs.
Some of the relevant features are:
Automatic error recovery / fault tolerance, because editors and IDEs need to be able to handle "mostly valid" inputs.
Every grammar rule may be used as the starting rule, because an editor/IDE may want to implement incremental parsing for performance reasons.
You may want jison, a JS parser generator. In terms of auto-complete / auto-suggest, most of the stuff out there that I know of is based on word completion rather than code completion. But once you have a parser running, I don't think that part is too difficult.
This is difficult. I'm doing the same sort of thing myself.
One approach is:
You need a parser which will give you an array of the currently possible ASTs for the text up to the token before the current cursor position.
From there you can see that the next token can be one of a number of types (usually just one), and do the completion based on the partial text.
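A rough sketch of that idea (the parseUpTo function and the token names are invented for illustration):
// Rough sketch; parseUpTo() and the token names are invented for illustration.
// Assume parseUpTo(text) returns the token types that could legally come next.
function suggestCompletions(text, cursorPos, wordsByTokenType) {
    var before = text.substring(0, cursorPos);
    var partial = (before.match(/\w*$/) || [""])[0];          // the word being typed
    var context = before.substring(0, before.length - partial.length);

    var expectedTypes = parseUpTo(context);                    // e.g. ["Keyword", "Identifier"]

    var candidates = [];
    expectedTypes.forEach(function (type) {
        (wordsByTokenType[type] || []).forEach(function (word) {
            if (word.indexOf(partial) === 0) {                 // filter by the partial text
                candidates.push(word);
            }
        });
    });
    return candidates;
}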
If I ever get my incremental parser working, I'll send a link.
Good luck, and let me know if you find a package which does this.
Chris.