Library for generating natural language (verbs) in javascript? - javascript

Is there a good js library or function for creating natural phrases/sentences?
specifically, I have random things like "bird", "pencil" "football player" etc
and I want to be able to construct a sentence that would fit with the noun ... "johnny has 7 pencils"
"johnny has seen 3 birds"
"suzy knows 10 more football players"
the goal is to randomly generate sentences that can be modeled with algebraic expressions. It's easy enough except for generating a natural verb and getting the correct tense.
I looked briefly into natural language processing, but (at least on the surface) it mostly it looks like it goes the other way.
can you suggest a library, or if not, perhaps suggest an outline for an algorithm I could create?
Thank you!

It is very complicated, some even say impossible and I think that is the reason your question got down-voted. But your specific problem has-to some extend-a solution if you are OK with the restriction to not-overly-complicated sentences.
The basic English grammar is quite simple: subject, predicate, and object. So with a list of nouns and verbs you are already able to build grammatically correct sentences. The most difficult thing is to get the lists or build them yourself (e.g.: from some dictionary like the public domain 1913 version of Webster's) but the Internet offers several such lists.
If you have more lists (adjectives, adverbs, etc.) you can build more complex sentences. There are also lists with irregular verbs, uncommon plurals or nouns etc.
To makes things simpler I would not look for arbitrary sentences but build a bunch of them myself and allow some random generator to fill in the words from the correct list. With the simplest form SPO:
simpelSentence = randNoun() + randVerb() + randObject();
To some more complex form:
notSoSimpleSentence = randNoun() + randVerb() + randAdjective() + randObject();
Build more templates that way and if you have enough: start filling them, check the output and be disappointed. It does not work very good, you need to implement more rules like e.g.: he has but they have etc. and rest assured it needs many such rules even for the simplest sentences.
There are several scripts out there that can "write" scientific papers for you. The first hit from a Google search is SCIgen although it is written in Perl. These programs are known as "paper generators" and Lo! and Behold-they have a wikipage. If you follow that page one step higher you'll find the Category Natural Language Generation with some more information. This paragraph has some sentences…my…really hard to construct!
If you still want to do it: make lists with n-grams. Or use Google's n-gram lists (huge page with a lot of links to n-gram lists auto-generated by Google although of high quality) but be careful, these lists are huge. No, really, they are huge! Which means that you cannot just wrap them in an Array and use them directly. A megabyte or two are probably acceptable today (text-files compress well) but more than 100 gigabytes? So you have to dust off your word-washing pan and get the needed nuggets.
And after all of that hassle: how to you teach these sentences to make sense? How to avoid to put the bald man's nicely combed maroon hair up in rollers?
Nope, this problem had an overdose of a phosphodiesterase type 5 inhibitor (a non-psychoactive piperazine), sorry. An injection of methylene blue directly into the problem had been considered but assumed to only leave a monochromatic mess.
But serious: anything above some very simple sentences filled from some short lists with a handful of rules is out of reach for a little ECMA-script script written over a weekend or two.

Related

How to find all unique words in a 10 GB file or more & enable search, using JavaScript?

The question is to implement a web service that can read a 10GB file and store all distinct words & their occurrences. The requirements needs to be solved in O(n) or better complexity. The next part of the question is to write all client side code to allow search based on keypress.
How do I approach this problem? What would you suggest, are the main sub-headings?Do we need to use some sort of in-memory caching? Can 1 computer handle searching 10GB of data? Is there an approximation I should consider for distinct words based on Language (For example, in Cracking the coding interview I read there are about 600,000 distinct words in a language). How do I handle scalability of a system built this way? I really need help structuring my thoughts! Thanks in advance!
You shouldn't be using JavaScript for this. Pretty much any language will have better performance.
But, setting that aside, let's answer the question. What you'll want to do is create a Set and iterate through all words. Given the size of the data, you'll probably want to split it into chunks beforehand or at read time.
Just adding the key to the Set every time will suffice, as set only contains unique elements.
Alternatively, if you have 10+GB of RAM, just put the whole thing into an array and cast it to a set. Then you'll be able to read the unique values. It'll take quite a while, though.

How to make a mulitlingual website?

I am an amateur programmer and I want to make a mulitlingual website. My questions are: (For my purposes let the English website be website nr 1 and the Polish nr 2)
Should it be en.example.com and pl.example.com or maybe example.com/en and example.com/pl?
Should I make the full website in one language and then translate it?
How to translate it? Using XML or what? Should website 1 and website 2 be different html files or is there a way to translate a html file and then show the translation using XML or something?
If You need any code or something tell me. Thank You in advance :)
1) I don't think it makes much difference. The most important thing is to ensure that Google can crawl both languages, so don't rely on JavaScript to switch between languages, have everything done server side so both languages can be crawled and ranked in Google.
2) You can do one translation then the other, you just have to ensure that the layout expands to accommodate more/less text without breaking it. Maybe use lorem ipsum whilst designing.
3) I would put the translations into a database and then call that particular translation depending on whether it is EN or PL in the domain name. Ensure that the webpage and database are UTF-8 encoding otherwise you will find that you get 'funny' characters being displayed.
My Advice is that you start to use any Framework.
For instance if you use CakePHP then you have to write
__('My name is')
and in translate file
msgid "My name is"
msgstr "Nazywam się"
Then you can easy translate to any other language and its pretty easy to implement.
Also if you do not want to use Framework you can check this link to see example how it works:
http://tympanus.net/codrops/2009/12/30/easy-php-site-translation/
While this question probably is not a good SO question due to its broad nature. It might be relevant to many users.
My approach would be templates.
Your suggestion of having two html files is bad for the obvious reason of duplication- say you need to change something in your site. You would always need to change two html files- bad.
Having one html file and then parsing it and translating it sounds like a massive headache.
Some templating framework could help you massively. I have been using Smarty, but that's a personal choice and there are many options here.
The idea is you make a template file for your html and instead of actual content you use labels. Then in your php code you include the correct language depending on cookies, user settings or session data.
Storing labels is another issue here. Storing them in a database is a good option, however, remember you do not wish to make 100's of queries against a database for fetching each label. What you can do is store them in a database and then have it generate a language file- an array of labels->translations for faster access and regenerate these files whenever you add/update labels.
Or you can skip the database altogether and just store them in files, however, as these grow they might not be as easy to maintain.
I think the easiest mistake for an "amateur programmer" to make in this area is to allow the two (or more) language versions of the site to diverge. (I've seen so-called professionals make that mistake too...) You need to design it so everything that's common between the two versions is shared, so that when the time comes to make changes, you only need to make the changes in one place. The way you do this will depend on your choice of tools, and I'm not going to start advising on that, because it depends on many factors, for example the extent to which your content is database-oriented.
Closely related to this are questions of who is doing the translation, how the technical developers and the translators work together, and how to keep track of which content needs to be re-translated when it changes. These are essentially process questions rather than coding questions, so not a good fit for SO.
Don't expect that a site for one language can be translated without any technical impact; you will find you have made all sorts of assumptions about the length of strings, the order of fields, about character coding and fonts, and about cultural things like postcodes, that turn out to be invalid when you try to convert the site to a different language.
You could make 2 different language files and use php constants to define the text you want to translate for example:
lang_pl.php:
define("_TEST", "polish translation");
lang_en.php:
define("_TEST", "English translation");
now you could make a choice for the user between polish or english translation and based on that you can include the language file.
So if you would have a text field you put its value to _TEST (inbetween php brackets).
and it would show the translation of the chosen option.
The place i worked was doing it like this: they didn't have to much writing on their site, so they were keeping it in a database. As your tags have a php , I assume you know how to use databases. They had a mysql table called languages with language id(in your case 1 for en and 2 for pl) and texts in columns. So the column names were main heading, intro_text, about_us... when the user comes it selects the language and a php request get the page with that language. This is easy because your content is static and can be translated before site gets online, if your content is dynamic(users add content) you may need to use a translation machine because you cannot expect your users to write their entry in all languages.

Natural Language Processing Database Querying

I need to develop natural language querying tool for a structured database. I tried two approaches.
using Python nltk (Natural Language Toolkit for python) using
Javascript and JSON (for data source)
In the first case I did some NLP steps to format the natural query by doing removing stop words, stemming, finally mapping keywords using featured grammar mapping. This methodology works for simple scenarios.
Then I moved to second approach. Finding the data in JSON and getting corresponding column name and table name , then building a sql query. For this one, I also implemented removing stop words, stemming using javascript.
Both of these techniques have limitations.I want to implement semantic search approach.
Please can anyone suggest me better approach to do this..
Semantic parsing for NLIDB (natural language interface to data bases) is a very evolved domain with many techniques: rule based methods (involving grammars) or machine learning techniques. They cover a large range of query inputs, and offer much more results than pure NL processing or regex methods.
The technique I favor is based on Feature based context-free grammars FCFG. For starters, in the NTLK book available online, look for the string "sql0.fcfg". The code example shows how to map the NL phrase structure query "What cities are located in China" into an SQL query "SELECT City FROM city_table WHERE Country="china" via the feature "SEM" or semantics of the FCFG.
I recommend Covington's books
NLP for Prloog Programmers (1994)
Prolog Programming in Depth (1997)
They will help You go a long way. These PDF's are downloadable from his site.
As I commented, I think you should add some code, since not everyone has read the book.
Anyway my conclusion is that yes, as you said it has a lot of limitations and the only way to achieve more complex queries is to write very extensive and complete grammar productions, a pretty hard work.

Best way to scrape a set of pages with mixed content

I’m trying to show a list of lunch venues around the office with their today’s menus. But the problem is the websites that offer the lunch menus, don’t always offer the same kind of content.
For instance, some of the websites offer a nice JSON output. Look at this one, it offers the English/Finnish course names separately and everything I need is available. There are couple of others like this.
But others, don’t always have a nice output. Like this one. The content is laid out in plain HTML and English and Finnish food names are not exactly ordered. Also food properties like (L, VL, VS, G, etc) are just normal text like the food name.
What, in your opinion, is the best way to scrape all these available data in different formats and turn them into usable data? I tried to make a scraper with Node.js (& phantomjs, etc) but it only works with one website, and it’s not that accurate in case of the food names.
Thanks in advance.
You may use something like kimonolabs.com, they are much easier to use and they give you APIs to update your side.
Remember that they are best for tabular data contents.
There my be simple algorithmic solutions to the problem, If there is a list of all available food names this can be really helpful, you find the occurrence of a food name inside a document (for today).
If there is not any food list, You may use TF/IDF. TF/IDF allows to calculate the score of a word inside a document among the current document and also other documents. But this solution needs enough data to work.
I think the best solution is some thing like this:
Creating a list of all available websites that should be scrapped.
Writing driver classes for each website data.
Each driver has the duty of creating the general domain entity from its standard document.
If you can use PHP, Simple HTML Dom Parser along with Guzzle would be a great choice. These two will provide a jQuery like path finder and a nice wrapper arround HTTP.
You are touching really difficult problem. Unfortunately there are no easy solutions.
Actually there are two different parts to solve:
data scraping from different sources
data integration
Let's start with first problem - data scraping from different sources. In my projects I usually process data in several steps. I have dedicated scrapers for all specific sites I want, and process them in the following order:
fetch raw page (unstructured data)
extract data from page (unstructured data)
extract, convert and map data into page-specific model (fully structured data)
map data from fully structured model to common/normalized model
Steps 1-2 are scraping oriented and steps 3-4 are strictly data-extraction / data-integration oriented.
While you can easily implement steps 1-2 relatively easy using your own webscrapers or by utilizing existing web services - data integration is the most difficult part in your case. You will probably require some machine-learning techniques (shallow, domain specific Natural Language Processing) along with custom heuristics.
In case of such a messy input like this one I would process lines separately and use some dictionary to get rid Finnish/English words and analyse what has left. But in this case it will never be 100% accurate due to possibility of human-input errors.
I am also worried that you stack is not very well suited to do such tasks. For such processing I am utilizing Java/Groovy along with integration frameworks (Mule ESB / Spring Integration) in order to coordinate data processing.
In summary: it is really difficult and complex problem. I would rather assume less input data coverage than aiming to be 100% accurate (unless it is really worth it).

What is the space efficiency of a directed acyclic word graph (dawg)? and is there a javascript implementation?

I have a dictionary of keywords that I want to make available for autocomplete/suggestion on the client side of a web application. The ajax turnaround introduces too much latency, so it would nice to store the entire word list on the client.
The list could be hundreds of thousands of words, maybe a couple of million. I did a little bit of research, and it seams that a dawg structure would provide space and lookup efficiency, but I can't find real world numbers.
Also, feel free to suggest other possibilities for achieving the same functionality.
I have recently implemented DAWG for a wordgame playing program. It uses a dictionary consisting of 2,7 million words from Polish language. Source plain text file is about 33MB in size. The same word list represented as DAWG in binary file takes only 5MB. Actual size may vary, as it depends on implementation, so number of vertices - 154k and number of edges - 411k are more important figures.
Still, that amount of data is far too big to handle by JavaScript, as stated above. Trying to process several MB of data will hang JavaScript interpreter for a few minutes, effectively hanging whole browser.
My mind cringes at the two facts "couple of million" and "JavaScript". JS is meant to shuffle little pieces of data around, not megabytes. Just imagine how long users would have to wait for your page to load!
There must be a reason why AJAX turnaround is so slow in your case. Google serves billion of AJAX requests every day and their type ahead is snappy (just try it on www.google.com). So there must be something broken in your setup. Find it and fix it.
Your solution sounds practical, but you still might want to look at, for example, jQuery's autocomplete implementation(s) to see how they deal with latency.
A couple of million words in memory (in JavaScript in a Browser)? That sounds big regardless of what kind of structure you decide to store it in. Your might consider other kinds of optimizations instead, like loading subsets of your wordlist based on the characters typed.
For example, if the user enters "a" then you'd start retrieving all the words that start with "a". Then you could optimize your wordlist by returning more common words first, so the more likely ones will match up "instantly" while less common words may load a little slower.
from my undestanding, DAWGs are good for storing and searching for words, but not when you need to generate lists of matches. Once you located the prefix, you will have to browser thru all its children to reconstruct the words which start with this prefix.
I agree with others, you should consider server-side search.

Categories