String similarity [duplicate]

I'm building a website that collects various news feeds, and I would like the texts to be compared for similarity. What I need is some sort of news text similarity algorithm.
I know that PHP has the similar_text function, but I am not sure how good it is, and I need this in JavaScript.
Could anyone point me to an example, a plugin, or any instructions on how to do this, or at least tell me where to start investigating?

There's a JavaScript implementation of the Levenshtein distance metric, which is often used for text comparisons. If you want to compare whole articles or headlines, though, you might be better off looking at the intersection between the sets of words that make up the texts (and the frequencies of those words) rather than just string similarity measures.
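To make the two ideas in this answer concrete, here is a minimal sketch of both: a plain Levenshtein distance and a Jaccard overlap between word sets. The function names (`levenshtein`, `wordSet`, `jaccard`) and the ASCII-only tokenizer are assumptions made for illustration, not the implementation the answer links to.

```javascript
// Levenshtein distance: the minimum number of single-character insertions,
// deletions, or substitutions needed to turn string a into string b.
function levenshtein(a, b) {
  // prev[j] = distance between the first i-1 chars of a and the first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed tokenizer: lowercase ASCII letters and digits only.
function wordSet(text) {
  return new Set(text.toLowerCase().match(/[a-z0-9]+/g) || []);
}

// Jaccard similarity: shared words divided by all distinct words (0 to 1).
function jaccard(textA, textB) {
  const a = wordSet(textA);
  const b = wordSet(textB);
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  for (const word of a) if (b.has(word)) shared++;
  return shared / (a.size + b.size - shared);
}
```

For headlines, something like `jaccard(headlineA, headlineB)` with a hand-tuned threshold is probably a more useful starting point than the raw edit distance, which grows with text length.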

The question of whether two texts are similar is a philosophical one as long as you don't specify exactly what it should mean. Consider the strings "house" and "mouse". Seen from a semantic level they are not very similar, but they are very similar regarding their "physical appearance", because only one letter is different (and in this case you could go by Levenshtein distance).
To decide about similarity you need an appropriate text representation. You could, for instance, extract and count all n-grams and compare the two resulting frequency vectors using a similarity measure such as cosine similarity. Or you could stem the words to their root form after having removed all stopwords, sum up their occurrences and use this as input for a similarity measure.
There are plenty of approaches and papers on that topic, e.g. this one about short texts. In any case, the higher the abstraction level at which you want to decide whether two texts are similar, the more difficult it gets. I think your question is a non-trivial one (and hence my answer rather abstract) ... ;-)
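The n-gram variant described above could be sketched roughly as follows. It counts character n-grams and compares the two frequency vectors with cosine similarity. The names `ngramCounts` and `cosineSimilarity` and the default n = 3 are assumptions for illustration. Stemming and stopword removal are left out.

```javascript
// Count overlapping character n-grams in a text (case-insensitive).
function ngramCounts(text, n = 3) {
  const counts = new Map();
  const s = text.toLowerCase();
  for (let i = 0; i + n <= s.length; i++) {
    const gram = s.slice(i, i + n);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// Cosine similarity between two sparse frequency vectors stored as Maps.
// Returns a value between 0 (nothing shared) and 1 (same direction).
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [gram, count] of a) {
    normA += count * count;
    if (b.has(gram)) dot += count * b.get(gram);
  }
  for (const count of b.values()) normB += count * count;
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}
```

With bigrams, "house" and "mouse" share three of their four n-grams, so `cosineSimilarity(ngramCounts("house", 2), ngramCounts("mouse", 2))` comes out at 0.75. That matches the point that the two words look alike even though they mean different things.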

Related

AI bot for a simple game

There is a relatively simple game with such rules:
There is a safe which needs to be unlocked.
The code to the safe is a 4-digit number without repeated digits (1234, 4867, 1092, etc.; a code like 1231 isn't possible in this game).
The game gives 5 attempts to guess the right code.
Let's say I start a new game and on the first try I test code like 0123.
The game responds with 2-1. 2 means that code 0123 has 2 right numbers which I need to use in the final unlock code. 1 means that one of those 2 numbers is at the correct position already.
After this I have 4 more identical steps where I try different codes based on the previously tested numbers and the responses from the game.
The goal is to get the final code, let's say 9135 (based on the previous 0123 try), for which the response from the game needs to be 4-4 (4 right numbers, 4 in place). The earlier it happens, the better.
I know that this can be solved using combinatorics just by excluding some combinations but I don't know how to choose the most weighted combination for the next try and hope AI can do it better.
I'm a frontend developer and an absolute beginner in AI. I don't really understand how complex the code to solve this problem will be or how much effort it requires. I would really appreciate it if you could explain this to me and share some links or code examples of similar solved tasks (the language doesn't matter, but JS or Python would be good), so I can solve my problem based on them.
Feel free to tell me if my explanation wasn't clear, I will try more simple words then:)
Thanks!

Your game sounds similar to Mastermind, only with numbers instead of colored pegs.
Googling "Mastermind AI" leads to e.g. this implementation using a genetic algorithm to solve Mastermind, which you could probably look at for inspiration.

While @AKX is correct that this is a variant of Mastermind, a genetic algorithm might not be the first place to look, as it is probably more complex than simpler approaches.
Donald Knuth is famous (among many other things) for working out a solution to the game. There is a good overview of this approach on the Puzzling Stack Exchange site, and if you look at the other answers on that question, there is also a discussion of how to code the solution.
In your case, the simple approach is to write a function that iterates from 0000 to 9999. These are all potential answers. But, when you iterate through the numbers you want to remove (1) all numbers with duplicate digits and (2) all numbers that are inconsistent with the guesses so far. Any other numbers can be put in an array or list storing potential answers. From these remaining numbers, you can just guess any number and then continue the process.
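A minimal sketch of that elimination loop, assuming the feedback format from the question ("right digits - right positions"). The function names `allCodes`, `feedback`, and `filterCandidates` are made up for illustration.

```javascript
// All 4-digit codes (leading zeros allowed) without repeated digits.
// There are 10 * 9 * 8 * 7 = 5040 of them.
function allCodes() {
  const codes = [];
  for (let n = 0; n <= 9999; n++) {
    const code = String(n).padStart(4, "0");
    if (new Set(code).size === 4) codes.push(code);
  }
  return codes;
}

// Response the game would give for `guess` if `secret` were the real code.
// Assumes neither string contains a repeated digit.
function feedback(guess, secret) {
  let shared = 0;
  let inPlace = 0;
  for (let i = 0; i < 4; i++) {
    if (secret.includes(guess[i])) shared++;
    if (secret[i] === guess[i]) inPlace++;
  }
  return `${shared}-${inPlace}`;
}

// Keep only the candidates that are consistent with an observed response.
function filterCandidates(candidates, guess, response) {
  return candidates.filter((code) => feedback(guess, code) === response);
}
```

A game loop would start with `let candidates = allCodes()`, guess any element, and after each response call `candidates = filterCandidates(candidates, guess, response)`. With the example from the question, `feedback("0123", "9135")` gives `"2-1"`, so 9135 survives the first filter.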
A more complicated approach would be to make the next guess using an algorithm similar to ID3 to try to find the guess that maximizes the information gain you get from the response. But, given how much information you get from each guess, this is unlikely to be needed.
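If you did want to try the information-gain idea, one possible sketch is to score each candidate guess by the Shannon entropy of the responses it could produce, and pick the highest. This is only in the spirit of ID3-style information gain, not ID3 itself. It assumes every remaining candidate is equally likely, and the names `responseEntropy` and `bestGuess` are hypothetical.

```javascript
// Response in the question's "right digits - right positions" format.
function feedback(guess, secret) {
  let shared = 0;
  let inPlace = 0;
  for (let i = 0; i < 4; i++) {
    if (secret.includes(guess[i])) shared++;
    if (secret[i] === guess[i]) inPlace++;
  }
  return `${shared}-${inPlace}`;
}

// Expected information (in bits) gained from the response to `guess`,
// assuming each remaining candidate is equally likely to be the secret.
function responseEntropy(guess, candidates) {
  const buckets = new Map();
  for (const secret of candidates) {
    const response = feedback(guess, secret);
    buckets.set(response, (buckets.get(response) || 0) + 1);
  }
  let entropy = 0;
  for (const size of buckets.values()) {
    const p = size / candidates.length;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}

// Pick the remaining candidate whose response splits the list most evenly.
// Ties go to the earliest candidate. This is quadratic in the list size.
function bestGuess(candidates) {
  let best = null;
  let bestEntropy = -1;
  for (const guess of candidates) {
    const h = responseEntropy(guess, candidates);
    if (h > bestEntropy) {
      best = guess;
      bestEntropy = h;
    }
  }
  return best;
}
```

On the full list of 5040 codes this does roughly 25 million feedback computations for the first guess, so it may be worth computing the opening guess once and hard-coding it.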

Find the closest coordinate from a set of coordinates

I have a set of about 1000 geographical coordinates (lat, long).
Given one coordinate, I want to find the closest one from that set. My approach was to measure the distance to every point, but at hundreds of requests per second, doing all that math can be a little rough on the server.
What is the best optimized solution for this?
Thanks

You will want to use the 'Nearest Neighbor Algorithm'.

You can use this library sphere-knn, or look at something like PostGIS.

Why not select the potentially closest points from the set first (e.g. set a threshold of, say, 0.1 and filter the set so that you keep only points within ±0.1 of your target point on both axes)? Then do the actual calculations on this subset.
If none are within the first range, just enlarge it (0.2) and repeat (0.3, 0.4...) until you've got a match. Obviously you would tune the threshold so that it best matches your likely results.
(I'm assuming the time-consuming bit is the actual distance calculation, so the idea is to limit the number of calculations.)
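A rough sketch of that expanding-threshold filter, assuming points are `{ lat, lon }` objects in degrees and using the haversine formula for the "actual" distance. The names and the default step of 0.1 degrees are illustrative. Note that this is an approximation: a point just outside the box can be slightly closer than one in a corner of it, and the sketch ignores wrap-around at the ±180° meridian.

```javascript
const EARTH_RADIUS_KM = 6371;
const toRad = (deg) => (deg * Math.PI) / 180;

// Great-circle distance between two { lat, lon } points, in kilometres.
function haversineKm(a, b) {
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Cheap box filter first, growing the box by `step` degrees until it
// contains at least one point; haversine is computed only inside the box.
function nearestWithBox(points, target, step = 0.1, maxSteps = 3600) {
  if (points.length === 0) return null;
  for (let k = 1; k <= maxSteps; k++) {
    const t = step * k;
    const nearby = points.filter(
      (p) =>
        Math.abs(p.lat - target.lat) <= t && Math.abs(p.lon - target.lon) <= t
    );
    if (nearby.length > 0) {
      return nearby.reduce((best, p) =>
        haversineKm(target, p) < haversineKm(target, best) ? p : best
      );
    }
  }
  return null;
}
```

The filter still touches every point, but only with two subtractions and comparisons each, which is much cheaper than the trigonometry in `haversineKm`.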

An Algorithmic Response
Your approach is already O(n) in time. It's algorithmically very fast, and fairly simple to implement.
If that is not enough, you should consider taking a look at R-trees. The idea behind R-trees is roughly paraphrased as follows:
You already have a set of n elements. You can preprocess this data to form rough 'squares' of regions each containing a set of points, with an established boundary.
Now say a new element comes in. Instead of comparing across every coordinate, you identify which 'square' it belongs in by just comparing whether the point is smaller than the boundaries, and then measure the distance with only the points inside that square.
You can see at once the benefits:
You are no longer comparing against all coordinates, but instead only the boundaries (strictly less than the number of all elements) and then against the number of coordinates within your chosen boundary (also less than the number of all elements).
In the worst case such a lookup is still O(n) time. On average it may be O(log n).
The main improvement is mostly in the pre-processing step (which is 'free' in that it's a one-time cost) and in the reduced number of comparisons needed.
A Systemic Response
Just buy another server, and distribute the requests and the elements using a load balancer such as HAProxy.
Servers are fairly cheap, especially if they are critical to your business, and if you want to be fast, it's an easy way to scale.

Algorithms: Aggregate of substrings to determine relevant information

I am trying to do an aggregate algorithm that will get the most important elements in a text based on user highlights.
Imagine you have a text of n words in which you can select k contiguous words as a "relevant highlight", where 1 <= k <= n (each highlight is a substring of the text).
Assuming we collect anywhere from 10 to 10000 of these highlights, is there any algorithm that can determine the most important information?
Consider that many of the highlights would overlap, and we need to take that into account. I would also prefer a solution in JavaScript, since it's for a Chrome extension.
This is NOT for a class, this is for a personal project concerning crowd-based summarization.

Suppose that each user highlights some stretches of text and that you know what those highlights are. You could sum, for each word in the text, how many people highlighted it. One thing you could calculate is, for some fixed k and N, a total of k stretches using at most N words in all, such that the sum of the number of times those N words were highlighted is a maximum.
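The per-word count is the easy part. A minimal sketch, assuming each highlight is stored as an inclusive `[start, end]` pair of word indices (that storage format and the name `highlightCounts` are assumptions):

```javascript
// counts[i] = number of highlights that cover word i.
// Uses a difference array, so the cost is O(words + highlights)
// even when there are thousands of overlapping highlights.
function highlightCounts(wordCount, highlights) {
  const diff = new Array(wordCount + 1).fill(0);
  for (const [start, end] of highlights) {
    diff[start] += 1; // coverage begins at `start`
    diff[end + 1] -= 1; // and stops after `end`
  }
  const counts = [];
  let running = 0;
  for (let i = 0; i < wordCount; i++) {
    running += diff[i];
    counts.push(running);
  }
  return counts;
}
```

For example, with six words and highlights `[0, 2]`, `[1, 4]` and `[2, 2]`, the counts are `[1, 2, 3, 1, 1, 0]`.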
You can do this with dynamic programming, working left to right within the text. For each point in the text and each possible allowed combination of (# highlights, # total words highlighted, whether current word is highlighted) you work out the score for the best answer terminating at that point satisfying those constraints. You can work out the best answer at each point by using the best answers for the previous word - consider the possible scores you get if you take any one of the existing best answers and either extend a current highlight, if that last word was highlighted, or start a new highlight. At the end you track the best answer for the whole text back from right to left.
This gives you a summary in the form of the best section of k stretches to highlight, using at most N words to pick up as many of the words highlighted by users as possible. No doubt there are variations on this for different scores or for different highlighting constraints - it might be easier to compute the best combination of k stretches, where each stretch is of at most M characters.
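One way the dynamic program described above might look, as a sketch under assumptions: `counts[i]` is the number of users who highlighted word i, `k` is the maximum number of stretches, and `maxWords` plays the role of N. The state is (stretches used, words used, whether the current word is selected), and parent pointers are kept for the right-to-left traceback. The name `bestStretches` is hypothetical.

```javascript
function bestStretches(counts, k, maxWords) {
  const n = counts.length;
  const idx = (j, w, s) => (j * (maxWords + 1) + w) * 2 + s;
  const size = (k + 1) * (maxWords + 1) * 2;
  let prev = new Array(size).fill(-Infinity);
  prev[idx(0, 0, 0)] = 0; // nothing selected yet
  const parents = []; // parents[i][state] = state before word i

  for (let i = 0; i < n; i++) {
    const curr = new Array(size).fill(-Infinity);
    const parent = new Array(size).fill(-1);
    const relax = (to, from, score) => {
      if (score > curr[to]) {
        curr[to] = score;
        parent[to] = from;
      }
    };
    for (let j = 0; j <= k; j++) {
      for (let w = 0; w <= maxWords; w++) {
        for (let s = 0; s < 2; s++) {
          const from = idx(j, w, s);
          if (prev[from] === -Infinity) continue;
          relax(idx(j, w, 0), from, prev[from]); // leave word i out
          if (w === maxWords) continue; // word budget exhausted
          const gained = prev[from] + counts[i];
          if (s === 1) relax(idx(j, w + 1, 1), from, gained); // extend stretch
          else if (j < k) relax(idx(j + 1, w + 1, 1), from, gained); // new stretch
        }
      }
    }
    parents.push(parent);
    prev = curr;
  }

  // Best final state, then trace back from right to left.
  let state = 0;
  for (let t = 1; t < size; t++) if (prev[t] > prev[state]) state = t;
  const score = prev[state];
  const selected = new Array(n).fill(false);
  for (let i = n - 1; i >= 0; i--) {
    selected[i] = state % 2 === 1;
    state = parents[i][state];
  }

  // Convert the selection into inclusive [start, end] word ranges.
  const stretches = [];
  for (let i = 0; i < n; i++) {
    if (!selected[i]) continue;
    if (i > 0 && selected[i - 1]) stretches[stretches.length - 1][1] = i;
    else stretches.push([i, i]);
  }
  return { score, stretches };
}
```

Time and memory are both proportional to n × k × maxWords, so for long texts it may be worth keeping k and maxWords small or working on sentences instead of words.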

Equations Game. Finding solution for a random Goal from random Resource

I'm writing an AI for a game called Equations in JavaScript.
For the sake of the question, let's pretend that the game is this simple:
There is a Goal, which can be a number (e.g. 5) or an expression that can be evaluated to a number (e.g. 2+3).
There are 20 random numbers (1-9) and operators (+-*/) I can use; let's call them the array resources[]. I need to find one combination of the elements in resources[] that evaluates to the Goal; let's call that the solution (e.g. 1+6-2+1).
There is no limit on how many numbers or operators I can use, as long as they are in resources[]. Once they are used, they cannot be used again. So the longest solution might be 20 symbols long.
Is there a way I can quickly find such solution? The AI might need to evaluate this many times when analysing a move's score.
Thanks guys

Grade Sudoku difficulty level

I am building a Sudoku game for fun, written in JavaScript.
Everything works fine; the board is generated completely, with a single solution each time.
My only problem, and this is what's keeping me from releasing my project to the public, is that I don't know how to grade my boards for difficulty levels. I've looked EVERYWHERE, posted on forums, etc. I don't want to write the algorithms myself; that's not the point of this project, and besides, they are too complex for me, as I am no mathematician.
The closest thing I found was this website that does grading via JS, but the problem is that the code is written in such a lousy, undocumented, very ad hoc manner that it cannot be borrowed...
I'll come to the point: can anyone please point me to a place that offers source code for Sudoku grading/rating?
Thanks
Update 22.6.11:
This is my Sudoku game, and I've implemented my own grading system which relies
on basic human logic solving techniques, so check it out.

I have considered this problem myself and the best I can do is to decide how difficult the puzzle is to solve by actually solving it and analyzing the game tree.
Initially:
Implement your solver using "human rules", not with algorithms unlikely to be used by human players. (An interesting problem in its own right.) Score each logical rule in your solver according to its difficulty for humans to use. Use values in the hundreds or larger so you have freedom to adjust the scores relative to each other.
Solve the puzzle. At each position:
Enumerate all new cells which can be logically deduced at the current game position.
The score of each deduction (completely solving one cell) is the score of the easiest rule that suffices to make that deduction.
EDIT: If more than one rule must be applied together, or one rule multiple times, to make a single deduction, track it as a single "compound" rule application. To score a compound, maybe use the minimum number of individual rule applications to solve a cell times the sum of the scores of each. (Considerably more mental effort is required for such deductions.) Calculating that minimum number of applications could be a CPU-intensive effort depending on your rules set. Any rule application that completely solves one or more cells should be rolled back before continuing to explore the position.
Exclude all deductions with a score higher than the minimum among all deductions. (The logic here is that the player will not perceive the harder ones, having perceived an easier one and taken it; and also, this promises to prune a lot of computation out of the decision process.)
The minimum score at the current position, divided by the number of "easiest" deductions (if many exist, finding one is easier) is the difficulty of that position. So if rule A is the easiest applicable rule with score 20 and can be applied in 4 cells, the position has score 5.
Choose one of the "easiest" deductions at random as your play and advance to the next game position. I suggest retaining only completely solved cells for the next position, passing no other state. This is wasteful of CPU of course, repeating computations already done, but the goal is to simulate human play.
The puzzle's overall difficulty is the sum of the scores of the positions in your path through the game tree.
EDIT: Alternative position score: Instead of completely excluding deductions using harder rules, calculate overall difficulty of each rule (or compound application) and choose the minimum. (The logic here is that if rule A has score 50 and rule B has score 400, and rule A can be applied in one cell but rule B can be applied in ten, then the position score is 40 because the player is more likely to spot one of the ten harder plays than the single easier one. But this would require you to compute all possibilities.)
EDIT: Alternative suggested by Briguy37: Include all deductions in the position score. Score each position as 1 / (1/d1 + 1/d2 + ...) where d1, d2, etc. are the individual deductions. (This basically computes "resistance to making any deduction" at a position given individual "deduction resistances" d1, d2, etc. But this would require you to compute all possibilities.)
Hopefully this scoring strategy will produce a metric for puzzles that increases as your subjective appraisal of difficulty increases. If it does not, then adjusting the scores of your rules (or your choice of heuristic from the above options) may achieve the desired correlation. Once you have achieved a consistent correlation between score and subjective experience, you should be able to judge what the numeric thresholds of "easy", "hard", etc. should be. And then you're done!
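The three position-score variants above could be sketched like this. It assumes the solver hands back one `{ rule, score }` object per cell that can be deduced at the current position, with positive scores. That data shape and the function names are assumptions for illustration.

```javascript
// Original proposal: only the easiest deductions count.
// Score = lowest rule score / number of deductions with that score.
function easiestOnlyScore(deductions) {
  if (deductions.length === 0) return Infinity;
  const min = Math.min(...deductions.map((d) => d.score));
  const count = deductions.filter((d) => d.score === min).length;
  return min / count;
}

// First alternative: compute score / applicable cells for every rule
// and take the minimum over the rules.
function perRuleMinimumScore(deductions) {
  const perRule = new Map(); // rule name -> { score, count }
  for (const d of deductions) {
    const entry = perRule.get(d.rule) || { score: d.score, count: 0 };
    entry.count += 1;
    perRule.set(d.rule, entry);
  }
  let best = Infinity;
  for (const { score, count } of perRule.values()) {
    best = Math.min(best, score / count);
  }
  return best;
}

// Second alternative: 1 / (1/d1 + 1/d2 + ...) over all deductions.
function harmonicScore(deductions) {
  const sum = deductions.reduce((acc, d) => acc + 1 / d.score, 0);
  return 1 / sum;
}

// Overall difficulty: sum of the position scores along the solving path.
function puzzleDifficulty(positionScores) {
  return positionScores.reduce((acc, s) => acc + s, 0);
}
```

With the numbers from the answer, four deductions by a rule scoring 20 give `easiestOnlyScore` 5, and one deduction at 50 plus ten at 400 give `perRuleMinimumScore` 40.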

Donald Knuth studied the problem and came up with the Dancing Links algorithm, which can be used for solving Sudoku puzzles and then rating their difficulty.
Google around, there are several implementations of the Dancing Links engine.

Perhaps you could grade the general "constrainedness" of a puzzle? Consider that a new puzzle (with only hints) might have a certain number of cells which can be determined simply by eliminating the values which it cannot contain. We could say these cells are "constrained" to a smaller number of possible values than the typical cell and the more highly constrained cells that exist the more progress one can make on the puzzle without guessing. (Here we consider the requirement for "guessing" to be what makes a puzzle hard.)
At some point, however, the player must start guessing and, again, the constrainedness of a cell is important because with fewer values to choose between for a given cell the easier it is to find the correct value (and increase the constrainedness of other cells).
Of course, I don't actually play Sudoku (I just enjoy writing games and solvers for it), so I have no idea if this is a valid metric, just thinking out loud =)
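To experiment with this idea, here is a small sketch that counts the remaining candidate values for each empty cell. It assumes the grid is a 9x9 array of numbers with 0 for an empty cell. The summary fields (`empty`, `forced`, `averageCandidates`) are one guess at what a "constrainedness" measure could look like, not an established metric.

```javascript
// Values that can still go into an empty cell, found by eliminating
// everything already present in its row, column, and 3x3 box.
function candidates(grid, row, col) {
  const used = new Set();
  for (let i = 0; i < 9; i++) {
    used.add(grid[row][i]);
    used.add(grid[i][col]);
  }
  const boxRow = row - (row % 3);
  const boxCol = col - (col % 3);
  for (let r = boxRow; r < boxRow + 3; r++) {
    for (let c = boxCol; c < boxCol + 3; c++) used.add(grid[r][c]);
  }
  const result = [];
  for (let v = 1; v <= 9; v++) if (!used.has(v)) result.push(v);
  return result;
}

// Summary over all empty cells: how many there are, how many are forced
// (exactly one candidate), and the average number of candidates per cell.
function constrainedness(grid) {
  let empty = 0;
  let forced = 0;
  let totalCandidates = 0;
  for (let r = 0; r < 9; r++) {
    for (let c = 0; c < 9; c++) {
      if (grid[r][c] !== 0) continue;
      const n = candidates(grid, r, c).length;
      empty++;
      totalCandidates += n;
      if (n === 1) forced++;
    }
  }
  const averageCandidates = empty === 0 ? 0 : totalCandidates / empty;
  return { empty, forced, averageCandidates };
}
```

A lower `averageCandidates` and a higher share of `forced` cells would suggest an easier starting position, though whether that tracks perceived difficulty would need checking against real puzzles.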

I have a simple solver that looks only for unique possibilities in rows, columns and squares. When it has solved the few cells solvable by this method, it picks a remaining candidate, tries it, and sees whether the simple solver then leads to a solution, to a cell empty of possibilities, or to neither. In the first case the puzzle is solved; in the second, one possibility has been shown to be infeasible and is thus eliminated. In the third case, which leads to neither a final solution nor an infeasibility, no deduction can be reached.
The primary result of cycling through this procedure is to eliminate possibilities until picking a correct cell entry leads to a solution. So far this procedure has solved even the hardest puzzles without fail. It solves puzzles with multiple solutions without difficulty. If the trial candidates are picked at random, it will generate all possible solutions.
I then generate a difficulty for the puzzle based on the number of illegal candidates that must be eliminated before the simple solver can find a solution.
I know that this is like guessing, but if simple logic can eliminate a possible candidate, then one is closer to the final solution.
Mike

I've done this in the past.
The key is that you have to figure out which rules to use from a human logic perspective. The example you provide details a number of different human logic patterns as a list on the right-hand side.
You actually need to solve the puzzle using these rules instead of computer rules (which can solve it in milliseconds using simple pattern replacement). Every time you change the board, you can start over from the 'easiest' pattern (say, single open boxes in a cell or row) and move down the chain until you find the next logical 'rule' to use.
When scoring the Sudoku, each methodology is assigned some point value, which you would add up for every field you needed to fill out. While 'single empty cell' might get a 0, 'XY Chain' might get 100. You tabulate all of the methods needed (and their frequency) and you wind up with a final weighting. There are plenty of places that list expected values for those weightings, but they are all fairly empirical. You're trying to model human logic, so feel free to come up with your own weightings or enhance the system (if you really only use XY chains, the puzzle is probably easier than if it requires more advanced mechanisms).
You may also find that even though you have a unique Sudoku, it is unsolvable through human logic.
And also note that this is all far more CPU intensive than solving it in a standard, patterned way. Some years ago when I wrote my code, it was taking multiple (I forget exactly, but maybe even up to 15) seconds to solve some of the generated puzzles I'd created.
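The tabulation step might look something like the sketch below. It assumes the human-logic solver produces a log with one technique name per filled cell. Only the two weights mentioned above are included, and a real table would need an entry for every technique the solver knows. `scoreSolveLog` is a hypothetical name.

```javascript
// Example weights taken from the answer; everything else is up to you.
const TECHNIQUE_WEIGHTS = {
  "single empty cell": 0,
  "XY Chain": 100,
};

// solveLog: array of technique names, one per cell the solver filled in.
// Returns the total weighting plus how often each technique was needed.
function scoreSolveLog(solveLog, weights = TECHNIQUE_WEIGHTS) {
  const frequency = {};
  let total = 0;
  for (const technique of solveLog) {
    if (!Object.prototype.hasOwnProperty.call(weights, technique)) {
      throw new Error(`No weight defined for technique: ${technique}`);
    }
    frequency[technique] = (frequency[technique] || 0) + 1;
    total += weights[technique];
  }
  return { total, frequency };
}
```

Keeping the frequency table alongside the total makes it easier to tune the weights later, since you can see which techniques drive the score for a given puzzle.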

Assuming difficulty is directly proportional to the time it takes a user to solve the puzzle, here is an Artificially Intelligent solution that approaches the results of the ideal algorithm over time.
Randomly generate a fixed number of starting puzzle layouts, say 100.
Initially, offer a random difficulty section that lets a user play random puzzles from the available layouts.
Keep an average random solution time for each user. I would probably make a top 10/top X leaderboard for this to generate interest in playing random puzzles.
Keep an average solution time multiplier for each puzzle solution (if the user normally solves a puzzle in 5 minutes and solves this one in 20 minutes, 4 should be figured into the puzzle's average solution time multiplier).
Once a puzzle has been played enough times to get a base difficulty for the puzzle, say 5 times, add that puzzle to your list of rated puzzles and add another randomly generated puzzle to your available puzzle layouts.
Note: You should keep the first puzzle in your random puzzles list so that you can get better and better statistics on it.
Once you have enough base-rated puzzles, say 50, allow users to access the "Rated Difficulty" portion of your application. The difficulty for each puzzle will be the average time multiplier for that puzzle.
Note: When users choose to play puzzles with rated difficulty, this should NOT affect the average random solution time or average solution time multiplier, unless you want to get into calculating weighted averages (otherwise if a user plays a lot of harder puzzles, their average time and time multipliers will be skewed).
Using the method above, a solution would be rated from 0 (already solved/no time to solve) to 1 (users will probably solve this puzzle in their average time) to 2 (users will probably take twice as long to solve this puzzle than their average time) to infinity (users will take forever to find a solution to this puzzle).
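The bookkeeping behind this is small. A sketch, assuming times are stored in minutes and using the threshold of 5 plays mentioned above (the function names are made up for illustration):

```javascript
// How much longer than usual this user took on this puzzle.
// 20 minutes for a user who averages 5 minutes gives 4.
function timeMultiplier(solveMinutes, userAverageMinutes) {
  return solveMinutes / userAverageMinutes;
}

// Rated difficulty of a puzzle: the average of its recorded multipliers,
// or null until it has been played at least `minPlays` times.
function ratedDifficulty(multipliers, minPlays = 5) {
  if (multipliers.length < minPlays) return null;
  const sum = multipliers.reduce((acc, m) => acc + m, 0);
  return sum / multipliers.length;
}
```

Only plays from the random section should be passed to these functions, in line with the note above about not letting rated plays skew the averages.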
