Find the closest coordinate from a set of coordinates - javascript

I have a set of about 1,000 geographical coordinates (lat, long).
Given one coordinate, I want to find the closest one in that set. My approach was to measure the distance to every point, but at hundreds of requests per second, doing all that math can be rough on the server.
What is the best optimized solution for this?
Thanks

You will want to use the 'Nearest Neighbor Algorithm'.

You can use the sphere-knn library, or look at something like PostGIS.
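For example, a minimal lookup with sphere-knn might look like this (a sketch assuming its documented API: you build a lookup function once from an array of {lat, lon} objects):

    const sphereKnn = require("sphere-knn");
    const lookup = sphereKnn(coords);          // coords: [{lat, lon}, ...], built once
    const [closest] = lookup(lat, lon, 1);     // third argument = number of results wanted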

Why not select the potential closest points from the set first? E.g. set a threshold, say 0.1, and filter the set so that you keep only the points within ±0.1 of your target point on both axes. Then do the actual distance calculations on this reduced set.
If none are within the first range, just enlarge it (0.2) and repeat (0.3, 0.4...) until you get a match. Obviously you would tune the threshold so it best matches your likely results.
(I'm assuming the time-consuming bit is the actual distance calculation, so the idea is to limit the number of calculations.)
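A sketch of that idea in JavaScript, assuming a non-empty points array of {lat, lon} objects (haversineKm and nearest are illustrative names). Note it is a heuristic: a point just outside the current box can occasionally be closer than one inside it.

    function haversineKm(a, b) {
      const R = 6371, rad = Math.PI / 180;
      const dLat = (b.lat - a.lat) * rad, dLon = (b.lon - a.lon) * rad;
      const h = Math.sin(dLat / 2) ** 2 +
                Math.cos(a.lat * rad) * Math.cos(b.lat * rad) * Math.sin(dLon / 2) ** 2;
      return 2 * R * Math.asin(Math.sqrt(h));
    }
    function nearest(target, points) {
      for (let t = 0.1; t <= 360; t += 0.1) {   // widen the box until something matches
        const candidates = points.filter(p =>
          Math.abs(p.lat - target.lat) <= t && Math.abs(p.lon - target.lon) <= t);
        if (candidates.length === 0) continue;
        let best = null, bestD = Infinity;
        for (const p of candidates) {           // exact math only on the filtered subset
          const d = haversineKm(target, p);
          if (d < bestD) { bestD = d; best = p; }
        }
        return best;
      }
      return null; // points was empty
    }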

An Algorithmic Response
Your approach is already O(n) in time. It's algorithmically very fast, and fairly simple to implement.
If that is not enough, you should consider taking a look at R-trees. The idea behind R-trees is roughly paraphrased as follows:
You already have a set of n elements. You can preprocess this data to form rough 'squares' of regions each containing a set of points, with an established boundary.
Now say a new element comes in. Instead of comparing across every coordinate, you identify which 'square' it belongs in by just comparing whether the point is smaller than the boundaries, and then measure the distance with only the points inside that square.
You can see at once the benefits:
You are no longer comparing against every coordinate: first you compare only against the boundaries (strictly fewer than the number of elements), and then only against the coordinates inside the chosen square (also fewer than the number of elements).
The worst case is still O(n) time, but on average a lookup is closer to O(log n).
The main improvement is mostly in the pre-processing step (which is 'free' in that it's a one-time cost) and in the reduced number of comparisons needed.
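To make the idea concrete, here is an illustrative sketch using a fixed grid of buckets instead of a real R-tree (the cell size and names are arbitrary choices); a query only ever touches the target's cell and its eight neighbors:

    function buildGrid(points, cellSize) {
      const grid = new Map();
      const key = p => `${Math.floor(p.lat / cellSize)}:${Math.floor(p.lon / cellSize)}`;
      for (const p of points) {                  // one-time preprocessing cost
        const k = key(p);
        if (!grid.has(k)) grid.set(k, []);
        grid.get(k).push(p);
      }
      return { grid, cellSize };
    }
    function candidates({ grid, cellSize }, target) {
      const ci = Math.floor(target.lat / cellSize), cj = Math.floor(target.lon / cellSize);
      const out = [];
      for (let di = -1; di <= 1; di++)           // the target's cell plus its 8 neighbors
        for (let dj = -1; dj <= 1; dj++)
          out.push(...(grid.get(`${ci + di}:${cj + dj}`) || []));
      return out; // measure exact distances against this small set only
    }

(If all nine cells are empty, you would widen the ring of cells, just as in the expanding-threshold answer above.)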
A Systemic Response
Just buy another server, and distribute the requests and the elements using a load balancer such as HAProxy.
Servers are fairly cheap, especially if they are critical to your business, and if you want to be fast, it's an easy way to scale.


How to identify breakpoints (trend lines edges) in a dataset?

I've been doing some research to find the best approach to identifying breakpoints (trend direction changes) in a dataset of x/y coordinate pairs, which would allow me to identify the trend lines behind my data collections.
However, I've had no luck finding anything that sheds light on it.
The yellow dots in the following image represent the breakpoints I need to detect.
Any suggestion of an article, algorithm, or implementation example (TypeScript preferred) would be very helpful and appreciated.
Usually, people tend to filter the data by looking only at the maxima (support) or only at the minima (resistance). A trend line could be the average of those. The breakpoints are where the data crosses the trend, but this gives a lot of false breakpoints. Because images are better than words, have a look at page 2 of http://www.meacse.org/ijcar/archives/128.pdf.
There are a lot of scripts available; look for "ZigZag" on
https://www.tradingview.com/,
e.g. https://www.tradingview.com/script/lj8djt1n-ZigZag/ and https://www.tradingview.com/script/prH14cfo-Trend-Direction-Helper-ZigZag-and-S-R-and-HH-LL-labels/.
You can also find an interesting blog post here (the code is in Python):
https://towardsdatascience.com/programmatic-identification-of-support-resistance-trend-lines-with-python-d797a4a90530
with the code available at https://pypi.org/project/trendln/.
If you can identify trend lines, then can't you just identify a breakpoint as the point where the slope changes? If you can't identify trend lines, then can you, for example, take a 5-day moving average and see when that changes slope?
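A rough sketch of that suggestion (the window size and names are arbitrary): compute a moving average, then flag the indices where its slope changes sign.

    function movingAverage(values, window = 5) {
      const out = [];
      for (let i = 0; i + window <= values.length; i++) {
        let sum = 0;
        for (let j = i; j < i + window; j++) sum += values[j];
        out.push(sum / window);
      }
      return out;
    }
    function slopeChangeIndices(values, window = 5) {
      const ma = movingAverage(values, window);
      const breaks = [];
      for (let i = 2; i < ma.length; i++) {
        const prev = ma[i - 1] - ma[i - 2], cur = ma[i] - ma[i - 1];
        if (prev * cur < 0)                            // slope changed sign
          breaks.push(i - 1 + Math.floor(window / 2)); // map back to an original index
      }
      return breaks;
    }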
This might sound strange, or even controversial, but there are no "breakpoints". Even in your image, the fourth breakpoint might as well be on the local maximum immediately before its current position. So different people would call different points on the same graph "breakpoints" (and, indeed, they do).
What you do have in the numbers are several possible moving averages (calculated over varying intervals, so you might consider MA5 for a five-day average, or MA7 for a weekly average) and their first and maybe second derivatives (if you feel fancy, you can experiment with third derivatives). If you plot all these lines, suitably smoothed, over your data, you will notice that the salient points of some of them roughly intersect near the "breakpoints". Those are the parameters your brain weighs when you "see" the breakpoints; it is why you see the breakpoints there, and not somewhere else.
Another method the human eye employs to recognize features is trimming outliers: you discard from the above calculations either any value outside a given tolerance, or a fixed percentage of all values, starting from those farthest from the average. You can also experiment with not trimming values that are outliers for longer-period moving averages but not for shorter ones (this gives better responsiveness but will find more "breakpoints"). Then you run the same calculations on the remaining data.
Finally, you can assign each candidate a "breakpoint score" by weighting its distance from the nearby salient points. Then choose a desired breakpoint spacing and call "breakpoint" the highest-scoring point in that interval; repeat for all subsequent breakpoints. Again, you may want to experiment with different intervals. This gives a conveniently paced breakpoint set.
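One possible reading of that scoring step, as a greedy sketch (candidates, score and minGap are hypothetical names): walk the candidates in decreasing score order and keep each one only if it is far enough from every breakpoint already chosen.

    function pickBreakpoints(candidates, minGap) {
      // candidates: [{ index, score }]
      const byScore = [...candidates].sort((a, b) => b.score - a.score);
      const chosen = [];
      for (const c of byScore)
        if (chosen.every(p => Math.abs(p.index - c.index) >= minGap)) chosen.push(c);
      return chosen.sort((a, b) => a.index - b.index);
    }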
In the end, you will probably notice that different kinds of signal sources have different "best" breakpoint parameters, so there is no "one-size-fits-all" parameter set.
If you're building an interface to display data, leaving the control of these parameters to the user might be a good idea.

How could I compare two data sets of X,Y coordinates to look for similarities?

Pretty abstract question, but I'm conducting some research on gesture-based recognition. I've managed to get a gesture output as a series of X,Y coordinates that I can view as a scatter graph:
Here's my problem: I'm unsure how to proceed. What is the best way to compare two data sets of X,Y coordinates and give a confidence percentage for how similar they are?
I'm currently using JavaScript and would ideally like to keep using it.
Reading about handwriting recognition software, it seems that the early phases, such as recognising the strokes, might be helpful: decompose the gesture into a number of elements (lines or curves), then apply some matching algorithms.
There are a couple of other questions to be answered here, for example: if we have two identical gestures, but one is much slower than the other and takes ten times as long, would they be considered similar?
Anyway, for starters I would look, at each moment in time, at the positions of the cursor in both gestures and determine the geometric distance between them. You could then aggregate these into a 'deviation' number for one gesture from the other; if the number is big, the gestures are probably not similar. This could be a starting point.
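A starting-point sketch in JavaScript: resample both gestures to the same number of points (which also normalizes away speed differences), then average the pointwise distances into a single deviation score, lower meaning more similar. All names here are illustrative.

    function resample(points, n) {               // linear interpolation to n points
      const out = [];
      for (let k = 0; k < n; k++) {
        const t = k * (points.length - 1) / (n - 1);
        const i = Math.floor(t), f = t - i;
        const a = points[i], b = points[Math.min(i + 1, points.length - 1)];
        out.push({ x: a.x + (b.x - a.x) * f, y: a.y + (b.y - a.y) * f });
      }
      return out;
    }
    function deviation(g1, g2, n = 64) {
      const a = resample(g1, n), b = resample(g2, n);
      let total = 0;
      for (let i = 0; i < n; i++) total += Math.hypot(a[i].x - b[i].x, a[i].y - b[i].y);
      return total / n;
    }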

Looking for an algorithm to cluster 3d points, around 2d points

I'm trying to cluster photos (GPS + timestamp) around known GPS locations.
3d points = 2d + time stamp.
For example:
I walk along the road and take photos of lampposts, some are interesting and so I take 10 photos and others are not so I don't take any.
I'd like to cluster my photos around the lampposts, allowing me to see which lamppost was being photographed.
I've been looking at something like k-means clustering, but wanted something more intelligent than just snapping the photos to the nearest lamppost.
(I'm going to write the code in JavaScript for a client-side app handling about (2000, 500) points at a time.)
K-means clustering is indeed a popular and easy-to-implement algorithm, but it has a couple of problems.
You need to feed it the number of clusters N as an input variable. Since I assume you don't know how many "things" you want to photograph, finding the right N is a problem in itself. Using iterative k-means or similar variations only shifts the problem to finding a proper evaluation function for multi-cluster partitions, which is in no way easier than finding N itself.
It can only detect linearly separable shapes. Say you are walking around Versailles, and you take a lot of pictures of the external walls; then you move inside and take pictures of the inner garden. The two shapes you obtain are a torus with a disk inside it, but k-means can't distinguish them.
Personally, I'd go with some sort of density-based clustering: you'll still have to feed the algorithm some parameters, but, since we assume the space is Euclidean, finding them shouldn't take too much effort. Plus, it gives you the ability to distinguish noise points from cluster points, and lets you treat them differently.
Furthermore, it can distinguish most shapes, and you don't need to give the number of clusters beforehand.
Density based clustering, such as DBSCAN, definitely is the way to go.
The two parameters of DBSCAN should be quite obvious to set:
epsilon: this is the radius for clustering, so e.g. you could use 10 meters, assuming that there are no lampposts closer than 10 meters. (You should be using Geodetic distance, not Euclidean!)
minPts: essentially the minimum size of a cluster. You can use 1 or 2, even.
distance: this parameter is implicit, but probably more important. You can use a combination of space and time here. E.g. 10 meters spatially, and 1 year in the time domain. See Generalized DBSCAN for the more flexible version, which makes it obvious how to use multiple domains.
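A compact sketch of DBSCAN with such a combined space/time neighborhood predicate, roughly in the spirit of Generalized DBSCAN (the thresholds and names are examples, not prescriptions):

    function metersBetween(a, b) {               // haversine distance in meters
      const R = 6371000, rad = Math.PI / 180;
      const dLat = (b.lat - a.lat) * rad, dLon = (b.lon - a.lon) * rad;
      const h = Math.sin(dLat / 2) ** 2 +
                Math.cos(a.lat * rad) * Math.cos(b.lat * rad) * Math.sin(dLon / 2) ** 2;
      return 2 * R * Math.asin(Math.sqrt(h));
    }
    function isNeighbor(a, b) {                  // within 10 m AND within 1 year
      const SPACE_M = 10, TIME_MS = 365 * 24 * 3600 * 1000;
      return metersBetween(a, b) <= SPACE_M && Math.abs(a.time - b.time) <= TIME_MS;
    }
    function dbscan(points, minPts) {
      const NOISE = -1, labels = new Array(points.length).fill(undefined);
      const neighborsOf = i => points.reduce(
        (acc, p, j) => (isNeighbor(points[i], p) ? (acc.push(j), acc) : acc), []);
      let cluster = 0;
      for (let i = 0; i < points.length; i++) {
        if (labels[i] !== undefined) continue;
        const seeds = neighborsOf(i);            // note: includes i itself
        if (seeds.length < minPts) { labels[i] = NOISE; continue; }
        labels[i] = ++cluster;
        for (const queue = [...seeds]; queue.length; ) {
          const j = queue.pop();
          if (labels[j] === NOISE) labels[j] = cluster;  // border point, not expanded
          if (labels[j] !== undefined) continue;
          labels[j] = cluster;
          const jn = neighborsOf(j);
          if (jn.length >= minPts) queue.push(...jn);    // core point: keep expanding
        }
      }
      return labels; // a cluster id per photo, or -1 for noise
    }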
You can use a Delaunay triangulation to look for nearest points: it gives you a nearest-neighbor graph in which nearest neighbors are connected by Delaunay edges. Or you can cluster by color as in a photo mosaic, which uses an antipole tree. Here is a similar answer: Algorithm to find for all points in set A the nearest neighbor in set B
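For the nearest-point part, a library such as d3-delaunay keeps this short (treating lon/lat as planar x/y, which is only an approximation over small areas):

    import { Delaunay } from "d3-delaunay";
    const delaunay = Delaunay.from(lampposts, p => p.lon, p => p.lat);
    const nearest = delaunay.find(photo.lon, photo.lat); // index of the closest lamppost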

Algorithm problem: Packing rods into a row

Alright, this might be a tricky problem. It is actually an analogy for another similar problem relating to my actual application, but I've simplified it into this hypothetical problem for clarity. Here goes:
I have a line of rods that need to be sorted. Because it is a line, only one dimension is of concern.
Rods are different lengths and different weights. There is no correlation between weight and length. A small rod can be extremely heavy, while a large rod can be very light.
The rods need to be sorted by weight.
The real catch, however, is that some rods can only be placed no more than a certain distance from the start of the line, regardless of their weight. Anywhere before that is fine, though.
There is no guarantee that the constraints will be spaced far enough apart to prevent constrained rods from being squeezed into overlapping. In this (hopefully rare) case, either the rods need to be rearranged somehow within their constraints to create the needed space, or an ideal compromise solution needs to be found (such as violating a constraint on the least light rod, for example).
It is possible that at a future date additional constraints may be added, in addition to the distance constraint, to indicate specific (and even non-negotiable) boundaries within the line that rods cannot overlap into.
My current solution does not account for the latter situations, and they sound like they'll involve some complex work to resolve.
Note that this is for a client-side web application, so making the solution apply to Javascript would be helpful!
If possible, I'd suggest formulating this as a mixed integer program. If you can encode the constraints this way, you can use a solver to find an arrangement that satisfies them.
See this page for some more info on this type of approach:
http://en.wikipedia.org/wiki/Linear_programming
If you can interface a solver to JavaScript somehow, this might prove to be an elegant solution.
At first, I tried to approach this as a sorting problem. But I think it is better to think of it as an optimization problem. Let me try to formalize the problem. Given:
w_i: weight of rod i
l_i: length of rod i
m_i: maximum distance of rod i from the origin. If there is no constraint, you can set this value to sum(i=1..n) l_i
The problem is to find a permutation a_i such that the cost function
J = sum(i=1..n) w_{a_i} * sum(j=1..i-1) l_{a_j}
is minimized and the constraints
sum(j=1..i-1) l_{a_j} <= m_{a_i},  1 <= i <= n
are satisfied.
I am not sure this is a correct formulation, though. Without any constraints, the optimal solution is not always the rods sorted by weight. For example, let l={1,4}, and w={1,3}. If a={1,2}, then J is 1*0+3*1=3, and if a={2,1} (sorted by weight), J is 3*0+1*4=4. Clearly, the unsorted solution minimizes the cost function, but I am not sure this is what you want.
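To make the cost function concrete, a few lines of JavaScript reproduce that example (using 0-based indices for the permutation):

    function J(a, w, l) {
      let cost = 0, pos = 0;                     // pos = start position of the next rod
      for (const i of a) { cost += w[i] * pos; pos += l[i]; }
      return cost;
    }
    console.log(J([0, 1], [1, 3], [1, 4]));      // 3
    console.log(J([1, 0], [1, 3], [1, 4]));      // 4 (sorted by weight)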
Also, I don't know how to solve the problem yet. You could try a heuristic search of some kind in the short term. I am writing this reformulation so that someone else can provide a solution while I think more about the solution. If it is correct, of course.
Another thing to note is that you don't have to find the complete solution to see whether one exists. You can ignore the rods without position constraints and try to solve the problem with only the constrained rods. If there is a solution to this reduced problem, then the full problem also has a solution (an obvious suboptimal one is to sort the unconstrained rods by weight and append them to the solution of the reduced problem).
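That feasibility test is cheap: process only the constrained rods in order of their limits and pack them as far left as possible, in earliest-deadline-first fashion. A sketch, assuming each constraint bounds the rod's left edge (maxStart is an illustrative name):

    function constrainedRodsFit(rods) {
      // rods: [{ length, maxStart }], maxStart = furthest allowed distance from the origin
      const byLimit = [...rods].sort((a, b) => a.maxStart - b.maxStart);
      let pos = 0;                               // left edge of the next rod to place
      for (const rod of byLimit) {
        if (pos > rod.maxStart) return false;    // this rod's constraint cannot be met
        pos += rod.length;
      }
      return true;
    }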
After saying all this, I think the algorithm below would do the trick. I will describe it a bit visually to make it easier to understand. The idea is to place rods on a line segment from left to right (origin is the leftmost point of the line segment) as per your problem description.
1. Separate out the rods with position constraints on them. Then place them so that each is at the limit of its constrained position.
2. If there are no overlapping rods, go to step 4.
3. For each overlapping pair of rods, move the one closer to the origin towards the origin so that they no longer overlap. This step may require other rods on the line to be shifted towards the origin to open up some space; you detect this by checking whether the moved rod now overlaps the one just to the left of it. If you cannot create enough space (moving the rod closest to the origin to 0 still doesn't free up enough), then there is no solution to the problem. Here you have the opportunity to find a compromise solution by relaxing the constraint on the rightmost rod of the original overlapping pair: just move it away from the origin until there is no overlap (you may need to push the preceding rods right until all overlaps are fixed before you do this).
4. Now we have some rods placed, and some free space around them. Start filling the free space with the heaviest rods that fit in it (including constrained rods that lie to the right of the free space). If you cannot find any rod that fits, simply shift the next rod to the right of the free space leftwards to close the gap.
5. Repeat step 4 until you reach the rightmost constrained rod. The remaining line segment is all free space.
6. Sort all leftover rods by weight, and place them in the remaining free space.
A few notes about the algorithm:
It doesn't solve the problem I stated earlier. It tries to sort the rods according to their weights only.
I think there are some missed opportunities to do better, because we slide some rods towards the origin to make them all fit (in step 3), and sometimes pick heavy rods from these "squeezed in" places and put them closer to the origin (in step 4). This frees up some room, but we don't slide the pushed-away rods back out to the limits of their constrained positions. It may be possible to do this, but I will have to revise the algorithm when my brain is working better.
It is not a terribly efficient algorithm. I have a feeling it can be done in O(n^2), but anything better would require creative data structures: to do better, you need to be able to find the heaviest rod with length less than a given L in under O(n) time.
I am not very good at solving algorithms, but here goes my attempt:
Relate this to the knapsack problem. Instead of the return cost or value of a box, assign higher values to the rods that have tighter limits on how far they can go; something like packing everything as close to the starting point as possible, rather than into a knapsack as in the knapsack problem.
As far as the future modifications are concerned, I believe similar constraints would only require a change to the return value or cost of the box.
I'm 99% certain this can be cast as an integer knapsack problem with an extra constraint which, I think, can be accommodated by first considering the rods with the distance-from-start condition.
Here's a link to an explanation of the knapsack problem: http://www.g12.cs.mu.oz.au/wiki/doku.php?id=simple_knapsack

Variation on Travelling salesman problem - pick a good subroute from many nodes based on constraints

TL;DR version: the most important issue is this: in a TSP-style problem, instead of finding the shortest Hamiltonian cycle, what are good ways to find the best path (I suppose the one that visits the most nodes) that is at most X in length, given a fixed starting point?
Full version:
I'm interested in some ideas for a problem that involves TSP.
First, an example real-world TSP problem is when you have N geographical locations to visit and you need driving directions for an optimal (or near-optimal) route that covers them all, either as a round trip or from A to Z. There is a nice JS implementation of this at http://www.gebweb.net/optimap/ and a JS TSP solver available at http://code.google.com/p/google-maps-tsp-solver/.
Now consider that you have N = 100 - 1000+ locations. At that point you cannot calculate the route in any reasonable amount of time or with reasonable resources, and even if you could, it would not be very useful for most real-world scenarios. Say you pick a fixed starting point; based on that, from those 1000+ locations you want to generate an optimal subroute that fits into a (relatively small) max constraint (for example, a route that can be covered in one day or one week).
How can this be solved in near real time?
My thoughts so far:
1. Build the duration matrix from the starting point (this step is feasible even at a few thousand points) and pick a small subset of the points closest to the starting point. Ideally this subset should be large enough that visiting it fully would definitely exceed the max constraint, but small enough to process quickly, at least with heuristic algorithms.
2. Find an optimal route considering the locations chosen in step 1. But instead of a route that visits all points from this set, I need the best route that satisfies the max constraint, so it should not visit all points (it could visit all of them, but that would be the edge case). I'm especially not sure how best to tackle this part efficiently.
Any links, or ideas appreciated, especially for point 2.
Disclaimer: Of course the core of the problem is language-agnostic, I'm using JS/Google Maps as an example of real world application.
OK, here's my sketch of a solution, in pseudocode. You need to know about priority queues.
Define a data structure Route {
    the route taken so far, which gives:
    the score (nodes visited),
    the distance traveled,
    the remaining allowed nodes,
    the current location.
}
Define a priority queue of Routes that sorts on distance traveled.
Define a sorted set of Routes that sorts on score, called solutions.
Add a Route of length 0, at the depot, to the priority queue.
Loop until the head of the priority queue is greater than MAX:
    Route r = take the head of the priority queue
    for each potential remaining node n in r
        (in increasing order of distance from the current location):
        child = a new route: r with n appended
        if child's distance < MAX, append child to the priority queue; otherwise break
    if no children were appended, add r to solutions
This effectively does a best-first search of the solution space (ordered by distance traveled) in a reasonably memory-efficient manner. Moreover, if it is too slow, you can speed the algorithm up by looping over only the N nearest neighbors of the current location, where N is configurable. You may miss the optimal solution, but the result should be reasonable in most cases.
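A runnable JavaScript sketch of the above, with the priority queue simulated by sorting an array, and dist assumed to be a function giving the travel cost between two nodes (both are simplifications for illustration):

    function bestSubroute(nodes, start, dist, MAX) {
      let queue = [{ path: [start], traveled: 0,
                     remaining: nodes.filter(n => n !== start) }];
      const solutions = [];
      while (queue.length > 0) {
        queue.sort((a, b) => a.traveled - b.traveled);   // poor man's priority queue
        const r = queue.shift();
        const here = r.path[r.path.length - 1];
        const byDistance = [...r.remaining]
          .sort((a, b) => dist(here, a) - dist(here, b));
        let extended = false;
        for (const n of byDistance) {            // for speed, use byDistance.slice(0, N)
          const d = r.traveled + dist(here, n);
          if (d > MAX) break;                    // sorted, so no later neighbor fits either
          queue.push({ path: [...r.path, n], traveled: d,
                       remaining: r.remaining.filter(x => x !== n) });
          extended = true;
        }
        if (!extended) solutions.push(r);        // dead end: a candidate route
      }
      return solutions.reduce((a, b) => (b.path.length > a.path.length ? b : a));
    }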
