Optimized Algorithm to compare Templates of two URLs - javascript

EDITED: please read again, as I have added some of my own work.
My task is to compare the templates of two URLs. My algorithm is ready, but it takes too much time to give the final answer.
I wrote my code in Java using Jsoup and Selenium.
Here, "template" means the way a page presents its contents.
Example:
A shopping website has a page for a pair of shoes that contains:
Images in the left.
Price and Size in the right.
Reviews in the bottom.
If both URLs point to a specific product, then it returns "Both are from same templates". For example, this link and this link have the same template.
If one URL shows a product and the other URL shows a category, then it shows "No match".
For example, this link and this link are from different templates.
I think this algorithm needs some optimization, which is why I am posting this question in this forum.
My algorithm
1. Fetch and parse the two input URLs and build their DOM trees.
2. If a page contains UL or TABLE tags, remove them. I did this because the two pages may contain different numbers of items.
3. Count the number of tags on both pages, say initial_tag1 and initial_tag2.
4. Remove tags that have the same position on the corresponding pages and the same Id, together with their subtrees, if the subtree has fewer than 10 nodes.
5. Remove tags that have the same position on the corresponding pages and the same Class name, together with their subtrees, if the subtree has fewer than 10 nodes.
6. Remove tags that have no Id and no Class name, together with their subtrees, if the subtree has fewer than 10 nodes.
Steps 4, 5 and 6 have O(N*N) complexity, where N is the number of tags. [In this way, the DOM tree shrinks at every step.]
7. When the recursion finishes, count the remaining tags: final_tag1 and final_tag2.
8. If final_tag1 is less than 0.2 * initial_tag1 and final_tag2 is less than 0.2 * initial_tag2, I say the two URLs match; otherwise they do not.
I have thought a lot about this algorithm, and I found that removing nodes from the DOM tree is a pretty slow process. This may be what is slowing the algorithm down.
I discussed this with some geeks, and they said: use a score for every tag instead of removing it, add the scores up, and at the end return (score I got)/(accumulatedPoints) or something similar, and on that basis decide whether the two URLs are similar or not.
But I didn't understand this. Can you explain what they meant, or suggest any other optimized algorithm that solves this problem efficiently?
Thanks in advance. Looking forward to your kind response.

For comparing webpages there are basically two ways, the fast one and the slow one:
Compare URLs: fast
Compare DOM: slow (and complicated)
In your case, it appears that the first two items match one regular expression and the categories match another.
Here is a short Java solution:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestRegexp {
    public static void main(String[] args) {
        String URL_ITEM_1 = "http://www.jabong.com/Puma-Flash-Ind-Black-Running-Shoes-187831.html";
        String URL_ITEM_2 = "http://www.jabong.com/Lara-Karen-Full-Sleeve-Black-Polyester-Top-With-Cotton-Lace-196636.html";
        String URL_CATEGORY_1 = "http://www.jabong.com/kids/shoes/floaters/";
        String URL_CATEGORY_2 = "http://www.jabong.com/women/clothing/womens-tops/";

        // Every literal dot in the host name is escaped so it cannot match an arbitrary character.
        Pattern itemPattern = Pattern.compile("http://www\\.jabong\\.com/([\\w\\p{Punct}\\d]+)\\.html");
        Pattern categoryPattern = Pattern.compile("http://www\\.jabong\\.com/([\\w\\p{Punct}]+/)+");

        System.out.println("Matching items");
        Matcher matcher = itemPattern.matcher(URL_ITEM_1);
        System.out.println(matcher.matches());
        matcher = itemPattern.matcher(URL_ITEM_2);
        System.out.println(matcher.matches());
        matcher = itemPattern.matcher(URL_CATEGORY_1);
        System.out.println(matcher.matches());
        matcher = itemPattern.matcher(URL_CATEGORY_2);
        System.out.println(matcher.matches());

        System.out.println("Matching categories");
        Matcher category = categoryPattern.matcher(URL_ITEM_1);
        System.out.println(category.matches());
        category = categoryPattern.matcher(URL_ITEM_2);
        System.out.println(category.matches());
        category = categoryPattern.matcher(URL_CATEGORY_1);
        System.out.println(category.matches());
        category = categoryPattern.matcher(URL_CATEGORY_2);
        System.out.println(category.matches());
    }
}
And the output :
Matching items
true
true
false
false
Matching categories
false
false
true
true
It validates the first two URLs as items and the last two as categories.
I hope it matches your requirement. Feel free to adapt it to JS.
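A JavaScript adaptation could look roughly like the sketch below. JavaScript regexes have no \p{Punct} class, so the patterns here are simplified, hypothetical equivalents ("one path segment ending in .html" versus "one or more segments each ending in a slash"), and the helper names classifyUrl and sameTemplate are made up for illustration.

```javascript
// Hypothetical patterns for this one site; other sites need their own rules.
const ITEM_PATTERN = /^https?:\/\/www\.jabong\.com\/[^\/]+\.html$/;
const CATEGORY_PATTERN = /^https?:\/\/www\.jabong\.com\/(?:[^\/]+\/)+$/;

// Returns "item", "category", or "unknown" for a given URL string.
function classifyUrl(url) {
  if (ITEM_PATTERN.test(url)) return "item";
  if (CATEGORY_PATTERN.test(url)) return "category";
  return "unknown";
}

// Two URLs are treated as the same template only when both are recognised
// and fall into the same class.
function sameTemplate(urlA, urlB) {
  const kind = classifyUrl(urlA);
  return kind !== "unknown" && kind === classifyUrl(urlB);
}
```

This only works when the site's URL scheme reliably reflects the page type; for sites where it does not, a DOM comparison is still needed.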

To improve the complexity of your algorithm, supposing you are using Jsoup, you must adapt your data structures to your algorithm.
4) What do you mean by the position of a tag? The XPath of the tag?
If yes, precompute this value once for each tag, O(n) in total, and store it in each node. If required, you can also store it in a HashMap to retrieve it in O(1).
5) Index your tags by class name using a MultiMap. You will save a lot of computation.
6) Index the tags that have no Id and no class name.
All these precomputations can be performed in one traversal of the tree, so O(n).
Generally, if you want to reduce computation, you have to store more data in memory. Since DOM pages are very small data, this is no problem in your case.
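As a rough illustration of the single-traversal indexing, combined with the "score instead of removing" idea from the question, here is a minimal JavaScript sketch. It assumes each parsed page is available as a tree of plain objects ({ tag, id, className, children }); that shape, the signature format and the function names are all assumptions for illustration, not Jsoup's API.

```javascript
// Walk the tree once and record one signature per node:
// its positional path plus its id and class name.
function indexTree(root) {
  const signatures = new Set();
  const stack = [{ node: root, path: root.tag }];
  while (stack.length > 0) {
    const { node, path } = stack.pop();
    signatures.add(`${path}#${node.id || ""}.${node.className || ""}`);
    (node.children || []).forEach((child, i) => {
      // The child index makes every path unique within one page.
      stack.push({ node: child, path: `${path}/${child.tag}[${i}]` });
    });
  }
  return signatures;
}

// Score = shared signatures / size of the larger page, a value in [0, 1].
// Nothing is removed from either tree.
function templateScore(rootA, rootB) {
  const a = indexTree(rootA);
  const b = indexTree(rootB);
  let shared = 0;
  for (const signature of a) {
    if (b.has(signature)) shared++;
  }
  return shared / Math.max(a.size, b.size);
}
```

Each page is traversed once and each lookup is a constant-time set membership test, so the comparison is O(n) overall. The threshold above which two pages count as "same template" would have to be tuned on real pages.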


Testing if an array or linked list is a palindrome

I am doing a problem on Leetcode to write a function which checks to see if a supplied array is a palindrome. They seem to expect the solution to involve creating a linked list from the array and then using the linked list to check if its contents are a palindrome.
Am I right in assuming that the reason for using a linked list (other than to test your programming skills) is that it enables a more efficient (ie takes less processing power) solution than working solely with arrays?
What I find counter intuitive about that is the fact that the function takes an array as its argument so, the data is already in an array. My thinking is that it must take as much processing power to get the array into a linked list as it would take to just go through the elements in the array from each end checking each pair to see if they are equal.
In order to make the linked list you would have to access all the array elements. The only thing I can think is that accessing elements from the end of array might be more 'expensive' than from the front.
I have put my code for solving the problem with an array below:
function isPalindrome(array){
  const numberOfTests = Math.floor(array.length/2);
  for(let i = 0; i < numberOfTests; i++){
    let j = array.length - 1 - i;
    if(array[i] !== array[j]){
      return false;
    }
  }
  return true;
}
console.log(isPalindrome([1,1,1,2]))
I guess my question is why are they suggesting using linked lists to solve this problem other than to test programming skills? Is there something about my function which is less efficient than using a linked list to accomplish the same task?
Edit:
The code editor for the question is pre-populated with:
/**
 * Definition for singly-linked list.
 * function ListNode(val, next) {
 *     this.val = (val===undefined ? 0 : val)
 *     this.next = (next===undefined ? null : next)
 * }
 */
/**
 * @param {ListNode} head
 * @return {boolean}
 */
var isPalindrome = function(head) {
};
Also from the question:
The number of nodes in the list is in the range [1, 10^5].
0 <= Node.val <= 9
Follow up: Could you do it in O(n) time and O(1) space?
I am not exactly sure what this all means, but I interpreted it as suggesting that there are performance issues involved with the algorithm and that using linked lists would be a good way to address them.
The problem is at: https://leetcode.com/problems/palindrome-linked-list/
The code challenge is saying that you are "given the head of a singly linked list". So it is not an array. The misunderstanding may come from the way that LeetCode represents a linked list: they use an array notation for it. But be assured that your function will be called with a linked list, not an array.
Am I right in assuming that the reason for using a linked list (other than to test your programming skills) is that it enables a more efficient (ie takes less processing power) solution than working solely with arrays?
No, it is only for testing programming skills.
What I find counter intuitive about that is the fact that the function takes an array as its argument
This is where you got the code challenge wrong. Look at the description ("Given the head of a singly linked list"), and look at the template code you get to start from (the parameter is named head, not array).
Is there something about my function which is less efficient than using a linked list to accomplish the same task?
Your function will not work. The argument does not have a length property since it is not an array. The argument is an instance of ListNode or null.
In your code you included a call of your function. But that is not how LeetCode will call your function. It will not be called like:
isPalindrome([1, 2, 2, 1])
But like:
isPalindrome(new ListNode(1,
new ListNode(2,
new ListNode(2,
new ListNode(1)))))
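For the follow-up quoted in the question (O(n) time and O(1) space), one common approach is to find the middle of the list with a slow and a fast pointer, reverse the second half in place, and then compare the two halves. A minimal sketch, reusing the ListNode definition from the template; note that it leaves the second half of the list reversed, which may or may not be acceptable in other settings:

```javascript
function ListNode(val, next) {
  this.val = (val === undefined ? 0 : val);
  this.next = (next === undefined ? null : next);
}

function isPalindrome(head) {
  // 1. Find the middle: slow advances one node per step, fast advances two.
  let slow = head;
  let fast = head;
  while (fast !== null && fast.next !== null) {
    slow = slow.next;
    fast = fast.next.next;
  }

  // 2. Reverse the list from the middle onwards, in place.
  let prev = null;
  while (slow !== null) {
    const next = slow.next;
    slow.next = prev;
    prev = slow;
    slow = next;
  }

  // 3. Walk inwards from both ends; prev is now the head of the reversed half.
  let left = head;
  let right = prev;
  while (right !== null) {
    if (left.val !== right.val) return false;
    left = left.next;
    right = right.next;
  }
  return true;
}
```

Only a fixed number of pointers is used, so the extra space is constant, and each node is visited a bounded number of times.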
From the way you have described it, no matter if we analyse this issue via big-O time complexity or empirical performance, there is no real reason to convert it to a linked list first. It will definitely slow your program down.
This is relatively easy to comprehend: in order to create the linked list, you have to access the whole array. How is this slower than accessing the array elements to determine if it is a palindrome? In terms of array access operations, we are accessing each of the array elements at most once (ideally), in each case. However, with the linked list approach we also have to spend time to create the linked list and then determine if that is a palindrome.
It's like if you're doing a math question, instead of doing it on the piece of paper it was given on, copying it to a piece of parchment first and doing it there. You aren't saving time.
Albeit, the time complexity for both should be O(N) worst-case, and their runtimes should not differ drastically as the difference is only a small constant.
Converting to a linked list is probably only for demonstrative reasons, not performance reasons.
I'll start by reiterating my comment for some context:
One of LeetCode's goals is to help you learn common algorithms, programming patterns, and data structures (language agnostic) in a puzzle-oriented way. There's nothing wrong with your approach, except that the input is not an array, so it is not valid for the problem constraints. The main purpose of this problem is for you to understand what a singly-linked list data structure is and to begin to learn about big O notation.
Based on the details of your question and your follow-up comments, it sounds like you're having trouble with the first part: understanding the structure of a singly-linked list. This is understandable if your experience is in JavaScript: a singly-linked list is not a common data structure in comparison to arrays.
Included in the description details of the problem that you linked to, is the following:
Example 1:
Input: head = [1,2,2,1]
Output: true
The way that the head input argument is shown in the text uses the same syntax as an array of numbers in JavaScript. This is only an abstract (theoretical way of looking at things) representation of a linked list. It does NOT mean literally:
const head = [1, 2, 2, 1];
A linked list is a nested structure of nodes, each having a value and (maybe) a child node. The head input example actually looks like this JavaScript data structure:
const head = {
  val: 1,
  next: {
    val: 2,
    next: {
      val: 2,
      next: {
        val: 1,
        next: null,
      },
    },
  },
};
This might seem new/confusing to you (and that's ok). This data structure is much more common in some other languages. There will be other problems on LeetCode that will be more familiar to you (and less familiar to programmers who work in those languages): it's part of the challenge and enjoyment of learning.
If the site content maintainers ever consider updating the problem details for each code puzzle, it might be a good idea to provide custom description information based on which language is selected, so that this kind of confusion happens less often.

D3 V4 Tree Search and Highlight

So, I really love this example from Jake Zieve shown here: https://bl.ocks.org/jjzieve/a743242f46321491a950
Basically, on search for a term, the path to that node is highlighted. I would like to accomplish something similar but with the following caveats:
I would like to stay in D3 v4.
I'm concerned about cases where the path doesn't clear out on the next node pick OR what happens when there are two nodes of the same name (I would ideally like to highlight all paths)
I would like to AVOID using JQuery
Given a set search term (assume you're already getting the string from somewhere) I know I need to make use of the following lines specifically (you can see my stream of consciousness in the comments) but I'm just not quite sure where to start.
// Returns array of link objects between nodes.
var links1 = root.descendants().slice(1); //slice to get rid of company.
console.log(links1); //okay, this one is nice because it gives a depth number, this describes the actual link info, including the value, which I am setting link width on.
var links2 = root.links(); // to get objects with source and target properties. From here, I can pull in the parent name from a selected target, then iterate again back up until I get to source. Problem: what if I have TWO of the same named nodes???
console.log(links2);
Thoughts on this? I'll keep trying on my own, but I keep hitting roadblocks. My code can be found here: https://jsfiddle.net/KateJean/7o3suadx/
[UPDATE]
I was able to add a filter to links2 to pull out a specific entry.
For example:
var searchTerm = "UX Designer"
var match = links2.filter(el => el.target.data.name === searchTerm); //full entry
console.log(match);
This single entry gives me all associated info, including the full list of all points back up to "COMPANY"
So, I can GET the data. I think the best way to accomplish what I want is to somehow add a class to each of these elements and then style on that "active" class.
Thank you!
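One possible way to get from the matched entries to an "active" class is to collect every node that lies on a path from a match up to the root, and then class the links by membership in that set. The sketch below relies only on the data, parent and children properties that d3.hierarchy nodes expose; the function name nodesOnMatchingPaths and the "path.link" selector are assumptions for illustration.

```javascript
// Collect every node on the path from each matching node up to the root.
// Handles several nodes with the same name: all of their paths are included.
function nodesOnMatchingPaths(root, searchTerm) {
  const onPath = new Set();
  const stack = [root];
  while (stack.length > 0) {
    const node = stack.pop();
    if (node.data.name === searchTerm) {
      // Follow parent pointers until we run past the root.
      for (let n = node; n; n = n.parent) onPath.add(n);
    }
    (node.children || []).forEach(child => stack.push(child));
  }
  return onPath;
}

// In the page, assuming each link element is bound to its child node
// (as with root.descendants().slice(1)):
//   const onPath = nodesOnMatchingPaths(root, "UX Designer");
//   d3.selectAll("path.link").classed("active", d => onPath.has(d));
```

Because classed is given a function, links that are not on any matching path get the class removed, which should also clear the highlight left over from a previous search.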

Regex on multiple lines of random order

I am using BIRT to design a report based on a database, and one of the fields of the form contains multiple lines, like this:
Site: Place ThePlace
Room: D2 RMD3
InstanceId: OI-RandomChars
The catch is that they are not always in this order, as it is user input (from another form, not a BIRT prompt).
And please note that these fields contain capital letters.
So what I want to do is extract the Site, the Room, and the InstanceId into three separate columns, using three regexes in JavaScript.
I have tried many things, like catching each row until the end of the line or playing around with substrings under various conditions, and so far I think the best approach is to use the string function replace to remove everything other than what I want to catch.
An example for the row Room would be :
row["Log"].replace(/?![Room:\s\S*\s]/, "")
I get an error with this but you can see what I try to do.
Thanks for all the consideration about my problem.
A single regex can become overly complicated and hard to read and maintain for this kind of job.
I would probably consider doing it programmatically, like your first instinct was.
First I would split the string into lines:
var lines = string.split(/[\r\n]+/);
This splits the text on line breaks (single or doubled) into an array:
["Site: Place ThePlace", "Room: D2 RMD3", "InstanceId: OI-RandomChars"]
Then cycle through all your lines and make your checks.
Your check can be a regex if you want, or use substring.
This is an example:
var site, room, InstanceId;
var siteCheckRegex = new RegExp("^Site:");
for (var i = 0; i < lines.length; i++) {
  if (siteCheckRegex.test(lines[i])) {
    // Drop the label and the whitespace after it, keeping only the value
    site = lines[i].replace(/^Site:\s*/, "");
  }
  [...]
}
It actually depends on what you really want to get out of it and the problem and difference you can find in the user input data.
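If one regex per column is still preferred, a multiline-anchored pattern avoids depending on the order of the lines. This is a minimal sketch in ES5-style JavaScript (on the assumption that BIRT's script engine may not support newer syntax); extractField is a hypothetical helper, and the field name is assumed to contain no regex metacharacters.

```javascript
// Return the value of a "Name: value" line anywhere in a multi-line string,
// or null when the field is absent.
function extractField(log, name) {
  // The "m" flag lets ^ and $ match at every line boundary.
  // [ \t]* is used instead of \s* so that an empty value cannot
  // swallow the line break and capture the following line.
  var pattern = new RegExp("^[ \\t]*" + name + ":[ \\t]*(.*?)[ \\t]*$", "m");
  var match = pattern.exec(String(log));
  return match ? match[1] : null;
}
```

Each computed column would then call it with its own label, for example extractField(row["Log"], "Room").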

Lucene-like searching through JSON objects in JavaScript

I have a pretty big array of JSON objects (it's a music library with properties like artist, album etc., feeding a jqGrid with loadonce=true) and I want to implement Lucene-like (Google-like) queries over the whole set, but locally, i.e. in the browser, without communicating with the web server. Are there any JavaScript frameworks that will help me?
Go through your records to create a one-time index by combining all searchable fields into a single string field called index.
Store these indexed records in an Array.
Partition the Array on index, e.g. all a's in one array and so on.
Use the JavaScript function indexOf() against the index to match the query entered by the user and find records from the partitioned Array.
That was the easy part but, it will support all simple queries in a very efficient manner because the index does not have to be re-created for every query and indexOf operation is very efficient. I have used it for searching up to 2000 records. I used a pre-sorted Array. Actually, that's how Gmail and yahoo mail work. They store your contacts on browser in a pre-sorted array with an index that allows you to see the contact names as you type.
This also gives you a base to build on. Now you can write an advanced query parsing logic on top of it. For example, to support a few simple conditional keywords like - AND OR NOT, will take about 20-30 lines of custom JavaScript code. Or you can find a JS library that will do the parsing for you the way Lucene does.
For a reference implementation of above logic, take a look at how ZmContactList.js sorts and searches the contacts for autocomplete.
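The indexing steps above can be sketched as follows. This is a simplified illustration, not the ZmContactList.js logic: it skips the partitioning step, treats a multi-word query as an implicit AND of substrings, and the names buildIndex and search are made up.

```javascript
// Build the one-time index: one lower-cased string per record,
// made by joining the chosen searchable fields.
function buildIndex(records, fields) {
  return records.map(record => ({
    record: record,
    index: fields.map(f => String(record[f] || "")).join(" ").toLowerCase(),
  }));
}

// Return the records whose index string contains every query term.
function search(indexed, query) {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  return indexed
    .filter(entry => terms.every(term => entry.index.indexOf(term) !== -1))
    .map(entry => entry.record);
}

// Usage: build the index once, then query it on every keystroke.
//   const indexed = buildIndex(library, ["artist", "album"]);
//   const hits = search(indexed, "blue train");
```

OR and NOT keywords could be layered on top by parsing the query into groups of terms before calling indexOf.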
You might want to check FullProof, it does exactly that:
https://github.com/reyesr/fullproof
Have you tried CouchDB?
Edit:
How about something along these lines (also see http://jsfiddle.net/7tV3A/1/):
var filtered_collection = [];
var query = 'foo';
$.each(collection, function(i,e){
  $.each(e, function(ii, el){
    if (el == query) {
      filtered_collection.push(e);
    }
  });
});
The (el == query) part of course could/should be modified to allow more flexible search patterns than exact match.

jQuery "Autocomplete" plugin is messing up the order of my data

I'm using Jorn Zaefferer's Autocomplete plugin on a couple of different pages. In both instances, the order of displayed strings is a little bit messed up.
Example 1: array of strings: basically they are in alphabetical order except for General Knowledge which has been pushed to the top:
General Knowledge,Art and Design,Business Studies,Citizenship,Design and Technology,English,Geography,History,ICT,Mathematics,MFL French,MFL German,MFL Spanish,Music,Physical Education,PSHE,Religious Education,Science,Something Else
Displayed strings:
General Knowledge,Geography,Art and Design,Business Studies,Citizenship,Design and Technology,English,History,ICT,Mathematics,MFL French,MFL German,MFL Spanish,Music,Physical Education,PSHE,Religious Education,Science,Something Else
Note that Geography has been pushed to be the second item, after General Knowledge. The rest are all fine.
Example 2: array of strings: as above but with Cross-curricular instead of General Knowledge.
Cross-curricular,Art and Design,Business Studies,Citizenship,Design and Technology,English,Geography,History,ICT,Mathematics,MFL French,MFL German,MFL Spanish,Music,Physical Education,PSHE,Religious Education,Science,Something Else
Displayed strings:
Cross-curricular,Citizenship,Art and Design,Business Studies,Design and Technology,English,Geography,History,ICT,Mathematics,MFL French,MFL German,MFL Spanish,Music,Physical Education,PSHE,Religious Education,Science,Something Else
Here, Citizenship has been pushed to the number 2 position.
I've experimented a little, and it seems like there's a bug saying "put things that start with the same letter as the first item after the first item and leave the rest alone". Kind of mystifying. I've tried a bit of debugging by triggering alerts inside the autocomplete plugin code but everywhere i can see, it's using the correct order. it seems to be just when its rendered out that it goes wrong.
Any ideas anyone?
max
EDIT - reply to Clint
Thanks for pointing me at the relevant bit of code btw. To make diagnosis simpler i changed the array of values to ["carrot", "apple", "cherry"], which autocomplete re-orders to ["carrot", "cherry", "apple"].
Here's the array that it generates for stMatchSets:
stMatchSets = ({'':[#1={value:"carrot", data:["carrot"], result:"carrot"}, #3={value:"apple", data:["apple"], result:"apple"}, #2={value:"cherry", data:["cherry"], result:"cherry"}], c:[#1#, #2#], a:[#3#]})
So, it's collecting the first letters together into a map, which makes sense as a first-pass matching strategy. What i'd like it to do though, is to use the given array of values, rather than the map, when it comes to populating the displayed list. I can't quite get my head around what's going on with the cache inside the guts of the code (i'm not very experienced with javascript).
SOLVED - i fixed this by hacking the javascript in the plugin.
On line 549 (or 565) we return a variable csub, which is an object holding the matching data. Before it's returned, I reorder it so that the order matches the original array of values we were given, i.e. the one we used to build the index in the first place, which I had put into another variable:
csub = csub.sort(function(a,b){ return originalData.indexOf(a.value) - originalData.indexOf(b.value); })
hacky but it works. Personally i think that this behaviour (possibly coded more cleanly) should be the default behaviour of the plugin: ie, the order of results should match the original passed array of possible values. That way the user can sort their array alphabetically if they want (which is trivial) to get the results in alphabetical order, or they can preserve their own 'custom' order.
What I did instead of your solution was to add
if (!q && data[q]){return data[q];}
just above
var csub = [];
found in line ~535.
What this does, if I understood correctly, is fetch the cached data for when the input is empty, as set up in line ~472: stMatchSets[""] = []. Assuming that the cached data for the empty input is the data you provided in the first place, then it's all good.
I'm not sure about this autocomplete plugin in particular, but are you sure it's not just trying to give you the best match possible? My autocomplete plugin does some heuristics and does reordering of that nature.
Which brings me to my other answer: there are a million jQuery autocomplete plugins out there. If this one doesn't satisfy you, I'm sure there is another that will.
edit:
In fact, I'm completely certain that's what it's doing. Take a look around line 474:
// loop through the array and create a lookup structure
for ( var i = 0, ol = options.data.length; i < ol; i++ ) {
/* some code */
var firstChar = value.charAt(0).toLowerCase();
// if no lookup array for this character exists, look it up now
if( !stMatchSets[firstChar] )
and so on. So, it's a feature.
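A simplified model of that lookup structure shows how the reordering reported in the question can arise. This is not the plugin's actual code; it assumes the displayed list is rebuilt by walking the per-letter buckets in the order they were created, and the function names are made up.

```javascript
// Group values by lower-cased first character, keeping the full list under "".
function buildMatchSets(values) {
  const sets = { "": [] };
  values.forEach(value => {
    const firstChar = value.charAt(0).toLowerCase();
    if (!sets[firstChar]) sets[firstChar] = [];
    sets[firstChar].push(value);
    sets[""].push(value);
  });
  return sets;
}

// Rebuild a flat list by visiting each letter bucket in creation order.
function flattenByBucket(sets) {
  return Object.keys(sets)
    .filter(key => key !== "")
    .reduce((out, key) => out.concat(sets[key]), []);
}
```

With ["carrot", "apple", "cherry"] the "c" bucket is created first and holds both carrot and cherry, so a walk over the buckets yields carrot, cherry, apple, while the "" bucket still holds the original order.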
