I have written this regexp: <(a*)\b[^>]*>.*?</\1>
and tested it on this regexp testing site: http://gskinner.com/RegExr/?2tntr
The point of the regexp is to go through a site's HTML and find all of the links. It should then return these in an array for me to manipulate.
On the regexp testing site it works perfectly, but when put in action with JavaScript on my site it returns null.
JavaScript looks like this:
var data = $('#mainDivOnMiddleOfPage').html();
var pattern = "<(a*).*href=.*>.*</a>";
var modi = "g";
var patt = new RegExp(pattern, modi);
var result = patt.exec(data);
jQuery gets the content of the page. This is tested and verified.
The question is: why does this return null in JavaScript when it returns the expected matches in the regexp tester?
All <a> links:
<a[^>]*?\bhref=['\"](.*?)['\"]
Absolute links only (starting with http):
<a[^>]*?\bhref=['\"](http.*?)['\"]
JavaScript code:
var html = '<a href="test.html">';
var m = html.match(/<a[^>]*?\bhref=['"](.*?)['"]/);
console.log(m[1]); // the captured href value
See and test the code here.
I use the following code to do the same thing and it works for me, try it out
// innerHTML keeps the markup; textContent would strip the tags the regex looks for
var data = document.getElementById('mainDivOnMiddleOfPage').innerHTML;
var result = data.match(/<(a*).*href=.*>.*<\/a>/);
Going to go ahead and post this here, since I think it's what you want -- it is not a RegEx solution, however.
$(function(){
$.ajax({
url: "test.htm",
success: function(data){
var array_of_links = $.makeArray($("a",data));
// do your stuff here
}
});
});
I'm conscious an answer has been chosen. However it's worth mentioning that the current REGEX solutions match the tags but not the actual HREFs in isolation.
This is where JavaScript is awkward: when the global g flag is specified, String.match() returns only the full matches and drops the captured sub-groups.
One way round this is to exploit the REGEX replacement callback. This will get just the link HREFs, not the tags.
var html = document.body.innerHTML,
links = [];
html.replace(/<a[^>]*?href=('|")(.*?)\1/gi, function($0, $1, $2) {
links.push($2);
});
//links is now an array of hrefs
It also uses a back-reference to close the href attribute, i.e. making sure both opening and closing quote are single or double, not mixed.
Sidenote: as others have mentioned, where possible, you'd want to DOM this rather than REGEX.
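Another way to get at the sub-groups of every match is to call exec() repeatedly on a global regex, since each call returns one match together with its groups. A minimal sketch, assuming the HTML is available as a string; extractHrefs and the sample markup are hypothetical names made up for illustration:

```javascript
// Hypothetical helper (not from the answers above): collect every href value
// by calling exec() in a loop on a regex with the g flag.
function extractHrefs(html) {
  // Group 1 is the opening quote, group 2 the href, \1 requires the same closing quote
  var re = /<a\b[^>]*?\bhref=(['"])(.*?)\1/gi;
  var hrefs = [];
  var m;
  // Each exec() call resumes from re.lastIndex and returns null when no match is left
  while ((m = re.exec(html)) !== null) {
    hrefs.push(m[2]);
  }
  return hrefs;
}

// Made-up sample markup
var sample = '<p><a href="a.html">A</a> <a class="x" href=\'http://example.com/b\'>B</a></p>';
console.log(extractHrefs(sample));
```

As with the replace() callback, this only works on the markup as text, so the DOM route is still preferable when a document is available.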
"The point of the regexp is to go through a sites HTML and find all of the links. It should then return these in an Array for me to manipulate."
I won't add another regex answer, but just want to point out that if you have hold of the document (not just the HTML) then it's easier to walk through the links collection. That contains all <a href="">'s but also all <area> elements:
for (var link, links = document.links, n = links.length, i=0; i<n; i++){
link = links[i];
switch (link.tagName){
case "A":
//do something with the link
break;
case "AREA":
//do something with the area.
break;
}
}
Your problem is that you are not compiling your regex:
patt.compile();
You have to call it before using with the exec() method.
I have an Array with one or more entries. Each one is a string (List of urls in open Tabs via Firefox SDK). I want to check if a specific url is already opened in some of the tabs (nothing special till now).
My problem is that the url in the tab list can have four different forms. For example:
Url I want to find in the tablist:
https://cmsr-author.de/cf#/content/test/de.html
But the url can also look like this:
https://cmsr-author.de/content/test/de.html
https://cmsr-author.de/test/de.html
https://cmsr-author.de/cf#/test/de.html
Of course the last part of the url (after /test/...) is always something different. If I am not able to find one of the four urls in the tab list, I want to call some other action.
My Solution till now is to build some if-chain:
if (res !== url1) {
if (res !== url2) {
if ...
But I thought there must be some more elegant way. Maybe via RegEx? I already have a capture to catch the first part (which stays the same https://cmsr-author.ws...) with its four forms. But I don't know how to implement this properly.
var urls = ["https://cmsr-author.de/content/test/de.html","https://cmsr-author.de/test/de.html","https://cmsr-author.de/cf#/test/de.html"]
var filtered = urls.filter(function(url)
{
return url.indexOf("cf#") > -1 && url.endsWith("/test/de.html")
})
var contains = filtered.length > 0
console.log(contains)
If you want to use regex you can do this by using groups for the middle part, which is explained in detail here: http://www.regular-expressions.info/refcapture.html
Practically, your regex would look something like that:
https:\/\/cmsr-author\.de\/(content|...|...)\/de\.html
Where ... must be replaced by the middle parts of the url which differ.
Note that | is "or" used to provide multiple possibilities within the group. The character / and . must be escaped since they have special roles in regex.
I hope that helps!
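One way to avoid the if-chain is to reduce every form of the URL to the same canonical path and then compare. This is only a sketch under the assumption that the four forms differ solely by an optional cf#/ and an optional content/ prefix, as in the question; canonicalPath and isOpenInTabs are hypothetical helper names:

```javascript
// Hypothetical helper: strip the optional "cf#/" and "content/" prefixes so that
// all four URL forms reduce to the same path. Returns null for other hosts.
function canonicalPath(url) {
  var m = /^https:\/\/cmsr-author\.de\/(?:cf#\/)?(?:content\/)?(.*)$/.exec(url);
  return m ? m[1] : null;
}

// True if any tab URL points at the same page as wantedUrl, in any of the forms
function isOpenInTabs(tabUrls, wantedUrl) {
  var wanted = canonicalPath(wantedUrl);
  if (wanted === null) {
    return false;
  }
  return tabUrls.some(function (tabUrl) {
    return canonicalPath(tabUrl) === wanted;
  });
}

var tabs = ['https://cmsr-author.de/content/test/de.html', 'https://example.org/'];
console.log(isOpenInTabs(tabs, 'https://cmsr-author.de/cf#/content/test/de.html'));
```

If the check returns false, that is the place to call the other action.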
My English is not good, so I may not fully understand what you mean. As I understand it, you need a regular expression that matches only the first form. If I am wrong, please @ me.
I hope that helps!
// No g flag: test() on a global regex carries lastIndex over between calls
var reg = /^https:\/\/cmsr\-author\.de\/cf#\/(?:\w+\/)+test\/de\.html$/i;
var str1 = "https://cmsr-author.de/cf#/content/test/de.html";
var str2 = "https://cmsr-author.de/content/test/de.html";
var str3 = "https://cmsr-author.de/test/de.html";
var str4 = "https://cmsr-author.de/cf#/test/de.html";
console.log(reg.test(str1));
console.log(reg.test(str2));
console.log(reg.test(str3));
console.log(reg.test(str4));
I would like to build my own translation function in javascript.
I already have a function language.lookup(key) which translates a word or expression:
var frenchHello = language.lookup('hello') //'bonjour'
Now I would like to write a function which takes a html string and translates it with my lookup function. In the html string I will have a special syntax for example #[translationkey] that will point out that this word should be translated.
This is the result I want:
var html = '<div><span>#[hello]</span><span>#[sir]</span></div>'
language.translate(html) //'<div><span>bonjour</span><span>monsieur</span></div>'
How would I write language.translate?
My idea is to filter out my special syntax with regex and then run language.lookup on each key. Maybe with string replace or something.
I suck when it comes to regex and I've only come up with a very incomplete example, but I include it anyway so maybe someone gets the idea of what I am trying to do. If there is a better but completely different solution, that is more than welcome.
var value = "#[hello], nice to see you.";
lookup = function(word){
return "bonjour";
};
var res = new RegExp( "\\b(hello)\\b", "gi" ).exec(value)
for (var c1 = 0; c1 < res.length; c1++){
value = value.replace(res[c1], lookup(res[c1]))
}
alert(value) //#[bonjour], nice to see you.
The regex should of course not filter out the word hello but the syntax and then collect the key by grouping or similar.
Can anyone help?
Just pass a function as the second argument to String.replace: it is called for each match to generate the replacement text. Combine that with a global regexp matching your syntax:
var value = "#[hello], #[sir], nice to see you.";
lookup = function(full_match, word){
if(word == 'hello')
return "bonjour";
if(word == 'sir')
return "monsieur"
};
console.log(value.replace(/#\[(.+?)\]/gi, lookup))
Result:
bonjour, monsieur, nice to see you.
Of course when your replacement list gets bigger, you'd better use lookup object instead of series of ifs in lookup function, but you can really do whatever you want there.
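The lookup-object variant mentioned above might look roughly like this. It is a sketch, not the asker's real language module: the dictionary contents and the translate function are assumptions made up for illustration, and unknown keys are deliberately left as they are.

```javascript
// Hypothetical dictionary standing in for language.lookup
var dictionary = { hello: 'bonjour', sir: 'monsieur' };

// Replace every #[key] token with its dictionary entry
function translate(html) {
  return html.replace(/#\[(.+?)\]/g, function (fullMatch, key) {
    // Keep the original token when the key has no translation
    return Object.prototype.hasOwnProperty.call(dictionary, key)
      ? dictionary[key]
      : fullMatch;
  });
}

console.log(translate('<div><span>#[hello]</span><span>#[sir]</span></div>'));
```

Keeping the untranslated token visible makes missing dictionary entries easy to spot on the page.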
You can try this to find all occurrences:
var re = new RegExp('#\\[([^\\]]+?)\\]', 'gi'),
str = '#[value1] plain text #[value2]',
match;
while (match = re.exec(str)) {
console.log(match);
}
You could use something like:
#\\[[^\\]]*\\]
Which matches the hash followed by an opening square bracket followed by zero or more characters NOT including the closing square bracket, followed by a closed square bracket.
Alternatively, perhaps it would be better to handle the translation at the server side (maybe even through your template engine) and send back to your client the translated response. Otherwise, (depending on the specific problem you are dealing with of course), you might end up sending a lot of data to the browser which might make your application respond slowly.
EDIT:
Here is a working piece of code:
var q="This #[ANIMAL1] was eaten by that #[ANIMAL2]";
var u = {"#[ANIMAL1]":"Lion","#[ANIMAL2]":"Frog"};
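function insertAnimal(aString, lookup){
// Replace every #[...] token in one pass. Mutating aString inside an
// exec() loop would leave the regex's lastIndex pointing at the wrong offset.
return aString.replace(/#\[[^\]]*\]/g, function(token){
// Leave tokens without a lookup entry untouched
return lookup.hasOwnProperty(token) ? lookup[token] : token;
});
}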
function main(){
alert(insertAnimal(q,u));
}
You can call the "main()" from an HTML document's body onload event
I can compare your requirement to 'resolving template texts within content'. If it is feasible to use jQuery, you should try Handlebars.js.
Need help! I've been looking for a solution for this seemingly simple task but can't find an exact one. Anyway, I'm trying to add a custom #id to the <body> tag based on the page's URL. The script I'm using works ok when the URLs are like the ones below.
- http://localhost.com/index.html
- http://localhost.com/page1.html
- http://localhost.com/page2.html
-> on this level, <body> gets ids like #index, #page1, #page2, etc...
My question is, how can I make the body #id still as #page1 or #page2 even when viewing subpages like this?
- http://localhost.com/page1/subpage1
- http://localhost.com/page2/subpage2
Here's the JS code I'm using (found online)
$(document).ready(function() {
var pathname = window.location.pathname;
var getLast = pathname.match(/.*\/(.*)$/)[1];
var truePath = getLast.replace(".html","");
if(truePath === "") {
$("body").attr("id","index");
}
else {
$("body").attr("id",truePath);
}
});
Thanks in advance!
edit: Thanks for all the replies! Basically I just want to put custom background images on every page based on its body#id. >> js noob here.
http://localhost.com/page2/subpage2 - > my only problem is how to make the id as #page2 and not #subpage2 on this link.
Using the javascript split function might be of help here. For example (untested, but the general idea):
var url = window.location.href.replace(/http[s]?:\/\//, '').replace('.html', '');
var segments = url.split('/');
// segments[0] is the host name, so the first path segment is segments[1]
$('body').attr('id', segments[1]);
Also, you might want to consider using classes instead of ID's. This way you could assign every segment as a class...
var url = window.location.href.replace(/http[s]?:\/\//, '').replace('.html', '');
var segments = url.split('/');
for (var i = 0; i < segments.length; i++) {
$('body').addClass(segments[i]);
}
EDIT:
Glad it worked. Couple of notes if you're planning on using this for-real: If you ever have an extension besides .html that will get picked up in the class name. You can account for this by changing that replace to a regex...
var url = window.location.href.replace(/http[s]?:\/\//, '');
// Trim extension
url = url.replace(/\.(htm[l]?|asp[x]?|php|jsp)$/,'');
If there will ever be querystrings on the URL you'll want to filter those out too (this is the one regex I'm not 100% on)...
url = url.replace(/\?.+$/,'');
Also, it's a bit inefficient to have the $('body') on every pass through the for loop, as this causes jQuery to re-find the body tag each time. A more performant way to do this, especially if the sub folders end up 2 or 3 deep, would be to find it once, then "cache" it in a variable like so:
var $body = $('body');
for ( ... ) {
$body.addClass( ...
}
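Taken together, the clean-up steps from this answer could be wrapped in one small function that works on the URL as a string. This is only a sketch: pathSegments is a hypothetical name, the extension list is the one suggested above, and stripping the #fragment along with the query string is an extra assumption.

```javascript
// Hypothetical helper: return the path segments of a URL without protocol,
// host, query string, fragment, or file extension.
function pathSegments(href) {
  var url = href.replace(/^https?:\/\//, '');          // drop the protocol
  url = url.replace(/[?#].*$/, '');                     // drop query string and fragment
  url = url.replace(/\.(html?|aspx?|php|jsp)$/, '');    // drop a trailing extension
  return url.split('/').slice(1).filter(function (segment) {
    return segment !== '';                              // skip the host and empty segments
  });
}

console.log(pathSegments('http://localhost.com/page2/subpage2.html?x=1'));
```

The first element of the result is the one to use for the body id, and the whole array can be added as classes with the cached $body.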
Your regex is only going to select the last part of the url.
var getLast = pathname.match(/.*\/(.*)$/)[1];
You're matching anything (.*), followed by a slash, followed by anything (this time, capturing this value) and then pulling out the first match, which is the only match.
If you really want to do this (and I have my doubts, this seems like a bad idea) then you could just use window.location.pathname, since that already has the fullpath in there.
edit: You really shouldn't need to do this because the URL for the page is already a unique identifier. I can't really think of any situation where you'd need to have a unique id attribute for the body element on a page. Anytime where you're dealing with that content (either from client side javascript, or from a scraper) you should already have a unique identifier - the URL.
What are you actually trying to do?
Try the following. Basically, it sets the id to whatever folder or filename appears after the domain, but won't include a file extension.
$(document).ready(function() {
$("body").attr("id",window.location.pathname.split("/")[1].split(".")[0]);
});
You want to get the first part of the path instead of the last:
var getFirst = pathname.match(/^\/([^\/]*)/)[1];
If your pages all have a common name as in your example ("page"), you could modify your script including changing your match pattern to include that part:
var getLast = pathname.match(/\/(page\d+)\//)[1];
The above would match "page" followed by a number of digits (omitting the 'html' ending too).
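For the general case, where the folders are not all named "page" plus a number, the same idea can be written as a small pure function that is easy to check outside the browser. A sketch, assuming the id should be the first path segment without its extension and "index" for the site root; bodyIdFromPath is a hypothetical name:

```javascript
// Hypothetical helper: derive the body id from a pathname such as
// window.location.pathname.
function bodyIdFromPath(pathname) {
  // "/page2/subpage2".split("/") gives ["", "page2", "subpage2"]
  var first = pathname.split('/')[1] || '';
  // Remove a trailing file extension such as ".html"
  var id = first.replace(/\.[^.]*$/, '');
  return id === '' ? 'index' : id;
}

console.log(bodyIdFromPath('/page2/subpage2'));
```

In the page itself this would be used as $("body").attr("id", bodyIdFromPath(window.location.pathname)); inside the ready handler.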
Why would the below eliminate the whitespace around matched keyword text when replacing it with an anchor link? Note, this error only occurs in Chrome, and not firefox.
For complete context, the file is located at: http://seox.org/lbp/lb-core.js
To view the code in action (no errors found yet), the demo page is at http://seox.org/test.html. Copy/Pasting the first paragraph into a rich text editor (ie: dreamweaver, or gmail with rich text editor turned on) will reveal the problem, with words bunched together. Pasting it into a plain text editor will not.
// Find page text (not in links) -> doxdesk.com
function findPlainTextExceptInLinks(element, substring, callback) {
for (var childi= element.childNodes.length; childi-->0;) {
var child= element.childNodes[childi];
if (child.nodeType===1) {
if (child.tagName.toLowerCase()!=='a')
findPlainTextExceptInLinks(child, substring, callback);
} else if (child.nodeType===3) {
var index= child.data.length;
while (true) {
index= child.data.lastIndexOf(substring, index);
if (index===-1 || limit.indexOf(substring.toLowerCase()) !== -1)
break;
// don't match an alphanumeric char
var dontMatch =/\w/;
if(child.nodeValue.charAt(index - 1).match(dontMatch) || child.nodeValue.charAt(index+keyword.length).match(dontMatch))
break;
// alert(child.nodeValue.charAt(index+keyword.length + 1));
callback.call(window, child, index)
}
}
}
}
// Linkup function, call with various type cases (below)
function linkup(node, index) {
node.splitText(index+keyword.length);
var a= document.createElement('a');
a.href= linkUrl;
a.appendChild(node.splitText(index));
node.parentNode.insertBefore(a, node.nextSibling);
limit.push(keyword.toLowerCase()); // Add the keyword to memory
urlMemory.push(linkUrl); // Add the url to memory
}
// lower case (already applied)
findPlainTextExceptInLinks(lbp.vrs.holder, keyword, linkup);
Thanks in advance for your help. I'm nearly ready to launch the script, and will gladly comment in kudos to you for your assistance.
It's not anything to do with the linking functionality; it happens to copied links that are already on the page too, and the credit content, even if the processSel() call is commented out.
It seems to be a weird bug in Chrome's rich text copy function. The content in the holder is fine; if you cloneContents the selected range and alert its innerHTML at the end, the whitespaces are clearly there. But whitespaces just before, just after, and at the inner edges of any inline element (not just links!) don't show up in rich text.
Even if you add new text nodes to the DOM containing spaces next to a link, Chrome swallows them. I was able to make it look right by inserting non-breaking spaces:
var links= lbp.vrs.holder.getElementsByTagName('a');
for (var i= links.length; i-->0;) {
links[i].parentNode.insertBefore(document.createTextNode('\xA0 '), links[i]);
links[i].parentNode.insertBefore(document.createTextNode(' \xA0'), links[i].nextSibling);
}
but that's pretty ugly, should be unnecessary, and doesn't fix up other inline elements. Bad Chrome!
var keyword = links[i].innerHTML.toLowerCase();
It's unwise to rely on innerHTML to get text from an element, as the browser may escape or not-escape characters in it. Most notably &, but there's no guarantee over what characters the browser's innerHTML property will output.
As you seem to be using jQuery already, grab the content with text() instead.
var isDomain = new RegExp(document.domain, 'g');
if (isDomain.test(linkUrl)) { ...
That'll fail every second time, because global regexps remember their previous state (lastIndex): when used with methods like test, you're supposed to keep calling repeatedly until they return no match.
You don't seem to need g (multiple matches) here... but then you don't seem to need regexp here either as a simple String indexOf would be more reliable. (In a regexp, each . in the domain would match any character in the link.)
Better still, use the URL decomposition properties on Location to do a direct comparison of hostnames, rather than crude string-matching over the whole URL:
if (location.hostname===links[i].hostname) { ...
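The lastIndex behaviour described above is easy to reproduce in isolation. A short sketch with a made-up URL and pattern:

```javascript
// A global regex remembers where the previous match ended (lastIndex)
var globalRe = /example\.com/g;
var url = 'http://example.com/page';

var first = globalRe.test(url);   // matches; lastIndex now points past "example.com"
var second = globalRe.test(url);  // searches from lastIndex, finds nothing, resets to 0
var third = globalRe.test(url);   // matches again from the start

// Without the g flag every call starts at index 0
var plainRe = /example\.com/;
var plainResults = [plainRe.test(url), plainRe.test(url)];

console.log(first, second, third, plainResults);
```

Dropping the g flag, or resetting lastIndex to 0 before each test, avoids the alternating result.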
// don't match an alphanumeric char
var dontMatch =/\w/;
if(child.nodeValue.charAt(index - 1).match(dontMatch) || child.nodeValue.charAt(index+keyword.length).match(dontMatch))
break;
If you want to match words on word boundaries, and case insensitively, I think you'd be better off using a regex rather than plain substring matching. That'd also save doing four calls to findText for each keyword as it is at the moment. You can grab the inner bit (in if (child.nodeType==3) { ...) of the function in this answer and use that instead of the current string matching.
The annoying thing about making regexps from string is adding a load of backslashes to the punctuation, so you'll want a function for that:
// Backslash-escape string for literal use in a RegExp
//
function RegExp_escape(s) {
return s.replace(/([/\\^$*+?.()|[\]{}])/g, '\\$1')
};
var keywordre= new RegExp('\\b'+RegExp_escape(keyword)+'\\b', 'gi');
You could even do all the keyword replacements in one go for efficiency:
var keywords= [];
var hrefs= [];
for (var i=0; i<links.length; i++) {
...
var text= $(links[i]).text();
keywords.push('(\\b'+RegExp_escape(text)+'\\b)');
hrefs.push(links[i].href); // same index as the keyword's match group
}
var keywordre= new RegExp(keywords.join('|'), 'gi');
and then for each match in linkup, check which match group has non-zero length and link with the hrefs entry of the same number.
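To make the group-to-href bookkeeping concrete, here is a string-only sketch of the combined regex. It is not the script's real linkup function, which works on DOM text nodes: linkKeywords, escapeRegExp and the entries array are hypothetical, and the sketch returns markup as text purely so it can be tried outside a page.

```javascript
// Backslash-escape a string for literal use in a RegExp
function escapeRegExp(s) {
  return s.replace(/[\/\\^$*+?.()|[\]{}]/g, '\\$&');
}

// Hypothetical helper: entries is an array of { text, href } pairs.
// Each keyword becomes one capture group, so the index of the group that
// matched tells us which href to use.
function linkKeywords(text, entries) {
  var pattern = entries.map(function (entry) {
    return '(\\b' + escapeRegExp(entry.text) + '\\b)';
  }).join('|');
  var re = new RegExp(pattern, 'gi');
  return text.replace(re, function (match) {
    // arguments[1..n] are the capture groups; only the one that matched is defined
    for (var i = 0; i < entries.length; i++) {
      if (arguments[i + 1] !== undefined) {
        return '<a href="' + entries[i].href + '">' + match + '</a>';
      }
    }
    return match;
  });
}

var linked = linkKeywords('Buy Widgets and gadget.', [
  { text: 'widgets', href: '/w' },
  { text: 'gadget', href: '/g' }
]);
console.log(linked);
```

The word boundaries keep "gadgets" from matching the "gadget" keyword, and the i flag preserves the original capitalisation in the link text.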
I'd like to help you more, but it's hard to guess without being able to test it. I suppose you can get around it by adding space-like characters around your links, e.g. &nbsp;.
By the way, this feature of yours that adds helpful links on copying is really interesting.
I need to replace some text that is on the page within the body tag. I am using JavaScript but have jQuery available if needed. I basically need to replace test® (test with the registered trademark) with TEST®, or tests® with TESTS®, and it could even be test with TEST® or tests with TESTS®. I am able to uppercase them, but it is not working for me with the ® sign: it puts duplicates on the ones that already have it. Basically anything on the page that has the word test or tests should be TEST®, or TESTS® if it is plural. Any help is appreciated.
EDIT:
So now I have this:
var html = $('body').html();
var html = html.replace(/realtor(s)?(®)?/gi, function(m, s1, s2){
var s = s1?s1.toUpperCase():"";
var reg = s2?s2:'®';
return "REALTOR"+s+reg;
});
$('body').html(html);
It's working well, other than duplicating the ® on the ones that already had it. Any ideas on how to avoid that?
As others have already said, you will not be able to match the ® directly; you need to match on \u00ae.
The code you provided needs to be changed to:
var html = $('body').html();
var html = html.replace(/realtor(s)?(\u00ae)?/gi, function(m, s1, s2){
var s = s1?s1.toUpperCase():"";
var reg = s2?s2:'®';
return "REALTOR"+s+reg;
});
$('body').html(html);
To expand on jAndy's answer, try this:
$("div, p, span").each(function(){
o = $(this);
o.html( o.text().replace(/test(|s)\u00ae/gi, function($1){
return($1.toUpperCase());
}));
});
Using the code you provided, try this:
$(document).ready(function(){
$('body').html( $('body').html().replace(/realtor(|s)\u00ae/gi, function($1){
return($1.toUpperCase() );
}));
})
Instead of creating something from scratch try using an alternate library. I develop with PHP so using a library that has identical methods in JavaScript is a life saver.
PHP.JS Library
var newReplaced = $P.str_replace("find","replace",varSearch);
The tricky part here is to match the ®, which is a Unicode character, I guess...
Have you tried the obvious?
var newStr = str.replace(/test(s)?®?/gi, function(m, s1){
var s = s1?s1.toUpperCase():"";
return "TEST"+s+"®";
});
If the problem is that ® does not match, try with its unicode character number:
/test(s)?\u00ae/
Sorry if the rest does not work, I assume your replacement already works and you just have to also match the ® so that it does not get duplicated.
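One way to keep the ® from doubling up is to make the mark an optional part of the match and always write it back exactly once. A sketch on a plain string; normalizeMark is a hypothetical name, and treating the &reg; entity as a second spelling of the mark is an assumption about how the page's HTML might be serialized:

```javascript
// Hypothetical helper: upper-case "test"/"tests" and make sure exactly one
// registered-trademark sign follows, whether or not one was already there.
function normalizeMark(text) {
  // (s?) captures the plural, (?![a-z]) skips longer words such as "testing",
  // and the optional group swallows an existing mark in either spelling
  return text.replace(/\btest(s?)(?![a-z])(?:\u00ae|&reg;)?/gi, function (match, plural) {
    return 'TEST' + plural.toUpperCase() + '\u00ae';
  });
}

console.log(normalizeMark('test and tests\u00ae and Test&reg; and TESTS, not testing'));
```

Because an existing mark is consumed by the match, running the function twice over the same text gives the same result as running it once.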