I'm trying to parse and amend some html (as a string) using javascript and in this html, there are references (like img src or css backgrounds) to filenames which contain full stops/periods/dots/.
e.g.
<img src="../images/filename.01.png"> <img src="../images/filename.02.png">
<div style="background:url(../images/file.name.with.more.dots.gif)">
I've tried, struggled and failed to come up with a neat regex to allow me to parse this string and spit it back out without the dots in those filenames, e.g.
<img src="../images/filename01.png"/> <img src="../images/filename02.png"/>
<div style="background:url(../images/filenamewithmoredots.gif)">
I only want to affect the image filenames, and obviously I want to leave the filetype alone.
A regex like:
/(.*)(?=(.gif|.png|.jpg|.jpeg))
allows me to match the main part of the filename and the extension separately, but it also matches across the whole of the string, not just within the one filename I want.
I have no control over the incoming html, I'm just consuming it.
Help me please overflowers, you're my only hope!
I agree that this is not a problem suitable for regular expressions, much less one neat expression.
But I trust that you are not here to hear that. So, in case you want to keep the input as string...
var src, result = '<img src="../images/filename.01.png"> <img src="../images/filename.02.png"><div style="background:url(../images/file.name.with.more.dots.gif)">';
do {
src = result;
result = src.replace( /((?:url\(|href=|src=)['"]?(?:[^'"\/]*\/)*[^'"\/]*)\.(?=[^\.'")]*\.(?:gif|png|jpe?g)['")>}\s])/g, '$1' );
} while (result != src);
Basically it keeps removing the second-last dot of each image URL's filename until there are none left. Here is a breakdown of the expression in case you need to modify it. Tread lightly:
( Start main capturing group, since JS regex has no lookbehind.
(?:url\(|href=|src=)['"]? Start of a URL. It would be safer to force url() to be properly quoted so that we can use a back reference, but unfortunately your given example is not.
(?:[^'"\/]*\/)* Folder part of the url.
[^'"\/]* Part of the file name that comes before second last dot.
) close main group.
\. This is the second last dot we want to get rid of.
(?= Look ahead.
[^\.'")]* Part of the file name that goes between second last dot and last dot.
\.(?:gif|png|jpe?g) Make sure the url ends in image extension.
['")>}\s] Closing the url, which can be a quote, ')', '>', '}', or spaces. Should use a back reference here if possible. (Was ['"]?\b when first answered)
) End of look ahead.
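If the loop feels heavy, a replace callback can do the same job in one pass. What follows is a minimal sketch of a different technique, not the expression above. The name stripFilenameDots is made up for illustration, and it assumes the filename ends in gif, png, jpg or jpeg and is followed by a quote, ')', '>' or whitespace. Unlike the expression above it does not check for url(, href= or src=, so it would also touch such filenames appearing in plain text.

```javascript
// Hypothetical helper: strips every dot from an image filename except the
// one in front of the extension. Folder parts are untouched because the
// name group cannot contain a slash.
function stripFilenameDots(html) {
  var imageName = /([^\s'"()\/]+)\.(gif|png|jpe?g)(?=['")>\s])/gi;
  return html.replace(imageName, function (whole, name, ext) {
    // name is everything after the last slash and before the extension
    return name.replace(/\./g, '') + '.' + ext;
  });
}

var sample = '<img src="../images/filename.01.png">' +
  '<div style="background:url(../images/file.name.with.more.dots.gif)">';
console.log(stripFilenameDots(sample));
```

The callback receives the filename and the extension as separate groups, so there is no need to re-run the replace until nothing changes.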
Consider using the DOM instead of regular expressions. One way is to create fake elements.
var fake = document.createElement('div');
fake.innerHTML = incomingHTML; // Not really part of JS standard but all the 'main' browsers support it
var background = fake.childNodes[0].style.background;
// Now use a regex if need be: /url\(\"?(.*)\"?\)/
// If img is at childNodes[1]
var url = fake.childNodes[1].src;
With jQuery this is far easier:
$(incomingHTML).find('img').each(function() { $(this).attr('src'); });
Your problem is the greedy match in .*. Maybe better try something like this
([^\/]*)(?=(.gif|.png|.jpg|.jpeg))
[^\/] is a character class that matches every character but slashes
another point is, you need to escape the . to match it literally
([^\/]*)(?=\.(gif|png|jpg|jpeg))
The problem is that . means "any character".
Escape it:
/(.*)(?=(\.gif|\.png|\.jpg|\.jpeg))
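A quick check of both points (the unescaped dot, and the slash-excluding character class) might look like the sketch below. The sample strings are made up for illustration.

```javascript
var unescaped = /.gif$/;   // "." matches any character
var escaped = /\.gif$/;    // "\." matches only a literal dot

var checks = {
  unescapedOnXgif: unescaped.test('imagexgif'), // true: "x" satisfies "."
  escapedOnXgif: escaped.test('imagexgif'),     // false: no literal dot
  escapedOnDotGif: escaped.test('image.gif')    // true
};

// [^\/]* cannot cross a slash, so the match stays inside one path segment.
var segment = /([^\/]*)(?=\.(gif|png|jpg|jpeg))/.exec('../images/filename.01.png');
var nameBeforeExtension = segment[1]; // "filename.01"

console.log(checks, nameBeforeExtension);
```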
I'm trying to make a regular expression that finds the tagnames and attributes of elements. For example, if I have this:
<div id="anId" class="aClass">
I want to be able to get an array that looks like this:
["(full match)", "div", "id", "anId", "class", "aClass"]
Currently, I have the regex /<(\S*?)(?: ?(.*?)="(.*?)")*>/, but for some reason it skips over every attribute except for the last one.
var str = '<div id="anId" class="aClass">'
console.log(str.match(/<(\S*)(?: ?(.*?)="(.*?)")*>/));
Regex101: https://regex101.com/r/G0ncwF/2
Another odd thing: if I remove the * after the non-capture group, the capture group in quotes seems to somehow "forget" that it's lazy. (Regex101: https://regex101.com/r/C0UwI8/2)
Why does this happen, and how can I avoid it? I couldn't find any questions/answers that helped me (Python re.finditer match.groups() does not contain all groups from match looked promising, but didn't seem to help me at all)
(note: I know there are better ways to get the attributes, I'm just experimenting with regex)
UPDATE:
I've figured out at least why the quantifiers seem to "forget" that they're lazy. It's actually just that the regex is trying to match all the way to the angle brackets. I suppose I must have been thinking that the non-capturing group was "insulating" everything and preventing that from happening, and I didn't see it was still lazy because there was only one angle bracket for it to find.
var str = '"foo" "bar"> "baz>"'
console.log("/\".*?\"/ produces ", str.match(/".*?"/), ", finds first quote, finds text, lazily stops at second quote");
console.log("/\".*?\">/ produces ", str.match(/".*?">/), ", finds first quote, finds text, sees second quote but doesn't see angle bracket, keeps going until it sees \">, lazily stops");
So at least that's solved. But I still don't understand why it skips over every attribute but the last one.
And note: Other regexes using different tricks to find the attributes are nice and all, but I'm mostly looking to learn why my regex skips over the attributes, so I can maybe understand regex a bit better.
Playing along with your experimentation you could do this: Instead of scanning for what you want, you can scan for what you don't want, and then filter it out:
const html = '<div id="anId" class="aClass">';
const regex = /[<> ="]/;
let result = html.split(regex).filter(Boolean);
console.log('result: '+JSON.stringify(result));
Output:
result: ["div","id","anId","class","aClass"]
Explanation:
regex /[<> ="]/ lists all chars you don't want
.split(regex) splits your text along the unwanted chars
.filter(Boolean) gets rid of the empty strings that split leaves between adjacent unwanted chars
Mind you this has flaws: for example, it will split incorrectly for html such as <div id="anId" class="aClass anotherClass">, i.e. whenever there is a space in an attribute value. To support that you could preprocess the html with another regex to escape spaces in quotes, then postprocess with another regex to restore the spaces...
Yes, an HTML parser is more reliable for this kind of task.
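On the "why" in the question: a capture group inside a repeated group only keeps what its last repetition matched, so each attribute overwrites the one before it. One way around that, sketched below, is to match the tag once and then walk the attribute list with a second, global regex. parseTag is a hypothetical helper, and it assumes a single opening tag whose attribute values are all double-quoted.

```javascript
// Hypothetical helper: returns [fullMatch, tagName, name1, value1, ...]
// or null when the input is not a simple opening tag.
function parseTag(tag) {
  // Group 1 is the tag name, group 2 the whole run of attributes.
  var m = /^<(\S+?)((?:\s+[^\s=]+="[^"]*")*)\s*>$/.exec(tag);
  if (!m) return null;
  var out = [m[0], m[1]];
  var attrRe = /([^\s=]+)="([^"]*)"/g;
  var a;
  // exec with the g flag resumes after the previous match, so every
  // attribute is visited instead of only the last one.
  while ((a = attrRe.exec(m[2])) !== null) {
    out.push(a[1], a[2]);
  }
  return out;
}

console.log(parseTag('<div id="anId" class="aClass">'));
```

Because the value group is [^"]*, a space inside an attribute value is kept as part of that value.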
Given a regex like this:
http://rubular.com/r/ai1LFT5jvK
I want to use string.replace to replace "subdir" with a string of my choosing.
Doing myStr.replace(/^.*\/\/.*\.net\/.*\/(.*)\/.*\z/,otherStr)
only returns the same string, as shown here: http://jsfiddle.net/nLmbV/
If you view the Rubular, it appears to capture what I want it to capture, but on the Fiddle, it doesn't replace it.
I'd like to know why this happens, and what I'm doing wrong. A correct regex or a correct implementation of the replace call would be nice, but most of all, I want to understand what I'm doing wrong so that I can avoid it in the future.
EDIT
I've updated the fiddle to change my regex from:
/^.*\/\/.*\.net\/.*\/(.*)\/.*\z/
to
/^.*\/\/.*\.net\/.*\/(.*)\/.*$/
And according to the fiddle, it just returns hello instead of https://xxxxxxxxxxx.cloudfront.net/dir/hello/Slide1_v2.PNG
It's that little \z in your regex.
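You probably forgot to replace it with a $ sign. JavaScript uses ^ and $ as anchors, while Ruby uses \A and \z. In JavaScript \z is not an anchor at all: it just matches a literal z, so the pattern fails unless the URL happens to have a z at that point.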
To answer your edit:
The match is always replaced as a whole. You'll want to group both the left side and the right side of the to-be-replaced part and reinsert it in the replacement:
url.replace(/^(.*\/\/.*\.net\/.*\/).*(\/.*)$/,"$1hello$2")
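Wrapped in a function, that could look like the sketch below. replaceSubdir is a made-up name, and the input URL is an assumption reconstructed from the output shown in the question. A callback is used instead of a "$1...$2" string so that a replacement containing $ is inserted literally.

```javascript
// Hypothetical helper: swaps the last directory of a *.net URL.
function replaceSubdir(url, replacement) {
  // Group 1: everything up to and including the slash before the target.
  // Group 2: the slash after the target plus the file name.
  var pattern = /^(.*\/\/.*\.net\/.*\/).*(\/.*)$/;
  return url.replace(pattern, function (whole, left, right) {
    return left + replacement + right;
  });
}

var replaced = replaceSubdir(
  'https://xxxxxxxxxxx.cloudfront.net/dir/subdir/Slide1_v2.PNG', // assumed input
  'hello'
);
console.log(replaced);
```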
Before I get marked down, I know the question asks about regexp. The reason for this answer is that URLs are nearly impossible to process reliably with a regexp without writing fiendishly complex regexps. It can be done, but it makes your head hurt!
If you are doing this in a browser, you can use an A tag in your script to make things much simpler. The A tag knows how to parse them into pieces, and it lets you modify the pieces independently, so you only need to deal with the pathname:
//make a temporary a tag
var a = document.createElement('a');
//set the href property to the url you want to process
a.href = "scheme://host.domain/path/to/the/file?querystring";
//grab the path part of the url, and chop up into an array of directories
var dirs = a.pathname.split('/');
//set 2nd dir name - array is ['','path','to','the','file']
dirs[2]='hello';
//put the path back together
a.pathname = dirs.join('/');
a.href now contains the URL you want.
More lines, but also more hair left when you come back to change the code later.
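Outside a browser, or where no document is available, the same idea can be sketched with the standard URL constructor instead of an A tag. This is a swapped-in technique, not the approach above; it assumes an environment that provides URL (modern browsers and Node), and both replaceDir and the sample address are made up for illustration.

```javascript
// Hypothetical helper: replaces one path segment by position.
function replaceDir(href, index, newDir) {
  var url = new URL(href);
  // pathname starts with "/", so element 0 of the split is ''.
  var dirs = url.pathname.split('/');
  dirs[index] = newDir;
  url.pathname = dirs.join('/');
  return url.href;
}

var rebuilt = replaceDir(
  'https://xxxxxxxxxxx.cloudfront.net/dir/subdir/Slide1_v2.PNG', // assumed input
  2,
  'hello'
);
console.log(rebuilt);
```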
I am using the YUI3 library and am using a filter to match and replace parts of a URL.
Because filter is not very flexible, I am only able to provide a regular expression for searching and then a string for replacing the matches:
filter: {
searchExp : "-min\\.js",
replaceStr: "-debug.js"
}
In my case, I have a URL that looks like this:
http://site.com/assets/js?yui-3.9.0/widget-base/assets/skins/sam/widget-base.css&yui-3.9.0/cssbutton/cssbutton-min.css
I would like to match /assets/js if there are .css files. If the parameters contain a CSS file, then it will always only contain CSS files.
So far, I have written a small regex to check for the presence of .css at the very end:
.*\.css$
However, now, if we have a match, I would like to return /assets/js as the match. Is this something that is doable with regex?
Personally, I would rather this be done with a simple function and a simple if/else, but due to the limitations (I can only use regex), I need to find a regex solution to this.
This is a bit hacked together, but should do the job:
var t = new RegExp( "/assets/js(([^\\.]*\\.)*[^\\.]*\\.css)$" )
document.write( "http://site.com/assets/js?yui-3.9.0/widget-base/assets/skins/sam/widget-base.css&yui-3.9.0/cssbutton/cssbutton-min.css".replace( t, "/newthing/$1" ) );
Essentially it searches for /assets/js, followed by any characters, followed by .css at the end of the string. If the whole thing matches, it will replace it with the new text and include the captured part (from the first brackets) after it. Everything before /assets is outside the match, so it is left untouched and doesn't need to be re-inserted.
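To try the two strings without the library, here is a rough stand-in. It assumes the filter boils down to compiling searchExp into a RegExp and calling String.replace with replaceStr. That is an assumption about the library, and applyFilter is a hypothetical function, not part of its API. The .js URL is made up to show the non-matching case.

```javascript
// Hypothetical stand-in for the library's filter step (an assumption).
function applyFilter(url, filter) {
  return url.replace(new RegExp(filter.searchExp, 'g'), filter.replaceStr);
}

var cssFilter = {
  searchExp: "/assets/js(([^\\.]*\\.)*[^\\.]*\\.css)$",
  replaceStr: "/newthing/$1"
};

var cssUrl = 'http://site.com/assets/js?yui-3.9.0/widget-base/assets/skins/sam/widget-base.css&yui-3.9.0/cssbutton/cssbutton-min.css';
var jsUrl = 'http://site.com/assets/js?yui-3.9.0/widget-base/widget-base-min.js';

var filteredCss = applyFilter(cssUrl, cssFilter); // path prefix replaced
var filteredJs = applyFilter(jsUrl, cssFilter);   // no .css at the end: unchanged
console.log(filteredCss);
console.log(filteredJs);
```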
I imagine your library uses replace internally, so those strings should work. Specifically,
"/assets/js(([^\\.]*\\.)*[^\\.]*\\.css)$"
"/newthing/$1"
I'm not quite sure what you want to do with the results, but this allows you to change the folder and add suffixes (as well as check for the presence of both tokens in the first place). To add a suffix change the replacement to this:
"/assets/js$1-mysuffix"
I can't post the exact data i'm trying to extract but here's a basic scenario with the same outcome. I'm grabbing the body of a page and trying to extract a bit.ly link from it. So let's say for example, this is the chunk of data where i'm trying to grab the link from.
String:
http://bit.ly/Pq8AkS</div><div class="shareUnit"><div class="-cx-PRIVATE-fbTimelineExternalShareUnit__wrapper"><div><div class="-cx-PRIVATE-fbTimelineExternalShareUnit__root -cx-PRIVATE-fbTimelineExternalShareUnit__hasImage"><a class="-cx-PRIVATE-fbTimelineExternalShareUnit__video -cx-PRIVATE-fbTimelineExternalShareUnit__image -cx-PRIVATE-fbTimelineExternalShareUnit__content" ajaxify="/ajax/flash/expand_inline.php?target_div=uikk85_59&share_id=271663136271285&max_width=403&max_height=403&context=timelineSingle" rel="async" href="#" onclick="CSS.addClass(this, "-cx-PRIVATE-fbTimelineExternalShareUnit__loading");CSS.removeClass(this, "-cx-PRIVATE-fbTimelineExternalShareUnit__video");"><i class="-cx-PRIVATE-fbTimelineExternalShareUnit__play"></i><img class="img" src="http://external.ak.fbcdn.net/safe_image.php?d=AQDoyY7_wjAyUtX2&w=155&h=114&url=http%3A%2F%2Fi1.ytimg.com%2Fvi%2FDre21lBu2zU%2Fmqdefault.jpg" alt="" /></a>
Now, I can get what i'm looking for with the following code but the link isn't always going to be exactly 6 characters long. So this causes an issue...
Body = document.getElementsByTagName("body")[0].innerHTML;
regex = /2Fbit.ly%2F(.{6})&h/g;
Matches = regex.exec(Body);
Here's what I was originally trying but the problem I have is that it grabs too much data. It's going all the way to the last "&h" in the string above instead of stopping at the first one it hits.
Body = document.getElementsByTagName("body")[0].innerHTML;
regex = /2Fbit.ly%2F(.*)&h/g;
Matches = regex.exec(Body);
So basically the main part of the string i'm trying to focus on is "%2Fbit.ly%2FPq8AkS&h" so that I can get the "Pq8AkS" out of it. When I use the (.*) it's grabbing everything between "%2F" and the very last "&h" in the large string above.
You should not be using a regex on HTML. Use DOM functions to get the desired link object, then get the href attribute from that, then you can use a regex on just the href.
By default .* is greedy meaning that it matches the most it can match and still find a match. If you want it to be non-greedy (match the least possible), you can use this .*? instead like this:
regex = /2Fbit.ly%2F(.*?)&h/;
I also don't think you want the g flag on the regex as there should only be one match in the right URL.
If you show the rest of your HTML, we could offer advice on finding the right link object rather than trying to match the entire body HTML.
FYI, another trick for a non-greedy match is to do something like this:
regex = /2Fbit.ly%2F([^&]*)&h/;
Which matches a series of characters that are not & followed by &h which accomplishes the same goal as long as & can't be in the matched sequence.
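Side by side, the three variants behave as in the sketch below. The body string is a shortened, made-up stand-in for the real page, and the dot in bit.ly is escaped here so that it only matches a literal dot.

```javascript
// Shortened, hypothetical stand-in for the page body; it contains two "&h".
var body = 'url=http%3A%2F%2Fbit.ly%2FPq8AkS&h=114&url=x&h=2';

var greedy = /2Fbit\.ly%2F(.*)&h/.exec(body)[1];      // runs to the last "&h"
var lazy = /2Fbit\.ly%2F(.*?)&h/.exec(body)[1];       // stops at the first "&h"
var negated = /2Fbit\.ly%2F([^&]*)&h/.exec(body)[1];  // cannot cross an "&"

console.log(greedy);  // Pq8AkS&h=114&url=x
console.log(lazy);    // Pq8AkS
console.log(negated); // Pq8AkS
```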
By default + and * are greedy and match as much as possible. You need a non-greedy match for your (.*). A quick search gives the solution as
? directly following a quantifier makes the quantifier non-greedy (makes it match minimum instead of maximum of the interval defined).
So try changing your regex= line to
regex = /2Fbit.ly%2F(.*?)&h/g;
Edit: @jfriend00's answer below is more complete.
ok i do have this following data in my div
<div id="mydiv">
<!--
what is your present
<code>alert("this is my present");</code>
where?
<code>alert("here at my left hand");</code>
oh thank you! i love you!! hehe
<code>alert("welcome my honey ^^");</code>
-->
</div>
What I need to do is get all the scripts inside the <code> blocks, and the text nodes around them, without removing the HTML comment inside the div. It's homework given by my professor and I can't modify that div block.
I need to use regular expressions for this and this is what i did
var block = $.trim($("div#mydiv").html()).replace("<!--","").replace("-->","");
var htmlRegex = new RegExp(""); //I don't know what to do here
var codeRegex = new RegExp("^<code(*n)</code>$","igm");
var code = codeRegex.exec(block);
var html = "";
it really doesn't work... please don't give the exact answer.. please teach me.. thank you
I need to have the following blocks for the variable code
alert("this is my present");
alert("here at my left hand");
alert("welcome my honey ^^");
and this is the blocks i need for variable html
what is your present
where?
oh thank you! i love you!! hehe
my question is what is the regex pattern to get the results above?
Parsing HTML with a regular expression is not something you should do.
I'm sure your professor thinks he/she was really clever and that there's no way to access the DOM API and can wave a banner around and justify some minor corner-case for using regex to parse the DOM and that sometimes it's okay.
Well, no, it isn't. If you have complex code in there, what happens? Your regex breaks, and perhaps becomes a security exploit if this is ever in production.
So, here:
http://jsfiddle.net/zfp6D/
Walk the dom, get the nodeType 8 (comment) text value out of the node.
Invoke the HTML parser (that thing that browsers use to parse HTML, rather than regex, why you wouldn't use the HTML parser to parse HTML is totally beyond me, it's like saying "Yeah, I could nail in this nail with a hammer, but I think I'm going to just stomp on the nail with my foot until it goes in").
Find all the CODE elements in the newly parsed HTML.
Log them to console, or whatever you want to do with them.
First of all, you should be aware that because HTML is not a regular language, you cannot do generic parsing using regular expressions that will work for all valid inputs (generic nesting in particular cannot be expressed with regular expressions). Many parsers do use regular expressions to match individual tokens, but other algorithms need to be built around them.
However, for a fixed input such as this, it's just a case of working through the structure you have (though it's still often easier to use different parsing methods than just regular expressions).
First let's get all the code:
var code = '', match = [];
var regex = new RegExp("<code>(.*?)</code>", "g");
while (match = regex.exec(content)) {
code += match[1] + "\n";
}
I assume content contains the content of the div that you've already extracted. Here the "g" flag says this is for "global" matching, so we can reuse the regex to find every match. The brackets indicate a capturing group, . means any character, * means repeated 0 or more times, and ? means "non-greedy" (see what happens without it to see what it does).
Now we can do a similar thing to get all the other bits, but this time the regex is slightly more complicated:
new RegExp("(<!--|</code>)(.*?)(-->|<code>)", "g")
Here | means "or". So this matches all the bits that start with either "start comment" or "end code" and end with "end comment" or "start code". Note also that we now have 3 sets of brackets, so the part we want to extract is match[2] (the second set).
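One caveat when trying this: without the s flag, . does not match a line break, so .*? cannot cross from one line of the div to the next. The sketch below puts both regexes together with [\s\S] in place of the dot. It assumes content still has the line breaks shown in the question; extractParts is a made-up name.

```javascript
// The div content from the question, assumed to keep its line breaks.
var content = [
  '<!--',
  'what is your present',
  '<code>alert("this is my present");</code>',
  'where?',
  '<code>alert("here at my left hand");</code>',
  'oh thank you! i love you!! hehe',
  '<code>alert("welcome my honey ^^");</code>',
  '-->'
].join('\n');

// Hypothetical helper: collects the code blocks and the text between them.
function extractParts(content) {
  var code = [], html = [], match;
  var codeRe = /<code>([\s\S]*?)<\/code>/g;
  while ((match = codeRe.exec(content)) !== null) {
    code.push(match[1]);
  }
  // [\s\S] matches any character including newlines, unlike "."
  var textRe = /(<!--|<\/code>)([\s\S]*?)(-->|<code>)/g;
  while ((match = textRe.exec(content)) !== null) {
    var text = match[2].trim();
    if (text) html.push(text); // skip whitespace-only gaps
  }
  return { code: code, html: html };
}

var parts = extractParts(content);
console.log(parts.code.join('\n'));
console.log(parts.html.join('\n'));
```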
You're doing a lot of unnecessary stuff. .html() gives you the inner contents as a string. You should be able to use regEx to grab exactly what you need from there. Also, try to stick with regEx literals (e.g. /^regexstring$/). You have to escape escape characters using new RegExp which gets really messy. You generally only want to use new RegExp when you need to put a string var into a regEx.
The match function of strings accepts regEx and returns a collection of every match when you add the global flag (e.g. /^regexstring$/g <-- note the 'g'). I would do something like this:
var block = $('#mydiv').html(), //you can set multiple vars in one statement w/commas
matches = block.match(/<code>[^<]*<\/code>/g);
//[^<]* <-- 0 or more characters that aren't '<' - google 'negative character class'
matches = matches.join('_') //lazy way of avoiding a loop - join into a string with a safe character
.replace(/<\/*code>/g,'') //\/* 0 or more forward slashes
.split('_');//return the matches string back to array
//Now do what you want with matches. Eval (ew) or append in a script tag (ew).
//You have no control over the 'ew'. I just prefer data to scripts in strings