Regex if substring exist then match another part of the string - javascript

I am using the YUI3 library and am using a filter to match and replace parts of a URL.
Because filter is not very flexible, I am only able to provide a regex expression for searching and then a string for replacing the matches:
filter: {
searchExp : "-min\\.js",
replaceStr: "-debug.js"
}
In my case, I have a URL that looks like this:
http://site.com/assets/js?yui-3.9.0/widget-base/assets/skins/sam/widget-base.css&yui-3.9.0/cssbutton/cssbutton-min.css
I would like to match /assets/js if there are .css files. If the parameters contain a CSS file, then it will always only contain CSS files.
So far, I have written a small regex to check for the presence of .css at the very end:
.*\.css$
However, now, if we have a match, I would like to return /assets/js as the match. Is this something that is doable with regex?
Personally, I would rather this be done with a simple function and a simple if/else, but due to the limitations (I can only use regex), I need to find a regex solution to this.

This is a bit hacked together, but should do the job:
var t = new RegExp( "/assets/js(([^\\.]*\\.)*[^\\.]*\\.css)$" )
document.write( "http://site.com/assets/js?yui-3.9.0/widget-base/assets/skins/sam/widget-base.css&yui-3.9.0/cssbutton/cssbutton-min.css".replace( t, "/newthing/$1" ) );
Essentially it searches for /assets/js, followed by any characters, followed by .css. If the whole thing matches it wil replace it with the new text, and include the matched pattern (from the first brackets) after it. Everything from before /assets isn't included in the match, so doesn't need to be included.
I imagine your library uses replace internally, so those strings should work. Specifically,
"/assets/js(([^\\.]*\\.)*[^\\.]*\\.css)$"
"/newthing/$1"
I'm not quite sure what you want to do with the results, but this allows you to change the folder and add suffixes (as well as check for the presence of both tokens in the first place). To add a suffix change the replacement to this:
"/assets/js$1-mysuffix"

Related

How to search/match regx in HTML content using javascript/jQuery?

I have an regex and i need to extract data from document.body.outerHTML i tried various methods like .search() , .match() & .exec() but it didn't work i need to extract data from whole document.body.outerHTML. My class is as below.
class executionPrompt {
constructor(){
this.regex = /(B[0-9]{2})/
this.data = [];
}
start(){
var html = jQuery(document.body.outerHTML);
var matches = jQuery(html).html();
console.log(matches.match(this.regex));
}
}
i have whole document and regex but i am not able to get array for my matching string. Please help me.
From this answer
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. ...
i as commented by #Wiktor Stribiżew i have change my mind and as of now i have requested to server to grabs id then get its index from tree,
FYI: to match all occurrences, a g modifier is required, /B[0-9]{2}/g may be seems an acceptable answer thumbs up for all.

Get all the WORDS except one specific word

I want to get all the words, except one, from a string using JS regex match function. For example, for a string testhello123worldtestWTF, excluding the word test, the result would be helloworldWTF.
I realize that I have to do it using look-ahead functions, but I can't figiure out how exactly. I came up with the following regex (?!test)[a-zA-Z]+(?=.*test), however, it work only partially.
http://refiddle.com/refiddles/59511c2075622d324c090000
IMHO, I would try to replace the incriminated word with an empty string, no?
Lookarounds seem to be an overkill for it, you can just replace the test with nothing:
var str = 'testhello123worldtestWTF';
var res = str.replace(/test/g, '');
Plugging this into your refiddle produces the results you're looking for:
/(test)/g
It matches all occurrences of the word "test" without picking up unwanted words/letters. You can set this to whatever variable you need to hold these.
WORDS OF CAUTION
Seeing that you have no set delimiters in your inputted string, I must say that you cannot reliably exclude a specific word - to a certain extent.
For example, if you want to exclude test, this might create a problem if the input was protester or rotatestreet. You don't have clear demarcations of what a word is, thus leading you to exclude test when you might not have meant to.
On the other hand, if you just want to ignore the string test regardless, just replace test with an empty string and you are good to go.

trouble using string.replace with regex

Given something a regex like this:
http://rubular.com/r/ai1LFT5jvK
I want to use string.replace to replace "subdir" with a string of my choosing.
Doing myStr.replace(/^.*\/\/.*\.net\/.*\/(.*)\/.*\z/,otherStr)
only returns the same string, as shown here: http://jsfiddle.net/nLmbV/
If you view the Rublar, it appears to capture what I want it to capture, but on the Fiddle, it doesn't replace it.
I'd like to know why this happens, and what I'm doing wrong. A correct regex or a correct implementation of the replace call would be nice, but most of all, I want to understand what I'm doing wrong so that I can avoid it in the future.
EDIT
I've updated the fiddle to change my regex from:
/^.*\/\/.*\.net\/.*\/(.*)\/.*\z/
to
/^.*\/\/.*\.net\/.*\/(.*)\/.*$/
And according to the fiddle, it just returns hello instead of https://xxxxxxxxxxx.cloudfront.net/dir/hello/Slide1_v2.PNG
It's that little \z in your regex.
You probably forgot to replace it with a $ sign. JavaScript uses ^ and $ as anchors, while Ruby uses \A and \z.
To answer your edit:
The match is always replaced as a whole. You'll want to group both the left side and the right side of the to-be-replaced part and reinsert it in the replacement:
url.replace(/^(.*\/\/.*\.net\/.*\/).*(\/.*)$/,"$1hello$2")
Before I get marked down, I know the question asks about regexp. The reason for this answer URLs are nearly impossible to process reliably with a regexp without writing fiendishly complex regexps. It can be done, but it makes your head hurt!
If you are doing this in a browser, you can use an A tag in your script to make things much simpler. The A tag knows how to parse them into pieces, and it lets you modify the pieces independently, so you only need to deal with the pathname:
//make a temporary a tag
var a = document.createElement('a');
//set the href property to the url you want to process
a.href = "scheme://host.domain/path/to/the/file?querystring"
//grab the path part of the url, and chop up into an array of directories
var dirs = a.pathname.split('/');
//set 2nd dir name - array is ['','path','to','file']
dirs[2]='hello';
//put the path back together
a.pathname = dirs.join('/');
a.href now contains the URL you want.
More lines, but also more hair left when you come back to change the code later.

What's wrong with this regular expression to find URLs?

I'm working on a JavaScript to extract a URL from a Google search URL, like so:
http://www.google.com/search?client=safari&rls=en&q=thisisthepartiwanttofind.org&ie=UTF-8&oe=UTF-8
Right now, my code looks like this:
var checkForURL = /[\w\d](.org)/i;
var findTheURL = checkForURL.exec(theURL);
I've ran this through a couple regex testers and it seems to work, but in practice the string I get returned looks like this:
thisisthepartiwanttofind.org,.org
So where's that trailing ,.org coming from?
I know my pattern isn't super robust but please don't suggest better patterns to use. I'd really just like advice on what in particular I did wrong with this one. Thanks!
Remove the parentheses in the regex if you do not process the .org (unlikely since it is a literal). As per #Mark comment, add a + to match one or more characters of the class [\w\d]. Also, I would escape the dot:
var checkForURL = /[\w\d]+\.org/i;
What you're actually getting is an array of 2 results, the first being the whole match, the second - the group you defined by using parens (.org).
Compare with:
/([\w\d]+)\.org/.exec('thisistheurl.org')
→ ["thisistheurl.org", "thisistheurl"]
/[\w\d]+\.org/.exec('thisistheurl.org')
→ ["thisistheurl.org"]
/([\w\d]+)(\.org)/.exec('thisistheurl.org')
→ ["thisistheurl.org", "thisistheurl", ".org"]
The result of an .exec of a JS regex is an Array of strings, the first being the whole match and the subsequent representing groups that you defined by using parens. If there are no parens in the regex, there will only be one element in this array - the whole match.
You should escape .(DOT) in (.org) regex group or it matches any character. So your regex would become:
/[\w\d]+(\.org)/
To match the url in your example you can use something like this:
https?://([0-9a-zA-Z_.?=&\-]+/?)+
or something more accurate like this (you should choose the right regex according to your needs):
^https?://([0-9a-zA-Z_\-]+\.)+(com|org|net|WhatEverYouWant)(/[0-9a-zA-Z_\-?=&.]+)$

Javascript string replace of dots (.) within filenames

I'm trying to parse and amend some html (as a string) using javascript and in this html, there are references (like img src or css backgrounds) to filenames which contain full stops/periods/dots/.
e.g.
<img src="../images/filename.01.png"> <img src="../images/filename.02.png">
<div style="background:url(../images/file.name.with.more.dots.gif)">
I've tried, struggled and failed to come up with a neat regex to allow me to parse this string and spit it back out without the dots in those filenames, e.g.
<img src="../images/filename01.png"/> <img src="../images/filename02.png"/>
<div style="background:url(../images/filenamewithmoredots.gif)">
I only want to affect the image filenames, and obviously I want to leave the filetype alone.
A regex like:
/(.*)(?=(.gif|.png|.jpg|.jpeg))
allows me to match the main part of the filename and the extension seperately, but it also matches across the whole of the string, not just within the one filename I want.
I have no control over the incoming html, I'm just consuming it.
Help me please overflowers, you're my only hope!
I agree that this is not a problem suitable for regular expression, much less one neat expression.
But I trust that you are not here to hear that. So, in case you want to keep the input as string...
var src, result = '<img src="../images/filename.01.png"> <img src="../images/filename.02.png"><div style="background:url(../images/file.name.with.more.dots.gif)">';
do {
src = result;
result = src.replace( /((?:url(\()|href=|src=)['"]?(?:[^'"\/]*\/)*[^'"\/]*)\.(?=[^\.'")]*\.(?:gif|png|jpe?g)['")>}\s])/g, '$1' );
} while (result != src)
Basically it keeps removing the second last dot of images url's filenames until there are none. Here is a breakdown of the expression in case you need to modify it. Tread lightly:
( start main capturing group since js regx has no lookbehind.
(?:url(\()|href=|src=)['"]? Start of an url. it would be safer to force url() to be properly quoted so that we can use back reference, but unfortunately your given example is not.
(?:[^'"\/]*\/)* Folder part of the url.
[^'"\/]* Part of the file name that comes before second last dot.
) close main group.
\. This is the second last dot we want to get rid of.
(?= Look behind.
[^\.'")]* Part of the file name that goes between second last dot and last dot.
\.(?:gif|png|jpe?g) Make sure the url ends in image extension.
['")>}\s] Closing the url, which can be a quote, ')', '>', '}', or spaces. Should user back reference here if possible. (Was ['"]?\b when first answered)
) End of look behind.
Consider using the DOM instead of regular expressions. One way is to create fake elements.
var fake = document.createElement('div');
fake.innerHTML = incomingHTML: // Not really part of JS standard but all the 'main' browsers support it
var background = fake.childNodes[0].style.background;
// Now use a regex if need be: /url\(\"?(.*)\"?\)/
// If img is at childNodes[1]
var url = fake.childNodes[1].src;
With jQuery this is far easier:
$(incomingHTML).find('img').each(function() { $(this).attr('src'); });
Your problem is the greedy match in .*. Maybe better try something like this
([^\/]*)(?=(.gif|.png|.jpg|.jpeg))
[^\/] is a character class that matches every character but slashes
another point is, you need to escape the . to match it literally
([^\/]*)(?=\.(gif|png|jpg|jpeg))
The problem is that . means "any character".
Escape it:
/(.*)(?=(\.gif|\.png|\.jpg|\.jpeg))

Categories