Getting all occurrences in a string with JavaScript Regex - javascript

First of all, I am not an expert on JavaScript; in fact, I am a newbie.
I know PHP, which has functions to get the occurrences of a regex pattern: preg_match() and preg_match_all().
On the internet I found many resources that show how to get all occurrences in a string. But when I do several regex matches, the code looks ugly to me.
This is what I found on the internet:
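var fileList = []
var matches
// Keep each regex in a variable: a literal inside the loop condition is re-created
// on every pass, so lastIndex resets to 0 and the loop never terminates.
var itemRegex = /<item id="(.*?)" href="(.*?)" media-type="(?:.*?)"\/>/g
while ((matches = itemRegex.exec(data)) !== null) {
    fileList.push({id: matches[1], file: matches[2]})
}
var fileOrder = []
var itemrefRegex = /<itemref idref="(.*?)"\/>/g
while ((matches = itemrefRegex.exec(data)) !== null) {
    fileOrder.push({id: matches[1]})
}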
Is there a more elegant way other than this code?

Using regexes on HTML is generally held to be a bad idea, because regexes lack the power to reliably match arbitrarily nested structures (the a^n b^n problem), such as balanced parens or HTML/XML open/close tags. It's also trivially easy to get data out of the DOM in JavaScript without treating it like a string; that's what the DOM is for. For example:
let mapOfIDsToFiles = Array.from(document.querySelectorAll('item'))
    .reduce((obj, item) => {
        // item is not a standard HTML element, so read href as an attribute
        obj[item.id] = item.getAttribute('href');
        return obj;
    }, {});
This has the added advantage of being much faster, simpler, and more robust. DOM access is slow, but you'll be accessing the DOM anyway to get the HTML you run your regexes over.
Modifying built-in prototypes like String.prototype is generally held to be a bad idea, because it can cause random breakages with third-party code that defines the same function but differently, or if the JavaScript standard gets updated to include that function but it works differently.
UPDATE
If the data is already a string, you can easily turn it into a DOM element without affecting the page:
let elem = document.createElement('div');
elem.innerHTML = data;
elem.querySelectorAll('item'); // gives you all the item elements
As long as you don't append it to the document, it's just a JavaScript object in memory.
UPDATE 2
Yes, this also works for XML but converting it to DOM is slightly more complicated:
// define the function differently if IE, both do the same thing
let parseXML = (typeof window.DOMParser !== 'undefined' && typeof window.XMLDocument !== 'undefined') ?
xml => ( new window.DOMParser() ).parseFromString(xml, 'text/xml') :
xml => {
let xmlDoc = new window.ActiveXObject('Microsoft.XMLDOM');
xmlDoc.async = "false";
xmlDoc.loadXML(xml);
return xmlDoc;
};
let xmlDoc = parseXML(data).documentElement;
let items = Array.from(xmlDoc.querySelectorAll('item'));
Note that if the parse fails (i.e. your document was malformed) then you will need to check for the error document like so:
// check for error document
(() => {
let firstTag = xmlDoc.firstChild.firstChild;
if (firstTag && firstTag.tagName === 'parsererror') {
let message = firstTag.children[1].textContent;
throw new Error(message);
}
})();

I came up with the idea of creating a method on String.
I wrote a String.prototype method that simplifies things for me:
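String.prototype.getMatches = function(regex, callback) {
    // regex must carry the g flag; without it exec() always restarts at index 0
    // and this loop never ends once there is a match
    var matches = []
    var match
    while ((match = regex.exec(this)) !== null) {
        if (callback)
            matches.push(callback(match))
        else
            matches.push(match)
    }
    return matches
}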
Now I can get all matches in a more elegant way. It also resembles PHP's preg_match_all() function.
var fileList = data.getMatches(/<item id="(.*?)" href="(.*?)" media-type="(?:.*?)"\/>/g, function(matches) {
return {id: matches[1], file: matches[2]}
})
var fileOrder = data.getMatches(/<itemref idref="(.*?)"\/>/g, function(matches) {
return matches[1]
})
I hope this helps you too.
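The same idea can be sketched without touching String.prototype, which sidesteps the prototype concern raised in the other answer. This is only an illustration, assuming an environment that supports String.prototype.matchAll; the function name getMatches and its parameters are hypothetical, not part of any library.

```javascript
// Collect every match of `pattern` in `text`, optionally mapping each match.
// matchAll() requires a global regex, so the g flag is added when missing.
function getMatches(text, pattern, mapper = m => m) {
  const flags = pattern.flags.includes('g') ? pattern.flags : pattern.flags + 'g';
  const globalPattern = new RegExp(pattern.source, flags);
  return Array.from(text.matchAll(globalPattern), m => mapper(m));
}

// Usage with made-up sample data in the shape the question describes
const sample = '<item id="a" href="a.xhtml" media-type="x"/><itemref idref="a"/>';
const sampleFiles = getMatches(sample, /<item id="(.*?)" href="(.*?)" media-type="(?:.*?)"\/>/g,
  m => ({ id: m[1], file: m[2] }));
const sampleOrder = getMatches(sample, /<itemref idref="(.*?)"\/>/, m => m[1]);
```

Because the function copies the regex before use, the caller's regex keeps its own lastIndex, and a forgotten g flag cannot cause an endless loop.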

Related

How to find JSON string in HTML file

I'm trying to find plaintext JSON within a webpage, using JavaScript. The JSON appears as plain text in the browser, but it may be split across separate HTML tags. Example:
<div>
{"kty":"RSA","e":"AQAB","n":"mZT_XuM9Lwn0j7O_YNWN_f7S_J6sLxcQuWsRVBlAM3_5S5aD0yWGV78B-Gti2MrqWwuAhb_6SkBlOvEF8-UCHR_rgZhVR1qbrxvQLE_zpamGJbFU_c1Vm8hEAvMt9ZltEGFS22BHBW079ebWI3PoDdS-DJvjjtszFdnkIZpn4oav9fzz0
</div>
<div>
xIaaxp6-qQFjKXCboun5pto59eJnn-bJl1D3LloCw7rSEYQr1x5mxhIxAFVVsNGuE9fjk0ueTDcMUbFLPYn6PopDMuN0T1B2D1Y8ClItEVbVDFb-mRPz8THJ_gexJ8C20n8m-pBlpL4WyyPuY2ScDugmfG7UnBGrDmS5w"}
</div>
I've tried to use this RegEx.
{"?\w+"?:[^}<]+(?:(?:(?:<\/[^>]+>)[^}<]*(?:<[^>]+>)+)*[^}<]*)*}
But the problem is it fails to work with nested JSON.
I may also use javascript to count the number of { and } to find where the JSON actually ends, but there must be better options than using this slow and clumsy approach.
Many thanks
Update:
Perhaps there isn't a better way to do this. Below is my current code (a bit verbose but probably needed):
let match = /{\s*"\w+"\s*:/g;
// Consider both open and close curly brackets
let brackets = /[{}]/g;
let arr0, arr;
// Try to parse every matching JSON
arr0 = match.exec(body);
if (arr0 === null) { // Nothing found
return new Promise(resolve => resolve());
}
try {
brackets.lastIndex = match.lastIndex; // After beginning of current JSON
let count = 1;
// Count for { and } to find the end of JSON.
while ((count !== 0) && ((arr = brackets.exec(body)) !== null)) {
count += (arr[0] === "{" ? 1 : -1);
}
// If nothing special, complete JSON found when count === 0;
let lastIdx = brackets.lastIndex;
let json = body.substring(match.lastIndex - arr0[0].length, lastIdx);
try {
let parsed = JSON.parse(json);
// Process the JSON here to get the original message
} catch (error) {
console.log(error);
}
...
} catch(err) {
console.log(err);
};
That's not possible in a good way. It might be possible to take a parent element's innerText and parse that:
console.log(JSON.parse(document.getElementById('outer').innerText.replace(/\s/g, '')));
<div id="outer">
<div>
{"kty":"RSA","e":"AQAB","n":"mZT_XuM9Lwn0j7O_YNWN_f7S_J6sLxcQuWsRVBlAM3_5S5aD0yWGV78B-Gti2MrqWwuAhb_6SkBlOvEF8-UCHR_rgZhVR1qbrxvQLE_zpamGJbFU_c1Vm8hEAvMt9ZltEGFS22BHBW079ebWI3PoDdS-DJvjjtszFdnkIZpn4oav9fzz0
</div>
<div>
xIaaxp6-qQFjKXCboun5pto59eJnn-bJl1D3LloCw7rSEYQr1x5mxhIxAFVVsNGuE9fjk0ueTDcMUbFLPYn6PopDMuN0T1B2D1Y8ClItEVbVDFb-mRPz8THJ_gexJ8C20n8m-pBlpL4WyyPuY2ScDugmfG7UnBGrDmS5w"}
</div>
</div>
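But it's likely to fail sometimes, for example when a string value legitimately contains whitespace, since the replace strips that too.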
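The brace-counting idea from the question can be sketched as a small scanner that also skips braces inside string literals, which plain counting gets wrong. This is a rough sketch, assuming the page text has already been pulled out as one string (for instance from textContent); findJsonObjects is a hypothetical name, and the function does not repair JSON whose string values were broken up by line breaks between tags.

```javascript
// Return every balanced {...} span in `text` that JSON.parse accepts.
function findJsonObjects(text) {
  const results = [];
  for (let start = text.indexOf('{'); start !== -1; start = text.indexOf('{', start + 1)) {
    let depth = 0, inString = false, escaped = false;
    for (let i = start; i < text.length; i++) {
      const ch = text[i];
      if (inString) {
        // Inside a string literal: only track escapes and the closing quote
        if (escaped) escaped = false;
        else if (ch === '\\') escaped = true;
        else if (ch === '"') inString = false;
        continue;
      }
      if (ch === '"') inString = true;
      else if (ch === '{') depth++;
      else if (ch === '}' && --depth === 0) {
        try {
          results.push(JSON.parse(text.slice(start, i + 1)));
          start = i; // resume scanning after this object
        } catch (e) {
          // Not valid JSON; the outer loop moves on to the next "{"
        }
        break;
      }
    }
  }
  return results;
}
```

Nested objects are returned once, as part of their outermost parent, because scanning resumes after the closing brace of each successfully parsed object.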

Parse through a string to create an array of substrings

I am building a mini search engine on my website that can search for words and has filters.
I need to be able to take a long string, and split it up into an array of smaller substrings. The words (with no filter) should go in one string, and then each filter should go in a separate string. The order of words and filters should not matter.
For example:
If my string is:
"hello before: 01/01/17 after: 01/01/2015"
OR:
"before: 01/01/17 hello after: 01/01/2015"
I would expect my function to return (in any order):
["hello", "before: 01/01/2017", "after: 01/01/2015"]
You could use whitespace and a positive lookahead for splitting.
console.log("hello before: 01/01/17 after: 01/01/2015".split(/\s*(?=before|after)/));
Are there any specific limitations on code size? I mean, this isn't code-golf or anything, so why not just do it the straightforward way?
First, you can tokenize this with a simple regular expression:
var search_string = "hello before: 01/01/17 after: 01/01/2015";
var regex = /(?:(before|after):\s*)?([^ ]+)/g; // "+" rather than "*": an empty match would never advance exec()
var token = null;
while ((token = regex.exec(search_string)) != null) {
Then you can arrange the tokens into any data structure you want. For example, we can put the filters into a separate object, like so:
var filters = {};
var words = [];
//...
if (token[1])
filters[token[1]] = token[2];
else
words.push(token[2]);
After that, you can manipulate these structures any way you want
words.sort();
if (filters['before']) words.push(filters['before']);
if (filters['after']) words.push(filters['after']);
return words;
I'm not sure why you'd want it arranged this way, but this would make things uniform. Alternately, you can use them in a more straightforward way:
var before = Date.parse(filters['before'] || '') || false;
if (before !== false) before = new Date(before);
var after = Date.parse(filters['after'] || '') || false;
if (after !== false) after = new Date(after);
function isDocumentMatchSearch(doc) {
if (before !== false && doc.date > before) return false;
if (after !== false && doc.date < after) return false;
for (var i = 0; i < words.length; i++) {
if (doc.title.indexOf(words[i]) < 0 && doc.text.indexOf(words[i]) < 0) return false;
}
return true;
}
Since you didn't give a lot of information on what you're searching through, what data types or storage type it's stored in, etc etc, that's the best I can offer.
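The tokenizing fragments above can be assembled into one function, roughly as follows. This is a sketch under the assumption that the only filters are before and after; parseSearch is a hypothetical name, and the dates are kept as plain strings rather than parsed.

```javascript
// Split a search string into free words and "before:"/"after:" filters.
function parseSearch(query) {
  // An optional "before:" or "after:" prefix, followed by one run of non-space text
  const tokenPattern = /(?:(before|after):\s*)?(\S+)/g;
  const filters = {};
  const words = [];
  for (const token of query.matchAll(tokenPattern)) {
    if (token[1]) filters[token[1]] = token[2];
    else words.push(token[2]);
  }
  return { words, filters };
}

const parsed = parseSearch("before: 01/01/17 hello after: 01/01/2015");
// parsed.words holds the free words; parsed.filters holds the filter values by name
```

Returning an object keeps the words and the filters apart, so the order in which they appear in the query does not matter.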

Strip unwanted tag in a string (no JQuery)

I have a string that contains the following:
<span>A</span>BC<span id="blabla">D</span>EF
I want to be able to use the JavaScript replace function with a regular expression to remove only the spans that do not have an id, so that the result would look like this:
ABC<span id="blabla">D</span>EF
I am really not interested in using jQuery. I would rather use pure JavaScript to solve the problem. I have the following but it does not seem to properly work
myText.replace(/(<([^>]+)>)/ig,"");
Any help would be appreciated!
Use the DOM, not a regexp.
var input = '<span>A</span>BC<span id="blabla">D</span>EF',
output,
tempElt = document.createElement('span');
tempElt.innerHTML = input;
// http://www.quirksmode.org/dom/w3c_html.html#t03
if (tempElt.innerText) output = tempElt.innerText;
else output = tempElt.textContent;
console.log(output); // "ABCDEF"
Demo: http://jsfiddle.net/mattball/Ctrkf/
"It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail."
Something like this will do the job, but it doesn't use a regular expression (but it doesn't use jQuery either, so one out of two ain't bad).
var s = '<span>A</span>BC<span id="blabla">D</span>EF';
function removeSpans(s) {
var a = document.createElement('div');
var b = a.cloneNode(true);
a.innerHTML = s;
var node;
while (a.firstChild) {
node = a.removeChild(a.firstChild);
if (node.tagName &&
node.tagName.toLowerCase() == 'span' &&
node.id == '') {
b.appendChild(document.createTextNode(node.textContent || node.innerText));
} else {
b.appendChild(node);
}
}
return b.innerHTML;
}
alert(removeSpans(s));
It's not particularly robust (actually it works better than I thought it would) and will likely fail in cases slightly different to the test case. But it shows the strategy.
Here's another version, pretty similar though:
function removeSpans2(s) {
var a = document.createElement('div');
a.innerHTML = s;
var node, next = a.firstChild;
while (node = next) {
next = next.nextSibling
if (node.tagName && node.tagName.toLowerCase() == 'span' && !node.id) {
a.replaceChild(document.createTextNode(node.textContent || node.innerText), node);
}
}
return a.innerHTML;
}

Regular expression to get class name with specific substring

I need a regular expression in javascript that will get a string with a specific substring from a list of space delimited strings.
For example, I have:
widget util cookie i18n-username
I want to be able to return only i18n-username.
How can I do this?
You could use the following function, which uses a regex to match your string surrounded by either a space or the beginning or end of the line. If you plan to use regular expression special characters in the search argument, be careful to double-escape backslashes (as in "i18n-\\w+" below), since the argument is a string passed to the RegExp constructor rather than a RegExp literal:
var hasClass = function(s, klass) {
var r = new RegExp("(?:^| )(" + klass + ")(?: |$)")
, m = (""+s).match(r);
return (m) ? m[1] : null;
};
hasClass("a b c", "a"); // => "a"
hasClass("a b c", "b"); // => "b"
hasClass("a b c", "x"); // => null
var klasses = "widget util cookie i18n-username";
hasClass(klasses, "username"); // => null
hasClass(klasses, "i18n-username"); // => "i18n-username"
hasClass(klasses, "i18n-\\w+"); // => "i18n-username"
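If the search argument should instead be matched literally, its special characters need escaping before it is spliced into the pattern. A minimal sketch of that variant follows; escapeRegExp and findClass are hypothetical helper names, not built-ins.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Return `name` if it appears as a whole whitespace-delimited entry, else null.
function findClass(classList, name) {
  const pattern = new RegExp('(?:^|\\s)(' + escapeRegExp(name) + ')(?=\\s|$)');
  const m = String(classList).match(pattern);
  return m ? m[1] : null;
}
```

With this variant a name such as "a.b" only matches the entry "a.b" and not "axb", at the cost of no longer accepting patterns like "i18n-\\w+".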
As others have pointed out, you could also simply use a "split" and "indexOf":
var hasClass = function(s, klass) {
return (""+s).split(" ").indexOf(klass) >= 0;
};
However, note that Array.prototype.indexOf was only added in ECMAScript 5, so for very old browsers (such as IE8 and earlier) you might have to implement it yourself.
var hasClass = function(s, klass) {
var a=(""+s).split(" "), len=a.length, i;
for (i=0; i<len; i++) {
if (a[i] == klass) return true;
}
return false;
};
[Edit]
Note that the split/indexOf solution is likely faster for most browsers (though not all). This jsPerf benchmark shows which solution is faster for various browsers - notably, Chrome must have a really good regular expression engine!
function getString(subString, string){
return (string.match(new RegExp("\\S*" + subString + "\\S*")) || [null])[0];
}
To Use:
var str = "widget util cookie i18n-username";
getString("user", str); //returns i18n-username
Does this need to be a regex? Would knowing if the string existed be sufficient? For a simple check like this, plain string methods are usually simpler and often faster than a regular expression:
var settings = 'widget util cookie i18n-username',
// using an array in case searching the string is insufficient
features = settings.split(' ');
if (features.indexOf('i18n-username') !== -1) {
// do something based on having this feature
}
If whitespace wouldn't cause an issue in searching for a value, you could just search the string directly:
var settings = 'widget util cookie i18n-username';
if (settings.indexOf('i18n-username') !== -1) {
// do something based on having this value
}
It then becomes easy to make this into a reusable function:
(function() {
var App = {},
features = 'widget util cookie i18n-username';
App.hasFeature = function(feature) {
return features.indexOf(feature) !== -1;
// or, if you prefer the array (this matches whole names only):
// return features.split(' ').indexOf(feature) !== -1;
};
window.App = App;
})();
// Here's how you can use it:
App.hasFeature('i18n-username'); // returns true
EDIT
You now say you need to return all strings that start with another string, and it is possible to do this with a regular expression as well, although I am unsure about how efficient it is:
(function() {
var App = {},
features = 'widget util cookie i18n-username'.split(' ');
// This contains an array of all features starting with 'i18n'
App.i18nFeatures = features.filter(function(value) {
return value.indexOf('i18n') === 0;
});
window.App = App;
})();
/i18n-\w+/ ought to work. If other substrings in your string can also start with i18n-, or your user names contain characters outside the class [a-zA-Z0-9_], you'll need to adjust the pattern accordingly.
var str = "widget util cookie i18n-username";
alert(str.match(/i18n-\w+/));
Edit:
If you need to match more than one string, you can add on the global flag (/g) and loop through the matches.
var str = "widget i18n-util cookie i18n-username";
var matches = str.match(/i18n-\w+/g);
if (matches) {
for (var i = 0; i < matches.length; i++)
alert(matches[i]);
}
else
alert("phooey, no matches");

Help refactor a small piece of Javascript code which identifies user's referrer source

I've written the following small piece of JavaScript (based on the excellent parseUri function) to identify where the user originated from. I am new to JavaScript, and although the code below works, I was wondering if there is a more efficient way to achieve the same result.
try {
var path = parseUri(window.location).path;
var host = parseUri(document.referrer).host;
if (host == '') {
alert('no referrer');
}
else if (host.search(/google/) != -1 || host.search(/bing/) != -1 || host.search(/yahoo/) != -1) {
alert('Search Engine');
}
else {
alert('other');
}
}
catch(err) {}
You can simplify the host check using alternative searches:
else if (host.search(/google|bing|yahoo/) != -1) {
I'd also be tempted to test document referrer before extracting the host for your "no referrer" error.
(I've not tested this).
I end up defining a function called set in a lot of my projects. It looks like this:
function set() {
var result = {};
for (var i = 0; i < arguments.length; i++)
result[arguments[i]] = true;
return result;
}
Once you've got the portion of the hostname that you're looking for...
// low-fi way to grab the domain name without a regex; this assumes that the
// value before the final "." is the name that you want, so this doesn't work
// with .co.uk domains, for example
var domain = parseUri(document.referrer).host.split(".").slice(-2, -1)[0];
...you can elegantly test your result against a list using JavaScript's in operator and the set function we defined above:
if (domain in set("google", "bing", "yahoo"))
// do stuff
More info:
http://laurens.vd.oever.nl/weblog/items2005/setsinjavascript/
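The suggestions above can be combined into one small function that checks for an empty referrer first and then compares host name labels against a list. This is only a sketch: it swaps parseUri for the standard URL constructor, and classifyReferrer and its engines parameter are hypothetical names.

```javascript
// Classify a referrer URL as 'none', 'search engine', or 'other'.
function classifyReferrer(referrer, engines = ['google', 'bing', 'yahoo']) {
  if (!referrer) return 'none';
  let host;
  try {
    host = new URL(referrer).hostname;
  } catch (e) {
    return 'none'; // not a parseable absolute URL
  }
  // Compare whole host name labels, so "notgoogle.example.com" does not count
  const labels = host.toLowerCase().split('.');
  return labels.some(label => engines.includes(label)) ? 'search engine' : 'other';
}
```

In a page this would be called as classifyReferrer(document.referrer). Matching on whole labels also copes with hosts such as www.google.co.uk, which the slice-based approach does not handle.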
