replace similar string in a text using javascript regex - javascript

We have a text like:
this is a test :rep more text more more :rep2 another text text qweqweqwe.
or
this is a test :rep:rep2 more text more more :rep2:rep another text text qweqweqwe. (without spaces)
We need to replace :rep with TEXT1 and :rep2 with TEXT2.
Problem:
When we try to replace using something like:
rgobj = new RegExp(":rep","gi");
txt = txt.replace(rgobj,"TEXT1");
rgobj = new RegExp(":rep2","gi");
txt = txt.replace(rgobj,"TEXT2");
We get TEXT1 in both of them (:rep2 becomes TEXT12), because :rep is a prefix of :rep2 and :rep is processed first.

If you require that :rep always end with a word boundary, make it explicit in the regex:
new RegExp(":rep\\b","gi");
(If you don't require a word boundary, you can't distinguish what is meant by "hello I got :rep24 eggs" -- is that :rep, :rep2, or :rep24?)
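A minimal sketch of the word-boundary idea, using the two placeholders from the question. The lookup table and the function name replacePlaceholders are illustrative assumptions, not part of the original answer:

```javascript
// Hypothetical lookup table: placeholder names (without the colon) and their replacements.
const placeholderValues = { rep: "TEXT1", rep2: "TEXT2" };

function replacePlaceholders(text) {
  // \b requires the name to end at a word boundary, so ":rep" cannot match inside ":rep2".
  return text.replace(/:(rep2|rep)\b/gi, (match, name) => placeholderValues[name.toLowerCase()]);
}

console.log(replacePlaceholders("this is a test :rep:rep2 more text :rep2:rep another text."));
console.log(replacePlaceholders("hello I got :rep24 eggs")); // left unchanged: no boundary after :rep or :rep2
```

Both placeholders are handled in one pass here. With two separate replace calls, inserting TEXT1 or TEXT2 directly next to the other placeholder (as in :rep:rep2) would remove the boundary that the second pattern relies on.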
EDIT:
Based on the new information that the match strings are provided by the user, the best solution is to sort the match strings by length and perform the replacements in that order. That way the longest strings get replaced first, eliminating the risk that the beginning of a long string will be partially replaced by a shorter substring match included in that long string. Thus, :replongeststr is replaced before :replong which is replaced before :rep .
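The longest-first ordering could be sketched as follows. The names escapeRegExp and replaceLongestFirst are hypothetical, and the sketch assumes that no replacement text itself contains one of the match strings:

```javascript
// Escape characters that have a special meaning in a regular expression,
// since the match strings come from the user.
function escapeRegExp(source) {
  return source.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// replacements: an object mapping each match string to its replacement text.
function replaceLongestFirst(text, replacements) {
  // Sort a copy of the keys so that longer match strings are processed first.
  const keys = Object.keys(replacements).sort((a, b) => b.length - a.length);
  for (const key of keys) {
    const pattern = new RegExp(escapeRegExp(key), "gi");
    // A function replacement avoids special handling of "$" in the replacement text.
    text = text.replace(pattern, () => replacements[key]);
  }
  return text;
}

console.log(replaceLongestFirst(
  "this is a test :rep:rep2 more :rep2:rep end",
  { ":rep": "TEXT1", ":rep2": "TEXT2" }
));
```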

If your data is always consistent, replace :rep2 before :rep.
Otherwise, you could search for :rep\s, searching for the space after the keyword. Just make sure you replace the space as well.

Related

Get all the WORDS except one specific word

I want to get all the words, except one, from a string using JS regex match function. For example, for a string testhello123worldtestWTF, excluding the word test, the result would be helloworldWTF.
I realize that I have to do it using look-ahead functions, but I can't figure out how exactly. I came up with the following regex (?!test)[a-zA-Z]+(?=.*test), however, it works only partially.
http://refiddle.com/refiddles/59511c2075622d324c090000
IMHO, I would just replace the offending word with an empty string, no?
Lookarounds seem to be overkill for this; you can just replace test with nothing:
var str = 'testhello123worldtestWTF';
var res = str.replace(/test/g, '');
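The expected result in the question (helloworldWTF) also drops the digits. Assuming that is intended, a second step that keeps only runs of letters gets there; the variable names below are illustrative:

```javascript
const sample = "testhello123worldtestWTF";

// Step 1: remove every occurrence of the excluded word.
const withoutWord = sample.replace(/test/g, "");

// Step 2: keep only the runs of letters and join them back together.
// match() returns null when nothing matches, hence the fallback to an empty array.
const lettersOnly = (withoutWord.match(/[a-zA-Z]+/g) || []).join("");

console.log(withoutWord); // "hello123worldWTF"
console.log(lettersOnly); // "helloworldWTF"
```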
Plugging this into your refiddle produces the results you're looking for:
/(test)/g
It matches all occurrences of the word "test" without picking up unwanted words/letters. You can set this to whatever variable you need to hold these.
WORDS OF CAUTION
Seeing that you have no set delimiters in your inputted string, I must say that you cannot reliably exclude a specific word - to a certain extent.
For example, if you want to exclude test, this might create a problem if the input was protester or rotatestreet. You don't have clear demarcations of what a word is, thus leading you to exclude test when you might not have meant to.
On the other hand, if you just want to ignore the string test regardless, just replace test with an empty string and you are good to go.
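The caution above can be seen in a short sketch. The last example assumes input where words are separated by spaces, which the original string in the question is not:

```javascript
// Without delimiters, "test" is also removed from the middle of longer words.
const fromProtester = "protester".replace(/test/g, "");
const fromRotatestreet = "rotatestreet".replace(/test/g, "");
console.log(fromProtester);    // "proer"
console.log(fromRotatestreet); // "rotareet"

// Assumption: words are delimited. Then \b restricts the match to the whole word.
const delimited = "a test for every protester".replace(/\btest\b/g, "");
console.log(delimited); // "a  for every protester" (the surrounding spaces remain)
```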

Extracting both the full match, and the last token match in a regexp

I have a little interesting issue here. I have a plaintext URL coming from Excel and I need to change it to an HTML URL with a unique body. Here is the regex code for javascript:
text = text.toString().replace(/=hyperlink\(([#\\\w\s\(\)-\.\/]+)\)/g, "<a href='file:///$1'>$1</a>");
This works perfectly fine for what it does. Example, text is:
=hyperlink("\\share\folder\log\2013\13-05-13\13-05-13.txt")
regex turns it into
\\share\folder\log\2013\13-05-13\13-05-13.txt
However, I need the inner HTML to be just the text file name:
13-05-13.txt
To further complicate the matter, the original text the regex is going through is not a single occurrence. It is an entire spreadsheet with 100's of rows that contain this. So the regex will be matching and replacing 100's of these strings in one operation.
Hopefully it is possible to get this all done in one regexp on the entire string, but I suppose I could loop through each line of the string first...
If there is no way to do this with a single regex, what do you think the best approach is? (No PHP/Python/server side. Just JavaScript, HTML, jQuery, etc.)
I guess you could use this regex:
=hyperlink\("([#\\\w\s\(\)\-\.\/]+\\([^"]+))"\)
And this new replace:
$2
I'm not sure how your regex was working, but I added the quotes in the regex and replaced the single quotes by double quotes in the replace. Revert those if need be.
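Put together, the suggestion could look like the sketch below. The regex is the one from this answer; the replacement string is an assumption reconstructed from the question's original replace call, with $1 as the link target and $2 as the link text:

```javascript
// Group 1 captures the full path, group 2 only the part after the last backslash.
const hyperlinkPattern = /=hyperlink\("([#\\\w\s\(\)\-\.\/]+\\([^"]+))"\)/g;

// One spreadsheet cell as it appears in the question (backslashes doubled for the JS string literal).
const cellText = '=hyperlink("\\\\share\\folder\\log\\2013\\13-05-13\\13-05-13.txt")';

// Assumed replacement: same link as in the question, but with the file name as inner text.
const linkHtml = cellText.replace(hyperlinkPattern, '<a href="file:///$1">$2</a>');

console.log(linkHtml);
```

Because of the g flag, the same call handles every matching row when the whole spreadsheet text is passed in at once.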

getElementById replace HTML

<script type="text/javascript">
var haystackText = document.getElementById("navigation").innerHTML;
var matchText = 'Subscribe to RSS';
var replacementText = '<ul><li>Some Other Thing Here</li></ul>';
var replaced = haystackText.replace(matchText, replacementText);
document.getElementById("navigation").innerHTML = replaced;
</script>
I'm trying to replace a string of HTML code with something else. I cannot edit the code directly, so I'm using JavaScript to alter it.
If I use the above method with matchText set to a regular string, such as just 'Subscribe to RSS', I can replace it fine. However, once I try to replace an HTML string, the code 'fails'.
Also, what if the HTML I wish to replace contains line breaks? How would I search for that?
<ul><li>\n</li></ul>
??
What should I be using or doing instead of this? Or am I just missing a small step? I did search around here, but maybe my keywords for the search weren't optimal to find a result that fit my situation...
Edit: Gonna mention, I'm writing this script in the footer of my page, well after the text I wish to replace, so it's not an issue of the script being written before what I want to overwrite to appear. :)
Currently you are using String.replace(substring, replacement) that will search for an exact match of the substring and replace it with the replacement e.g.
"Hello world".replace("world", "Kojichan") => "Hello Kojichan"
The problem with exact matching is that it allows nothing but an exact match: any difference in whitespace or line breaks makes the search fail.
To solve the problem, you'll have to start using regular expressions. When using regular expressions you have to be aware of
special characters such as ?, /, and \ that need to be escaped as \?, \/, \\
multiline mode /regexp/m, which makes ^ and $ match at line breaks
global matching /regexp/g if you want to replace more than one instance of the expression
quantifiers for allowing multiple instances of white space: \s+ for [1..n] white-space characters and \s* for [0..n] white-space characters.
To use a regular expression instead of substring matching you just need to change String.replace("substring", "replacement") to String.replace(/regexp/, "replacement") e.g.
"Hello world".replace(/world/, "Kojichan") => "Hello Kojichan"
From MDN:
Note: If a <div>, <span>, or <noembed> node has a child text node that
includes the characters (&), (<), or (>), innerHTML returns these
characters as &amp;, &lt; and &gt; respectively. Use element.textContent
to get a correct copy of these text nodes' contents.
So since textContent (or innerText) won't get you the HTML, you'd have to modify your search string appropriately.
You can use Regular Expressions.
I recommend using a regular expression. Note that ? and / are special characters in regular expressions. For global matching you need the g flag; the m flag only changes how ^ and $ behave around line breaks.
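For the line-break case raised in the question, a whitespace-tolerant pattern might look like the sketch below. The sample markup is made up for illustration, and the pattern assumes the tags carry no attributes:

```javascript
// Hypothetical markup as innerHTML might return it, with line breaks and indentation.
const navigationHtml = "<div><ul>\n  <li>\n  </li>\n</ul></div>";

// \s* allows any amount of white space (including line breaks) between the tags;
// the forward slashes of the closing tags are escaped inside the regex literal.
const emptyListPattern = /<ul>\s*<li>\s*<\/li>\s*<\/ul>/g;

const updatedHtml = navigationHtml.replace(
  emptyListPattern,
  "<ul><li>Some Other Thing Here</li></ul>"
);

console.log(updatedHtml); // "<div><ul><li>Some Other Thing Here</li></ul></div>"
```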
Regular expression matching of HTML (other than plain text) that comes out of a web page is a bad idea and is troublesome to make work cross-browser (particularly in IE). The HTML that comes out of a web page does not always look the same as what was put in, because some browsers reconstitute the HTML and don't actually store what went in. Attribute order can change, quote marks can change or disappear, entities can change, etc.
If you want to modify whole tags, then you should directly access the DOM and operate on the actual objects in the page.

Can't use javascript regex to get everything between html/xml tags

So I receive some xml in plaintext (and no I can't use DOM or JSON because apparently I am not allowed to), I want to strip all elements encased in a certain element and put them into an array, where I can strip out the text in the individual segments.
Now I am used to using POSIX regex and I will never actually understand the point behind PCRE regex, nor do I get the syntax.
Now here is the code I am using:
var strResponse = objResponse.text;
var strRegex = new RegExp("<item>(.*?)<\/item>","i");
var arrMatches = "";
var match;
while (match = strRegex.exec(strResponse)) {
arrMatches[] = match[1];
}
I have no idea why it won't find any matches with this code, can someone please help me on this and perhaps elaborate on what exactly it is I am continuously doing wrong with the PCRE syntax?
If those tags are on different lines, the . will not match the newline characters and therefore your expression will not match. This is just a guess; I don't know your source.
You can try
var strRegex = new RegExp("<item>([\\s\\S]*?)<\\/item>","i");
[\\s\\S] is a character class containing all whitespace and all non-whitespace characters. Line breaks are covered by the whitespace characters.
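A runnable version of the whole loop could look like this sketch. The function name extractItems is hypothetical. Compared with the code in the question, it adds the g flag (without it, exec restarts from the beginning on every call, so the loop never ends once there is a match) and collects results with push on a real array:

```javascript
function extractItems(xmlText) {
  // "g" makes exec() continue after the previous match; "i" ignores the tag's case.
  const itemRegex = new RegExp("<item>([\\s\\S]*?)<\\/item>", "gi");
  const items = [];
  let match;
  // exec() returns null once there are no further matches.
  while ((match = itemRegex.exec(xmlText)) !== null) {
    items.push(match[1]);
  }
  return items;
}

console.log(extractItems("<item>first\nline</item>\n<ITEM>second</ITEM>"));
```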
The best way to complete this task is using the following, to parse it as proper HTML and navigate it with the DOM parser:
Javascript function to parse HTML string into DOM?
Regex is error-prone here and is in general not well suited to parsing irregular text like HTML structure.

Trying to remove trailing text

I have the following code. I want to extract the last text (hello64) from it.
<span class="qnNum" id="qn">4</span><span>.</span> hello64 ?*
I used the code below but it removes all the integers
questionText = questionText.replace(/<span\b.*?>/ig, "");
questionText=questionText.replace(/<\/span>/ig, "");
questionText = questionText.replace(/\d+/g,"");
questionText = questionText.replace("*","");
questionText = questionText.replace(". ","");
I want to remove the first integer and keep the rest of the integers.
It's the third line .replace(/\d+/g,"") which is replacing the integers. If you want to keep the integers, then don't replace \d+, because that matches one or more digits.
You could achieve most of that all on one line, by the way - there's no need to have multiple replaces there:
var questionText = questionText.replace(/((<span\b.*?>)|(<\/span>)|(\d+))/ig, "");
That would do the same as the first three lines of your code. (Of course, you'd need to drop the |(\d+), as per the first part of the answer, if you didn't want to get rid of the digits.)
[EDIT]
Re your comment that you want to replace the first integer but not the subsequent ones:
The regex string to do this would depend very heavily on what the possible input looks like. The problem is that you've given us a bit of random HTML code; we don't know from that whether you're expecting it to always be in this precise format (ie a couple of spans with contents, followed by a bit at the end to keep). I'll assume that this is the case.
In this case, a much simpler regex for the whole thing would be to replace everything within <span....</span> with blank:
var questionText = questionText.replace(/(<span\b.*?>.*?<\/span>)/ig, "");
This will eliminate the whole of the <span> tags plus their contents, but leave anything outside of them alone.
In the case of your example this would provide the desired effect, but as I say, it's hard to know if this will work for you in all cases without knowing more about your expected input.
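Applied to the snippet from the question, the span-stripping replace behaves as in this sketch. The variable names are illustrative, and the trim call is an addition to drop the leftover leading space:

```javascript
const questionHtml = '<span class="qnNum" id="qn">4</span><span>.</span> hello64 ?*';

// Remove each <span ...>...</span> pair together with its contents.
const withoutSpans = questionHtml.replace(/(<span\b.*?>.*?<\/span>)/ig, "");

console.log(withoutSpans);        // " hello64 ?*"
console.log(withoutSpans.trim()); // "hello64 ?*"
```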
In general it's considered difficult to parse arbitrary HTML code with regex. Regex is a contraction of "Regular Expressions", which is a way of saying that they are good at handling strings which have 'regular' syntax. Arbitrary HTML is not a 'regular' syntax due to its unlimited possible levels of nesting. What I'm trying to say here is that if you have anything more complex than the simple HTML snippets you've supplied, then you may be better off using an HTML parser to extract your data.
This will match the complete string and put the part after the last </span> till the next word boundary \b into the capturing group 1. You then just need to replace the match with group 1, i.e. $1.
searched_string = string.replace(/^.*<\/span>\s*([A-Za-z0-9]+)\b.*$/, "$1");
The captured word can consist of [A-Za-z0-9]. If you want to have anything else there just add it into that group.
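A quick check of this approach against the snippet from the question, as a sketch that assumes the input is a single line (the . in the pattern does not match line breaks):

```javascript
const rawQuestion = '<span class="qnNum" id="qn">4</span><span>.</span> hello64 ?*';

// The greedy ^.* runs up to the last </span>; group 1 captures the word that follows it.
const lastWord = rawQuestion.replace(/^.*<\/span>\s*([A-Za-z0-9]+)\b.*$/, "$1");

console.log(lastWord); // "hello64"
```

If the pattern does not match at all, replace returns the input unchanged, so it may be worth checking the result before using it.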
