I'm prototyping some text processing to prep research data for coding, and I've got a JavaScript replace statement that bombs in jsFiddle, and I cannot figure out why:
mE[1] = mE[1].replace(/<p.*>/ig, ''); // <<< this line
I'm trying to remove any opening paragraph tag.
If you look at http://jsfiddle.net/jotarkon/2e5gq/ and uncomment that line, you'll see that the script fails.
(Click on the heading to fire the function.)
This is driving me nuts. Any ideas what's going wrong?
The problem appears to be an actual illegal character somewhere in that line, and I don't think it has anything to do with the regex. Try typing the whole line in from scratch and delete that one. When I do that, the fiddle works fine (well, it doesn't get that error at least).
Edit: the illegal character is right after the semicolon on that line. Starting from the "//" on your "this line" comment, hit backspace a few times to erase the bogus character and the semicolon, then re-type the semicolon.
Edit 2: the offending bytes are C2 AD (hex), which is the UTF-8 encoding of U+00AD (soft hyphen).
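A minimal sketch of how one might locate such an invisible character without a hex editor. The helper name findSuspectChars is hypothetical, and the sample string simply assumes the soft hyphen sits right after the semicolon, as described above. It works on UTF-16 code units, which is enough for a character like U+00AD.

```javascript
// Hypothetical helper: report every character that is not tab, newline,
// carriage return, or printable ASCII (0x20-0x7E).
function findSuspectChars(source) {
  const suspects = [];
  for (let i = 0; i < source.length; i++) {
    const code = source.charCodeAt(i);
    const isPlain =
      code === 9 || code === 10 || code === 13 || (code >= 32 && code <= 126);
    if (!isPlain) {
      const hex = code.toString(16).toUpperCase().padStart(4, "0");
      suspects.push({ index: i, codePoint: "U+" + hex });
    }
  }
  return suspects;
}

// Assumed reproduction of the broken line: a soft hyphen after the semicolon.
const brokenLine = "mE[1] = mE[1].replace(/<p.*>/ig, '');\u00AD // <<< this line";
console.log(findSuspectChars(brokenLine));
```

Pasting the suspicious line into a string like this and checking the reported index shows where to delete and retype.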
First of all, don't use regexen for HTML. There are libraries available for that, and you can't parse HTML with regexen.
Second, you need to be more specific. Saying that a statement "bombs" tells us nothing about the nature of the error.
Finally, in case you're curious: that regex is greedy, so it will indiscriminately replace everything from the first tag that starts with the letter p up to the very last > on the same line. If you really want to use it, make it non-greedy and make sure it doesn't match other tags that start with the letter p. I'm not going to be specific because doing that is the Wrong Answer.
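Purely to illustrate the greediness point, and not as a recommendation to process HTML this way, here is a small sketch. The sample markup and the narrower pattern are assumptions made for the demonstration; the narrower pattern still breaks on things like a > inside an attribute value.

```javascript
// Assumed sample input: two <p> tags with a <pre> element in between.
const sampleHtml = '<p class="x">first</p><pre>code</pre><p>second</p>';

// The original pattern: .* runs to the last ">" on the line,
// so the whole string is swallowed by a single match.
const greedyResult = sampleHtml.replace(/<p.*>/ig, "");

// A narrower pattern: \b keeps it from matching <pre>, and [^>]* stops
// at the first ">" instead of the last one.
const narrowResult = sampleHtml.replace(/<p\b[^>]*>/ig, "");

console.log(JSON.stringify(greedyResult)); // ""
console.log(narrowResult); // first</p><pre>code</pre>second</p>
```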
I have a very malformed XML file and I have to make a couple of fixes before parsing it.
Specifically, I have to remove a number of CDATA markers (opening and closing) between two given tags, and I'd like to do it with one regex.
I have something like:
<tag>data...<!CDATA[...]]>...otherdata..!<CDATA[....]>>....</tag>
What I would like to do is replace all of the occurrences of CDATA markers (start and stop, so <!CDATA[ and ]>>) between the tags with nothing, removing them.
Thanks a lot!
EDIT 1:
I have thousands of files. I have a regex that extracts the content of the tag, i.e.
(<tag>)(?!<\/tag>)(.)*(<\/tag>)
but I cannot think of a way to insert a check inside the group, something like:
^(!<CDATA[|]]>)*
That's a fairly nasty bit of corruption you've got in your file; it looks like CDATA is malformed in several different ways. This catches all of the errors you've described:
<tag>.*?\K((?:<!|!<)CDATA\[.*?\]+>+)(?=.*<\/tag>)
This regex checks that the string starts with <tag>, gets text up to the "start" of your CDATA tag, and then uses \K to throw all of that away. Then, it looks for ! and < in any order, followed by CDATA[ and any text inside. Next comes as many ] or > as we can find, though always at least one of each. The final bit of the regex is a lookahead to make sure the closing tag is present.
Note that this will only match one malformed tag per line. In order to get them all, it's likely you'll need to run a replacement with this regex a few times. Once the regex has no more matches, you can be sure you're free of malformed tags... or at least, tags with the mutations you've described in your question.
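The "run it until nothing matches" loop can be sketched as follows. This is written in JavaScript purely for illustration; JavaScript regexes have no \K, so the sketch swaps it for a capture group that is put back with $1. The function name and the sample line are hypothetical. Like the regex above, it removes the whole malformed CDATA section, contents included.

```javascript
// Hypothetical helper: repeatedly delete one malformed CDATA section per pass
// until the line stops changing.
function removeMalformedCdata(line) {
  // Group 1 stands in for what \K would discard; it is restored via "$1".
  const pattern = /(<tag>.*?)(?:<!|!<)CDATA\[.*?\]+>+(?=.*<\/tag>)/;
  let previous;
  do {
    previous = line;
    line = line.replace(pattern, "$1");
  } while (line !== previous);
  return line;
}

// Assumed sample with both kinds of corruption from the question.
const corrupted = "<tag>data<!CDATA[abc]]>mid!<CDATA[def]>>end</tag>";
console.log(removeMalformedCdata(corrupted)); // <tag>datamidend</tag>
```

Each pass strictly shortens the string, so the loop always terminates.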
As an aside, if you want to keep all "properly formatted" CDATA tags, the regex gets WAY uglier:
<tag>.*?\K(?!<!CDATA\[[^\n\]]*\]>(?:[^>]|$))((?:<!|!<)CDATA\[.*?\]+>+)(?=.*<\/tag>)
This includes a lookahead to assert you're not matching a "properly formatted" CDATA tag (here described as <!CDATA[...]>). This one runs very slowly if the opening <tag> does not have a matching closing </tag>, so if that's an issue in your file(s), be warned.
Good luck!
::head
line 1
line 2
line 3
::content
content 1
content 2
content 3
How do I get the text of the "head" paragraph (the first part) with a regex? This is from a txt file.
Unfortunately, the regex below doesn't work in JavaScript, for the reason covered in the question "Javascript regex multiline flag doesn't work", so we have to tweak things a bit. A line break in a file shows up in a JavaScript string as \n. On Windows the line break also includes \r, but on Linux it doesn't, so our \s* becomes more important now that we're no longer relying on the line-end anchor ($). I also noticed that you don't need to match the other lines specifically, since line breaks are being ignored anyway.
/(::head[^]*?)\n\s*\n/m
This works in testing in Chrome, so it should work for your needs.
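A small sketch of that regex applied to the sample data. It assumes the file has an empty line between the ::head and ::content sections, which is what the pattern relies on; the variable names are made up for the example.

```javascript
// Assumed file contents: sections separated by one empty line.
const fileText =
  "::head\nline 1\nline 2\nline 3\n\n::content\ncontent 1\ncontent 2\ncontent 3\n";

// [^] matches any character, including line breaks, in JavaScript.
const headMatch = fileText.match(/(::head[^]*?)\n\s*\n/);

// trim() drops a trailing \r left over from Windows line endings.
const headText = headMatch ? headMatch[1].trim() : null;

console.log(headText);
```

With \r\n line endings the captured group ends in a stray \r, which is why the sketch trims the result.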
This is a little fancy, but it should fit if it is used in conjunction with many similar properties.
/(::head.*?$^.*?$)^\s*$/m
Note that you need the /m multiline flag.
Here it is tested against your sample data http://rubular.com/r/vtflEgDdkY
First, we check for the ::head marker. That's where we start collecting data in a group with (). Then we match anything with .*, made lazy with ?. Then we find the end of the line with $ and look for more lines of data: the line start ^, then anything .*?, then the line end $. This grabs multiple lines because of the multiline flag, so it's important to use lazy matching so we don't grab too much.
Then we look for an empty line. Normally you just need ^$ for that, but I wanted this to work even if someone had left a stray space or tab on the line between sections, so I used \s*, where the * accepts "0 or more" whitespace characters. Notice the empty line is not inside the group (), because that's not the data you care about.
For further reading on regex, I recommend http://www.regular-expressions.info/tutorial.html It's where I learned everything I know about regex.
You can use [\s\S]+::content to match everything until ::content:
const text = ...
const matches = text.match(/^([\s\S]+)::content/m)
const content = matches[1]
In received email, the first line is of the form:
** New update from 'Doug, Mon And Monta - Test Group': reply above this line to comment **
When somebody replies, on the server side I'm able to strip out those incoming lines with a simple indexOf() check.
The problem is that some mail clients (such as my own Apple Mail) add additional text above that line when replying of the form:
On Dec 28, 2012, at 10:19 AM, "XYZ Communities - Doug, Mon And Monta -
Test Group" wrote:
I tried trapping that with a regular expression like this:
var rx1 = new RegExp('on.*wrote:', 'ig');
While this works in most cases, it unfortunately also catches cases where a person might reply with text containing "on" earlier on, such as:
At that site, I think what we are interested in is the AgroTagger
service described on this page where...
Under certain circumstances the "on" in the above text is found and everything after that gets trimmed by my code.
I tried to narrow the scope of the regular expression by including the begin-of-line character and adding the multiline modifier like this:
var rx1 = new RegExp('^on.*wrote:', 'igm');
But in that case the line is not found at all and is included with the text. I guess the ^ metacharacter for beginning of a line doesn't really work for a line in the middle of a JavaScript string?
Anyway, any suggestions would be appreciated. Basically, I'm trimming out the "reply above" line ok using a few variations of indexOf(). What I need is an extra check after that for the case where a mail client adds more unneeded text above that line.
Thanks,
doug
P.S. If anybody can tell me how I can receive email notifications when replies are posted here, I would be very grateful. Nothing I've tried has worked so far.
For the text you describe, this is working in my tests...
emailBody = emailBody.replace(new RegExp("^On .+ wrote:$[.\\r\\n]*", "im"), "");
... which is very much like your regex, so I'm not sure why yours did not match at all.
I did notice in some of the emails in my inbox that there is whitespace at the beginnings of the lines. Perhaps this is the problem you're having? In that case, this would fix it:
emailBody = emailBody.replace(new RegExp("^\\s*On .+ wrote:\\s*$[.\\r\\n]*", "im"), "");
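As a rough sketch of the same idea in a testable form: the ^ anchor does match in the middle of a string once the m flag is set. The helper name cutAtReplyHeader and the sample bodies are hypothetical. The sketch also assumes the header may be wrapped onto a second line, as it appears in the question, and that everything from the header onward should be dropped.

```javascript
// Hypothetical helper: cut the body at a client-added "On ... wrote:" header.
function cutAtReplyHeader(emailBody) {
  // ^ and $ work per line because of the m flag. The optional group
  // allows the header to be wrapped onto one extra line.
  const header = /^[ \t]*On .*(?:\r?\n.*)?wrote:[ \t]*$/im;
  const match = header.exec(emailBody);
  if (!match) return emailBody;
  // Keep only the text above the header, minus trailing whitespace.
  return emailBody.slice(0, match.index).replace(/\s+$/, "");
}

// Assumed sample: a reply containing "on" mid-line, then a wrapped header.
const replyBody =
  "At that site, I think what we are interested in is the AgroTagger\n" +
  "service described on this page where...\n\n" +
  'On Dec 28, 2012, at 10:19 AM, "XYZ Communities - Doug, Mon And Monta -\n' +
  'Test Group" wrote:\n\n' +
  "** New update from 'Doug, Mon And Monta - Test Group': reply above this line to comment **";

console.log(cutAtReplyHeader(replyBody));
```

A reply line that itself starts with "On" and is followed by "wrote:" would still be cut, so this is a heuristic and not a guarantee.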
I am building a table, with content pulled from other elements in the page (page scraping).
I am using innerText or textContent to pull the text, then a regular expression to trim it:
string.replace(/^\s+|\s+$/g,"");
This works fine in IE 9 and Chrome, but in IE 8 I am getting a garbage character that I cannot identify. I was able to reproduce the behavior with alerts in jsfiddle:
http://jsfiddle.net/Te4FQ/
What is this extra character, and how can I get rid of it?
Update: thanks for the helpful replies! It seems that the character in question is U+200E (left-to-right mark). So the second part of my question remains: how can I get rid of such characters with a regular expression and keep just the regular text?
Both the "At Risk" and "Complete" <th> tags in your jsFiddle snippet have a U+200E (Left-to-Right Mark, aka LRM) code point at the end of their content. That is not a whitespace character, so it cannot be matched by \s.
One way to get rid of this character is to use the XRegExp library, so that you can replace all matches of \p{C} with the empty string (i.e., delete them). \p{C} matches any code point in Unicode's "Other" category, which includes control, format, private use, surrogate, and unassigned code points. U+200E, specifically, is within the \p{Cf} "Other, Format" subcategory.
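If pulling in a library is not an option, a minimal sketch is to list the common invisible format characters explicitly before trimming. The helper name cleanCellText and the exact set of code points are assumptions for illustration; this set is far narrower than \p{C} and only covers zero-width and directional marks, word joiner, and the byte order mark.

```javascript
// Hypothetical helper: remove common invisible format characters, then trim.
function cleanCellText(text) {
  return text
    // U+200B-U+200F: zero-width spaces/joiners and LRM/RLM,
    // U+202A-U+202E: directional embeddings and overrides,
    // U+2060: word joiner, U+FEFF: byte order mark.
    .replace(/[\u200B-\u200F\u202A-\u202E\u2060\uFEFF]/g, "")
    .replace(/^\s+|\s+$/g, "");
}

// Assumed sample: header text followed by a left-to-right mark.
console.log(JSON.stringify(cleanCellText("  At Risk\u200E ")));
```

Plain \uXXXX escapes inside a character class do not depend on newer regex features, which matters for the older browsers mentioned in the question.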
Try printing to the page the result of
escape(string.replace(/^\s+|\s+$/g,""));
Your garbage character should show up as an escape code.
I've already read all the articles here that touch on a similar problem, but I still can't get any solution working. In my case I want to wrap each word of a string in a span. The words contain special characters like 'äüö...'
What I am doing at the moment is:
var textWrap = text.replace(/\b([a-zA-Z0-9ßÄÖÜäöüÑñÉéÈèÁáÀàÂâŶĈĉĜĝŷÊêÔôÛûŴŵ-]+)\b/g, "<span>$1</span>");
But what happens is that if ä, ü, ñ or any other non-ASCII character is at the beginning or end of a word, it also acts like a boundary. Within a word these characters don't act as a boundary.
'Ärmelkanal' becomes Ä<span>rmelkanal</span> but should be <span>Ärmelkanal</span>
'Käse' works fine: it becomes <span>Käse</span>
'diré' becomes <span>dir</span>é but should be <span>diré</span>
Any advice would be much appreciated. I need to do this on the client side :-( BTW, did I mention that I hate regular expressions ;-)
Thank you very much!
The problem is that JavaScript recognizes word boundaries only before/after ASCII letters (and numbers/underscore). Just drop the \b anchors and it should work.
result = subject.replace(/[a-zA-Z0-9ßÄÖÜäöüÑñÉéÈèÁáÀàÂâŶĈĉĜĝŷÊêÔôÛûŴŵ-]+/g, "<span>$&</span>");
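A variant sketch, assuming an engine that supports Unicode property escapes (the u flag with \p{...}, ES2018 and later), which would not have been available in older browsers. It replaces the hand-written letter list with \p{L} and \p{N}; the function name wrapWords is hypothetical.

```javascript
// Hypothetical helper: wrap every run of letters, digits, and hyphens in a span.
// Requires an engine with Unicode property escapes (u flag).
function wrapWords(subject) {
  return subject.replace(/[\p{L}\p{N}\-]+/gu, "<span>$&</span>");
}

console.log(wrapWords("\u00C4rmelkanal dir\u00E9 K\u00E4se"));
// <span>Ärmelkanal</span> <span>diré</span> <span>Käse</span>
```

As in the answer above, no \b anchors are needed, because the character class itself defines where a word starts and ends.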