::head
line 1
line 2
line 3
::content
content 1
content 2
content 3
How do I get "head" paragraph(first part) text with regex? This is from txt file.
Unfortunately, the below doesn't work in javascript because of this: Javascript regex multiline flag doesn't work. So we have to tweak things a bit. A line break in a file can be found in javascript strings as \n. In windows this includes \r but not in linux, so our \s* becomes more important now that we're doing this without using line-ending characters ($). I also noticed that you don't need to specifically gather the other lines, since line breaks are being ignored anyway.
/(::head[^]*?)\n\s*\n/m
This works in testing in Chrome, so it should work for your needs.
this is a little fancy, but it should fit if this is used in conjunction with many similar properties.
/(::head.*?$^.*?$)^\s*$/m
Note that you need the /m multiline flag.
Here it is tested against your sample data http://rubular.com/r/vtflEgDdkY
First, we check for the ::head data. That's where we start collecting information in a group with (). Then we look for anything with .*, but we do so with the lazy ? flag. Then we find the end of the line with $ and look for more lines with data with the line start ^ then anything .*? then the line end $ this will grab multiple lines because of the multiline flag, so it's important to use the lazy matching ? so we don't grab too much data. Then we look for an empty line. Normally you just need ^$ for that, but I wanted to make sure this would work if someone had stuck a stray space or tab on the lines in between sections, so we used \s* to grab spaces. The * allows it to find "0 or more" spaces as acceptable. Notice we didn't include the empty line in the group () because that's not the data you care about.
For further reading on regex, I recommend http://www.regular-expressions.info/tutorial.html It's where I learned everything I know about regex.
You can use [\s\S]+::content to match everything until ::content:
const text = ...
const matches = text.match(/^([\s\S]+)::content/m)
const content = matches[1]
Related
I have an application which uses a Javascript-based rules engine. I need a way to convert regular straight quotes into curly (or smart) quotes. It’d be easy to just do a string.replace for ["], only this will only insert one case of the curly quote.
The best way I could think of was to replace the first occurrence of a quote with a left curly quote and every other one following with a left, and the rest right curly.
Is there a way to accomplish this using Javascript?
You could replace all that preceed a word character with the left quote, and all that follow a word character with a right quote.
str = str.replace(/"(?=\w|$)/g, "“");
str = str.replace(/(?<=\w|^)"/g, "”"); // IF the language supports look-
// behind. Otherwise, see below.
As pointed out in the comments below, this doesn't take punctuation into account, but easily can:
/(?<=[\w,.?!\)]|^)"/g
[Edit:] For languages that don't support look-behind, like Javascript, as long as you replace all the front-facing ones first, you have two options:
str = str.replace(/"/g, "”"); // Replace the rest with right curly quotes
// or...
str = str.replace(/\b"/g, "”"); // Replace any quotes after a word
// boundary with right curly quotes
(I've left the original solution above in case this is helpful to someone using a language that does support look-behind)
You might want to look at what Pandoc does—apparently with the --smart option, it handles quotes properly in all cases (including e.g. ’tis and ’twere).
I recently wrote a Javascript typography prettification engine that does, among other things, quote replacement; I wound up using basically the algorithm suggested by Renesis, but there’s currently a failing test up waiting for a smarter solution.
If you’re interested in cribbing my code (and/or submitting a patch based on work you’ve done), check it out: jsPrettify. jsprettify.prettifyStr does what you’re looking for. If you don’t want to deal with the Closure dependency, there’s an older version that runs on its own—it even works in Rhino.
'foo "foo bar" "bar"'.replace(/"([-a-zA-Z0-9 ]+)"/g, function(wholeMatch, m1){
return "“" + m1 + "”";
});
The following just changes every quote by alternating (this specific example however would leave out the orphaned quotes).
str.replace(/\"([^\"]*)\"/gi,"“$1”");
Works perfectly, as long as the text you're texturizing isn't already screwed up with improper use of the double quote. In English, quotes are never nested.
I don't think something like that in general is easy at all, because you'd have to interpret exactly what each double-quote character in your content means. That said, what I'd do is collect all the text nodes I was interested in, and then go through and keep track of the "on/off" (or "odd/even"; whatever) nature of each double quote instance. Then you can know which replacement entity to use.
I didn't find the logic I wanted here, so here's what I ended up going with.
value = value.replace(/(^|\s)(")/g, "$1“"); // replace quotes that start a line or follow spaces
value = value.replace(/"/g, "”"); // replace rest of quotes with the back smart quote
I have a small textarea that I need to replace straight quotes with curly (smart) quotes. I'm just executing this logic on keyup. I tried to make it behave like Microsoft Word.
Posting for posterity.
As suggested by #Steven Dee, I went to Pandoc.
I try to use a mature and tested tool whenever I can versus baking my own regex. Hand built regex's can be overly greedy, or not greedy enough, and they may not be sensitive to word boundaries and commas etc. Pandoc accounts for most this and more.
From the command line (the --smart parameter turns on smart quotes):
pandoc --smart --standalone -o output.html input.html
..and I know a command line script may or may not fit OP's requirement of using Javascript. (related: How to execute shell command in Javascript)
This is the file which contains merge conflicts,
<<<<<<< HEAD
$conf['some_unit_id'] = '4-qw-gg-ds-sometext';
=======
// Some Snippets Site Info
$conf['site_info'] = array(
'customer_service_phone' => '+1 323223232
'logo_path' => 'https://www.google.com/img/icons/src/logo.svg',
'currency' => 'CAD',
'https://www.youtube.com/user/somewebsite/ogog',
'https://www.instagram.com/somewebsite/',
),
);
>>>>>>> ff6df3435231fdff78fwsd83e7dffa0732eft554
// Somes code
$done['rules'] = TRUE;
Am trying to find the best regular expression that detect merge conflicts in the file. Initially I tried with :
/(<* HEAD)/
Which will detect only HEAD with some preceding <
I have some other markers as well like :
1. ======
2. >>>>> ff6df3435231fdff78fwsd83e7dffa0732eft554
These two markers must detect along with HEAD marker as well. And if a developer fixes the merge conflicts only <* HEAD and rest of the ie., ===== and >>> ff6df3435231fdff78fwsd83e7dffa0732eft554 the regular expression should detect that as well.
Since this regular expression am using in pre-commit hook. If one pattern detected in file commit will break. I need exact regex to detect merge conflict markings.
Any solution would be appreciated.
Since they're all the same length, you can use a character group:
/^[<=>]{7}( .+)?$/mg
(make sure to use a multiline regex)
You can use:
^<{7} HEAD(?:(?!={7})[\s\S])*={7}(?:(?!>{7} \w+)[\s\S])*>{7} \w+
Demo & explanation
You might also match all the lines by checking the start of each line to prevent some of the unnecessary backtracking using [\s\S].
First match the <<<<<<< HEAD part, then match all following lines that do not start with ======= and then match it.
Then match all lines that do not start with >>>>>>> followed by matching it and chars [a-z0-9].
^<{7} HEAD(?:\r?\n(?!={7}\r?\n).*)*\r?\n={7}(?:\r?\n(?!>{7} ).*)*\r?\n>{7} [a-z0-9]+
Regex demo
If you want to highlight the markers, you could use a capturing group:
^(<{7} HEAD)(?:\r?\n(?!={7}\r?\n).*)*\r?\n(={7})(?:\r?\n(?!>{7} ).*)*\r?\n(>{7} [a-z0-9]+)
Regex demo
If I understand correctly your desire, you want to find block code that need to resolve conflict. I hope my suggestion can help you.
/^<{7}\sHEAD[\s\S]+?>{7}\s\w+$/gm
Details:
Mode: multiline
^<{7}\sHEAD: block code starts with <<<<<<< HEAD
[\s\S]+?: get any character as few times as possible (line break accepted)
{7}\s\w+$: block code ends with >>>>>>> commit hash
Demo
i have a very malformed xml file and i have to do a couple of fixes before parsing it.
In details, i have to replace a number of cdatas (open and close) between 2 given tags, and i'd like to do it with one regex.
I have some like:
<tag>data...<!CDATA[...]]>...otherdata..!<CDATA[....]>>....</tag>
What i would like to is replace all of the occurencies of cdata (start and stop, so <!CDATA[ and ]>>) betwenn the tag with nothing, removing them.
Thanks a lot!
EDIT 1:
I have thousands of files. I have a regex that extract the content of the tag, i.e.
(<tag>)(?!<\/tag>)(.)*(<\/tag>)
but i cannot think of a way to insert a check inside the group, something like:
^(!<CDATA[|]]>)*
That's a fairly nasty bit of corruption you've got in your file; it looks like CDATA is malformed in several different ways. This catches all of the errors you've described:
<tag>.*?\K((?:<!|!<)CDATA\[.*?\]+>+)(?=.*<\/tag>)
This regex checks that the string starts with <tag>, gets text up to the "start" of your CDATA tag, and then uses \K to throw all of that away. Then, it looks for ! and < in any order, followed by CDATA[ and any text inside. Next comes as many ] or > as we can find, though always at least one of each. The final bit of the regex is a lookahead to make sure the closing tag is present. Try it here!
Note that this will only match one malformed tag per line. In order to get them all, it's likely you'll need to run a replacement with this regex a few times. Once the regex has no more matches, you can be sure you're free of malformed tags... or at least, tags with the mutations you've described in your question.
As an aside, if you want to keep all "properly formatted" CDATA tags, the regex gets WAY uglier:
<tag>.*?\K(?!<!CDATA\[[^\n\]]*\]>(?:[^>]|$))((?:<!|!<)CDATA\[.*?\]+>+)(?=.*<\/tag>)
This includes a lookahead to assert you're not matching a "properly formatted" CDATA tag (here described as <!CDATA[...]>). This one runs really slow if the start <tag> does not have a closing <tag> that matches, so if that's an issue in your file(s), be warned. Try it here!
Good luck!
var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre.*?<\/pre>/gm );
alert(arr); // null
I'd want the PRE block be picked up, even though it spans over newline characters. I thought the 'm' flag does it. Does not.
Found the answer here before posting. SInce I thought I knew JavaScript (read three books, worked hours) and there wasn't an existing solution at SO, I'll dare to post anyways. throw stones here
So the solution is:
var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre[\s\S]*?<\/pre>/gm );
alert(arr); // <pre>...</pre> :)
Does anyone have a less cryptic way?
Edit: this is a duplicate but since it's harder to find than mine, I don't remove.
It proposes [^] as a "multiline dot". What I still don't understand is why [.\n] does not work. Guess this is one of the sad parts of JavaScript..
DON'T use (.|[\r\n]) instead of . for multiline matching.
DO use [\s\S] instead of . for multiline matching
Also, avoid greediness where not needed by using *? or +? quantifier instead of * or +. This can have a huge performance impact.
See the benchmark I have made: https://jsben.ch/R4Hxu
Using [^]: fastest
Using [\s\S]: 0.83% slower
Using (.|\r|\n): 96% slower
Using (.|[\r\n]): 96% slower
NB: You can also use [^] but it is deprecated in the below comment.
[.\n] does not work because . has no special meaning inside of [], it just means a literal .. (.|\n) would be a way to specify "any character, including a newline". If you want to match all newlines, you would need to add \r as well to include Windows and classic Mac OS style line endings: (.|[\r\n]).
That turns out to be somewhat cumbersome, as well as slow, (see KrisWebDev's answer for details), so a better approach would be to match all whitespace characters and all non-whitespace characters, with [\s\S], which will match everything, and is faster and simpler.
In general, you shouldn't try to use a regexp to match the actual HTML tags. See, for instance, these questions for more information on why.
Instead, try actually searching the DOM for the tag you need (using jQuery makes this easier, but you can always do document.getElementsByTagName("pre") with the standard DOM), and then search the text content of those results with a regexp if you need to match against the contents.
You do not specify your environment and version of JavaScript (ECMAScript), and I realise this post was from 2009, but just for completeness:
With the release of ECMA2018 we can now use the s flag to cause . to match \n (see https://stackoverflow.com/a/36006948/141801).
Thus:
let s = 'I am a string\nover several\nlines.';
console.log('String: "' + s + '".');
let r = /string.*several.*lines/s; // Note 's' modifier
console.log('Match? ' + r.test(s)); // 'test' returns true
This is a recent addition and will not work in many current environments, for example Node v8.7.0 does not seem to recognise it, but it works in Chromium, and I'm using it in a Typescript test I'm writing and presumably it will become more mainstream as time goes by.
Now there's the s (single line) modifier, that lets the dot matches new lines as well :)
\s will also match new lines :D
Just add the s behind the slash
/<pre>.*?<\/pre>/gms
[.\n] doesn't work, because dot in [] (by regex definition; not javascript only) means the dot-character. You can use (.|\n) (or (.|[\n\r])) instead.
I have tested it (Chrome) and it's working for me (both [^] and [^\0]), by changing the dot (.) with either [^\0] or [^] , because dot doesn't match line break (See here: http://www.regular-expressions.info/dot.html).
var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre[^\0]*?<\/pre>/gm );
alert(arr); //Working
In addition to above-said examples, it is an alternate.
^[\\w\\s]*$
Where \w is for words and \s is for white spaces
[\\w\\s]*
This one was beyond helpful for me, especially for matching multiple things that include new lines, every single other answer ended up just grouping all of the matches together.
While writing an API service for my site, I realized that String.split() won't do it much longer, and decided to try my luck with regular expressions. I have almost done it but I can't find the last bit. Here is what I want to do:
The URL represents a function call:
/api/SECTION/FUNCTION/[PARAMS]
This last part, including the slash, is optional. Some functions display a JSON reply without having to receive any arguments. Example: /api/sounds/getAllSoundpacks prints a list of available sound packs. Though, /api/sounds/getPack/8Bit prints the detailed information.
Here is the expression I have tried:
req.url.match(/\/(.*)\/(.*)\/?(.*)/);
What am I missing to make the last part optional - or capture it in whole?
This will capture everything after FUNCTION/ in your URL, independent of the appearance of any further / after FUNCTION/:
FUNCTION\/(.+)$
The RegExp will not match if there is no part after FUNCTION.
This regex should work by making last slash and part after optional:
/^\/[^/]*\/[^/]*(?:\/.*)?$/
This matches all of these strings:
/api/SECTION/FUNCTION/abc
/api/SECTION
/api/SECTION/
/api/SECTION/FUNCTION
Your pattern /(.*)/(.*)/?(.*) was almost correct, it's just a bit too short - it allows 2 or 3 slashes, but you want to accept anything with 3 or 4 slashes. And if you want to capture the last (optional) slash AND any text behind it as a whole, you simply need to create a group around that section and make it optional:
/.*/.*/.*(?:/.+)?
should do the trick.
Demo. (The pattern looks different because multiline mode is enabled, but it still works. It's also a little "better" because it won't match garbage like "///".)