Regex to convert markdown to html - javascript

My goal is to take a markdown text and create the necessary bold/italic/underline html tags.
Looked around for answers, got some inspiration but I'm still stuck.
I have the following typescript code, the regex matches the expression including the double asterisk:
var text = 'My **bold\n\n** text.\n'
var bold = /(?=\*\*)((.|\n)*)(?<=\*\*)/gm
var html = text.replace(bold, '<strong>$1</strong>');
console.log(html)
Now the result of this is: My <strong>**bold\n\n**</strong> text.
Everything is great aside from the leftover double asterisk.
I also tried to remove them in a later 'replace' statement, but this creates further issues.
How can I ensure they are removed properly?

With your pattern (?=\*\*)((.|\n)*)(?<=\*\*) you assert (not match) with (?=\*\*) that there is ** directly to the right.
Then directly after that, you capture the ** using ((.|\n)*) so then it becomes part of the match.
Then at the end you assert again with (?<=\*\*) that there is ** directly to the left, but ((.|\n)*) has already matched it.
This way you end up with all the ** in the match.
You don't need lookarounds at all, as you are already using a capture group.
In JavaScript you could match the ** on the left and right and capture any character (including newlines) in a capture group:
\*\*([^]*?)\*\*
But I would suggest using a dedicated parser to parse markdown instead of using a regex.
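As a minimal sketch of the capture-group approach above: the helper name boldToHtml is made up for illustration, and this only handles **bold** spans, not markdown in general.

```javascript
// Hypothetical helper (name chosen for this example only).
// \*\* matches the literal delimiters, so they are consumed and dropped.
// [^] matches any character including newlines (a JavaScript-specific idiom);
// the lazy *? stops at the nearest closing **.
function boldToHtml(text) {
  return text.replace(/\*\*([^]*?)\*\*/g, '<strong>$1</strong>');
}

var html = boldToHtml('My **bold\n\n** text.\n');
console.log(JSON.stringify(html));
```

Because the delimiters are matched outside the group, only the text between them ends up in $1, so no second replace is needed to clean up leftover asterisks.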

Just make another call to replaceAll, replacing the ** with an empty string.
var text = 'My **bold\n\n** text.\n'
var bold = /(?=\*\*)((.|\n)*)(?<=\*\*)/gm
var html = text.replace(bold, '<strong>$1</strong>');
html = html.replaceAll(/\*\*/gm,'');
console.log(html)

Related

Regex-rule for matching single word during input (TipTap InputRule)

I'm currently experimenting with TipTap, an editor framework.
My goal is to build a Custom Node extension for TipTap that wraps a single word in <w>-Tags whenever a user is typing text. In TipTap I can write an InputRule with Regex for this purpose.
For example the rule /(?:^|\s)((?:~)((?:[^~]+))(?:~))$/ will match text between two tildes (~text~) and wrap it with <strike>-Tags.
Click here for my Codesandbox
I was trying for so long and can't figure it out. Here are the rules that I tried:
/**
* Regex that matches a word node during input
*/
// Will match words between two tilde characters; I'm using this expression from the documentation as my starting point.
//const inputRegex = /(?:^|\s)((?:~)((?:[^~]+))(?:~))$/
// Will match a word but will append the following text to that word without the space inbetween
//const inputRegex = /\b\w+\b\s$/
// Will match a word but will append the following text to previous word without the space inbetween; Will work with double spaces
//const inputRegex = /(?:^|\s\b)(?:[^\s])(\w+\b)(?:\s)$/
// Will match a word but will swallow every second character
//const inputRegex = /\b([^\s]+)\b$/g
// Will match every second word
//const inputRegex = /\b([^\s]+)\b\s(?:\s)$/
// Will match every word but swallow spaces; Will work if I insert double spaces
const inputRegex = /\b([^\s]+)(?:\b)\s$/
The problem here is the choice of delimiter, which is space.
This becomes clear when we see the code for markInputRule.ts (line 37 to be precise)
if (captureGroup) {
const startSpaces = fullMatch.search(/\S/)
const textStart = range.from + fullMatch.indexOf(captureGroup)
const textEnd = textStart + captureGroup.length
const excludedMarks = getMarksBetween(range.from, range.to, state.doc)
When we are using '~' as delimiters, the input rule tries to place the markers for start and end, without the delimiters and provide the enclosed-text to the extension tag (CustomItalic, in your case). You can clearly test this when entering strike-through text with enclosing '~', in which case the '~' are extracted out and the text is put inside the strike-through tag.
This is exactly the cause of your double-space problem: when your match is a word plus a space, the space is treated like a delimiter and removed, and only the captured text is entered into the tag.
I have tried to work around this using negative look-ahead patterns, but the problem remains in the code of the file mentioned above.
What I would suggest here is to copy the code in markInputRule.ts and make a custom InputRule as per your requirements, which would be way easier than working with the in-built one. Hope this helps.
I assume the problem lies within the "space". Depending on the browser, the final "space" is either not represented at all in the underlying html (Firefox) or replaced with &nbsp; (e.g. Chrome).
I suggest you replace the \s with (\s|&nbsp;) in your regex.

Remove everything after constant using regex

I've got XML that has additional information, BLAH, in each tag. When creating the tags, I've separated the extra info from the tag name with a constant (XMLSPLIT as constant XML_SPLITTER)... I needed to do this because I'm generating my XML from a JSON object and I can't have multiple keys that are the same thing... but in the XML output, can't have that superfluous stuff.
For example:
....
<SetXMLSPLITBLAH>
<Value>9</Value>
<SetType>
<Name>Foo</Name>
</SetType>
</SetXMLSPLITBLAH>
...
So, after generating the XML, I go through and clean it. I'm trying to do it with a regex. I figure, I want to remove anything on a line after the splitter and replace it with just the >.
let reg = new RegExp("<Set"+XML_SPLITTER+"(.*)\/g");
cleanXML = dirtyXML.replace(reg, "<Set>")
This fails to work.
I will note that I tried reg = /<Set(.*)/g; and that worked just fine... but it also captures "SetType" and any other tag that starts with "<Set".
It's because ^ is a Regex special character that indicates "beginning of line". You'd need to escape it like \^ for this to work. Something like /<Set\^\^[^>]*>/g should do the trick.
Small note: The above regex assumes that the "BLAH" string in your example will never contain the > character... but if it does, then your XML is super malformed anyway.
Using .* will match > and if - for some reason - your XML file is not broken up into multiple lines (i.e. minified), you'll match more than you should. To avoid this, you can use [^>]* to match everything up to the >.
Since you've gracefully included a splitter, it'll make matching much easier and much more predictable (as you mentioned, you match SetType without a splitter).
Without a splitter, you'd have to use a regex pattern that resembles <Set(?!Type>)[^>]* or <Set(?!(?:Type|SomethingElse)>)[^>]* if you had more than just one suffix to Set that should remain. These methods use a negative lookahead to assert what follows does not match.
var str = `<SetXMLSPLITBLAH>
<Value>9</Value>
<SetType>
<Name>Foo</Name>
</SetType>
</SetXMLSPLITBLAH>`
var XML_SPLITTER = 'XMLSPLIT'
var p = `(</?)Set${XML_SPLITTER}[^>]*`
var r = new RegExp(p,'g')
var x = str.replace(r,'$1Set')
console.log(x)
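For the case without a splitter, the negative-lookahead pattern mentioned above might be used as in this sketch. The input string is an assumed variant of the example (extra info directly after "Set", minified onto one line), not the asker's real data.

```javascript
// Assumed sample input: no splitter, and everything on one line (minified).
var dirty = '<SetBLAH><Value>9</Value><SetType><Name>Foo</Name></SetType></SetBLAH>';

// (<\/?)     captures "<" or "</" so opening and closing tags are both handled
// (?!Type>)  refuses to match when "Set" is directly followed by "Type>"
// [^>]*      consumes the rest of the tag name without crossing the ">"
var clean = dirty.replace(/(<\/?)Set(?!Type>)[^>]*/g, '$1Set');
console.log(clean);
```

Since [^>]* cannot run past the closing >, the replacement stays inside each tag even when the whole document sits on a single line.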

comparing and replacing using regex in javascript : leaving a word in between

I am trying to replace a pattern as below:
Original :
<a href="#idvalue">welcome</a>
Need to be replaced as :
<a href="javascript:call('idvalue')">welcome</a>
Tried the below approach:
String text = "<a href=\"#idvalue\">welcome</a>";
Pattern linkPattern = Pattern.compile("a href=\"#");
text = linkPattern.matcher(text).replaceAll("a href=\"javascript:call()\"");
But not able to add the idvalue in between. Kindly help me out.
Thanks in advance.
How about a simple
text.replaceAll("#idvalue","javascript:call('idvalue')")
for this case only. If you are looking to do something more comprehensive, then as suggested in the other answer, an XML parser would be ideal.
Try getting the part that might change and you want to keep as a group, e.g. like this:
text = text.replaceAll( "href=\"#(.*?)\"", "href=\"javascript:call('$1')\"" );
This basically matches href="#whatever" and replaces it, with whatever being caught by capturing group 1 and reinserted in the replacement string by using $1 as a reference to the content of group 1.
Note that applying regex to HTML and Javascript might be tricky (single or double quotes allowed, comments, nested elements etc.) so it might be better to use a html parser instead.
Add a capture group to the matcher regex and then reference the group in the replacement. I found in the JavaDoc for Matcher that you need to use '$' instead of '\' to access the capture group in the replacement.
Code:
String text = "<a href=\"#idvalue\">welcome</a>";
System.out.println("input: " + text);
Pattern linkPattern = Pattern.compile("a href=\"#([^\"]+)\"");
text = linkPattern.matcher(text).replaceAll("a href=\"javascript:call('$1')\"");
System.out.println("output: " +text);
Result:
input: <a href="#idvalue">welcome</a>
output: <a href="javascript:call('idvalue')">welcome</a>

Match string with regex

I need help with a Regex in JavaScript (for a Photoshop script) to match bold tags around words in a string. (not worried about italic or bolditalic at this time).
I don't want to split the string at this stage, I just want to chop it up into alternating chunks using match.
// Be <b>bold!</b> Be fabulous!
Should get match to // ("Be ", "bold!", "Be fabulous!") // line commented for obvious reasons
After that, I'll remove the bold tags - unless Regex can do that in one pass - don't underestimate its power!
This is what I have so far
/(.*?)([<b>]+[\S]+[<\/b>]+[\s]+)+(.*)/g
Only it doesn't match everything as seen here
Just for the record, before anyone suggests a much easier JS solution:
In the Photoshop DOM you can't script regular text mixed with bold. You probably can with Action Manager code, but with generating text that could be a big headache.
To get around this (not an ideal solution) I'll be using regular text & splitting it up at the appropriate places & swapping to bold.
[<b>] is a character class (it matches a single <, b or >); simply use <b> instead.
/(.*?)(<b>+\S+<\/b>+\s+)+(.*)/g
and change \S to [^<]
/(.*?)(<b>+[^<]+<\/b>+\s+)+(.*)/g
You can try:
<b>(.*?)<\/b>
Here is online demo
sample code:
var re = /<b>(.*?)<\/b>/gi;
var str = 'Be <b>bold!</b> Be fabulous! ';
var subst = '$1';
var result = str.replace(re, subst);
output:
Be bold! Be fabulous!
Better try with String.split() function:
var re = /\s*<\/?b>\s*/gi;
var str = 'Be <b>bold!</b> Be fabulous!';
console.log(str.split(re));
output:
["Be", "bold!", "Be fabulous!"]
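If you also need to know which chunks were bold, one option is to put a capture group in the separator: split() then includes the captured text in the result. This is a sketch for a standard JavaScript engine. Whether Photoshop's scripting engine handles capture groups in split() the same way is an assumption you would need to verify, and the chunks array shape is invented for the example.

```javascript
var str = 'Be <b>bold!</b> Be fabulous!';

// With a capturing group in the separator, split() returns the text between
// matches AND the captured text, so regular and bold chunks alternate.
var parts = str.split(/<b>(.*?)<\/b>/);

// Odd indexes hold the captured (bold) text, even indexes the regular text.
var chunks = [];
for (var i = 0; i < parts.length; i++) {
  chunks.push({ text: parts[i], bold: i % 2 === 1 });
}
console.log(parts);
```

Unlike the \s* version above, this keeps the surrounding whitespace in the regular chunks, which matters if the chunks are later written back next to each other.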

Replace words of text area

I have made a javascript function to replace some words with other words in a text area, but it doesn't work. I have made this:
function wordCheck() {
var text = document.getElementById("eC").value;
var newText = text.replace(/hello/g, '<b>hello</b>');
document.getElementById("eC").innerText = newText;
}
When I alert the variable newText, the console says that the variable doesn't exist.
Can anyone help me?
Edit:
Now it replaces the words, but it replaces them with the literal text <b>hello</b>, and I want the word to be shown in bold. Is there a solution?
Update:
In response to your edit, about your wanting to see the word "hello" show up in bold. The short answer to that is: it can't be done. Not in a simple textarea, at least. You're probably looking for something more like an online WYSIWYG editor, or at least an RTE (Rich Text Editor). There are a couple of them out there, like tinyMCE, for example, which is a decent WYSIWYG editor. A list of RTEs and HTML editors can be found here.
First off: As others have already pointed out: a textarea element's contents are available through its value property, not innerText. You get the contents alright, but you're trying to update them through the wrong property: use value in both cases.
If you want to replace all occurrences of a string/word/substring, you'll have to resort to using a regular expression, using the g modifier. I'd also recommend making the matching case-insensitive, to replace "hello", "Hello" and "HELLO" all the same:
var txtArea = document.querySelector('#eC');
txtArea.value = txtArea.value.replace(/(hello)/gi, '<b>$1</b>');
As you can see: I captured the match, and used it in the replacement string, to preserve the caps the user might have used.
But wait, there's more:
What if, for some reason, the input already contains <b>Hello</b>, or contains a word containing the string "hello" like "The company is called hellonearth?" Enter conditional matches (aka lookaround assertions) and word boundaries:
txtArea.value = txtArea.value.replace(/(?<!>)\b(hello)\b(?!<)/gi, '<b>$1</b>');
How it works:
(?<!>): Only match the rest if it isn't preceded by a > char (be more specific if you want to, and use (?<!<b>)). This is called a negative look-behind; it needs an engine with look-behind support (ES2018 or later)
\b: a word boundary, to make sure we're not matching part of a word
(hello): match and capture the string literal, provided (as explained above) it is not preceded by a > and there is a word boundary
(?!<): a negative look-ahead: only match if what follows is not a < char, because we don't want to find a closing </b>, so you can replace this with the more specific (?!<\/b>)
/gi: modifiers, or flags, that affect the entire pattern: g for global (meaning this pattern will be applied to the entire string, not just a single match). The i tells the regex engine the pattern is case-insensitive, ie: h matches both the upper and lowercase character.
The replacement string <b>$1</b>: when the replacement string contains $n substrings, where n is a number, they are treated as backreferences. A regex can group matches into various parts, each group has a number, starting with 1, depending on how many groups you have. We're only grouping one part of the pattern, but suppose we wrote:
'foobar hello foobar'.replace(/(hel)(lo)/g, '<b>$1-$2</b>');
The output would be "foobar <b>hel-lo</b> foobar", because we've split the match up into 2 parts, and added a dash in the replacement string.
I think I'll leave the introduction to RegExp at that... even though we've only scratched the surface, I think it's quite clear now just how powerful regex's can be. Put some time and effort into learning more about this fantastic tool, it is well worth it.
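To try the word-boundary and lookaround idea outside the browser, here is a small sketch on a plain string. The helper name boldHello is made up, and it assumes an engine with look-behind support (ES2018 or later), since checking the character before a match needs a look-behind such as (?<!>).

```javascript
// Hypothetical helper: wraps the whole word "hello" in <b> tags.
// (?<!>)  skip matches directly preceded by ">" (already inside a tag)
// \b      word boundaries, so "hellonearth" is left alone
// (?!<)   skip matches directly followed by "<"
function boldHello(text) {
  return text.replace(/(?<!>)\b(hello)\b(?!<)/gi, '<b>$1</b>');
}

var once = boldHello('Hello hellonearth <b>hello</b>');
console.log(once);

// Running it again should not double-wrap anything.
var twice = boldHello(once);
console.log(twice === once);
```

The capture group keeps the user's capitalisation, and the lookarounds make the replacement safe to run repeatedly on the same text.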
If <textarea>, then you need to use .value property.
document.getElementById("eC").value = newText;
And, as Barmar mentioned, replace() with a string pattern replaces only the first occurrence. To replace all occurrences, you need to use a simple regex. Note that I removed the quotes. /g means global replace.
var newText = text.replace(/hello/g, '<b>hello</b>');
But if you want to really bold your text, you need to use content editable div, not text area:
<div id="eC" contenteditable></div>
So then you need to access innerHTML:
function wordCheck() {
var text = document.getElementById("eC").innerHTML;
var newText = text.replace(/hello/g, '<b>hello</b>');
newText = newText.replace(/<b><b>/g,"<b>");//These two lines are there to prevent <b><b>hello</b></b>
newText = newText.replace(/<\/b><\/b>/g,"</b>");
document.getElementById("eC").innerHTML = newText;
}
