Extract text containing match between new line characters

I am trying to extract paragraphs from OCR'd contracts using JavaScript when a paragraph contains key search terms. A user might search for something such as "ship ahead" to find clauses about whether a certain customer's orders can be shipped early.
I've been banging my head up against a regex wall for quite some time and am clearly just not grasping something.
If I have text like this and I'm searching for the word "match":
let text = "\n\nThis is an example of a paragraph that has the word I'm looking for The word is Match. \n\nThis paragraph does not have the word I want."
I would want to extract all the text between the double \n characters and not return the second paragraph in that string.
I've been trying some form of:
let string = `[^\n\n]*match[^.]*\n\n`;
let re = new RegExp(string, "gi");
let body = text.match(re);
However that returns null. Oddly if I remove the periods from the string it works (sorta):
[
"This is an example of a paragraph that has the word I'm looking for The word is Match \n" +
'\n'
]
Any help would be awesome.

Extracting text between identical delimiters only when it contains some specific text is not really possible with a single regex without context-matching hacks.
Instead, you can simply split the text into paragraphs and keep those containing your match:
const results = text.split(/\n{2,}/).filter(x=>/\bmatch\b/i.test(x))
You may remove word boundaries if you do not need a whole word match.
See the JavaScript demo:
let text = "\n\nThis is an example of a paragraph that has the word I'm looking for The word is Match. \n\nThis paragraph does not have the word I want.";
console.log(text.split(/\n{2,}/).filter(x=>/\bmatch\b/i.test(x)));
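The split-and-filter idea can be wrapped in a small helper that takes the user's search phrase. This is only a minimal sketch: the names escapeRegExp and findParagraphs are hypothetical, and it assumes that paragraphs are separated by two or more newlines and that the phrase should be matched literally and case-insensitively.

```javascript
// Escape characters that have a special meaning in a regular expression,
// so that a user-supplied phrase is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return every paragraph (text separated by 2+ newlines) that contains `term`.
function findParagraphs(text, term) {
  const re = new RegExp(escapeRegExp(term), "i");
  return text
    .split(/\n{2,}/)
    .map((p) => p.trim())
    .filter((p) => p !== "" && re.test(p));
}

const sample =
  "\n\nThis is an example of a paragraph that has the word I'm looking for The word is Match. \n\nThis paragraph does not have the word I want.";
console.log(findParagraphs(sample, "match"));
```

Escaping matters here because a search phrase typed by a user, such as "net 30 (days)", would otherwise be interpreted as regex syntax.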

That's pretty easy if you use the fact that a . matches all characters except newline by default. Use the regex /.*match.*/ with a greedy .* on both sides (add the i flag if "Match" should be found as well):
const text = 'aaaa\n\nbbb match ccc\n\nddd';
const regex = /.*match.*/;
console.log(text.match(regex).toString());
Output:
bbb match ccc
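Applied to the text from the question, the same pattern needs the i flag (the paragraph says "Match", capitalised) and the g flag if more than one paragraph may qualify. A small sketch, assuming each paragraph sits on a single line with no newline inside it; the variable names are illustrative:

```javascript
const contractText =
  "\n\nThis is an example of a paragraph that has the word I'm looking for The word is Match. \n\nThis paragraph does not have the word I want.";

// `.` stops at newlines, so each match is the whole line around "match".
// `g` returns every matching line; `i` ignores case.
// match() returns null when nothing is found, hence the fallback to [].
const matchingLines = contractText.match(/.*match.*/gi) ?? [];
console.log(matchingLines);
```

If a paragraph can contain single line breaks, this approach returns only the line holding the term, so splitting on blank lines is the safer choice in that case.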

Here are two ways to do it. I am not sure why you need to use a regular expression. Split seems much easier, doesn't it?
const text = "\n\nThis is an example of a paragraph that has the word I'm looking for The word is Match. \n\nThis paragraph does not have the word I want."
// regular expression one
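function getTextBetweenLinesUsingRegex(text) {
  // [^\n]+ captures one run of non-newline characters between the blank lines.
  const regex = /\n\n([^\n]+)\n\n/;
  const arr = regex.exec(text);
  // exec() returns null when nothing matches, so check before indexing.
  if (arr !== null) {
    return arr[1];
  }
  return null;
}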
console.log(`getTextBetweenLinesUsingRegex: ${ getTextBetweenLinesUsingRegex(text)}`);
console.log(`simple: ${text.split('\n\n')[1]}`);

Related

Search for full word instead of part inside of it

I'm trying to find an exact word in the text a user sends, but when I use message.content.includes(), it also matches parts of words, which I don't want. Is there any way to search for whole words only?
A few examples: TexT, HeLlO, etc.

Yes, you can use a regex like this, assuming wordToFind holds the word you are searching for:
// Create regex from word with \b at each end which
// means "word boundary", and the 'i' option means
// case-insensitive
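// The backslashes are doubled because "\b" inside a string or
// template literal is a backspace character, not a word boundary
const wordSearch = new RegExp(`\\b${wordToFind}\\b`, 'i');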
// Use the regular expression to test the content:
const hasWord = wordSearch.test(message.content);
// will be true if whole word is found
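If the word comes from user input, it may contain characters such as . or + that are special inside a regular expression. Below is a sketch of one way to guard against that; escapeRegExp and hasWholeWord are hypothetical helper names, not part of any library.

```javascript
// Escape regex metacharacters so the word is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True when `word` occurs in `content` as a whole word, ignoring case.
function hasWholeWord(content, word) {
  // "\\b" in the template literal becomes \b (word boundary) in the pattern.
  const re = new RegExp(`\\b${escapeRegExp(word)}\\b`, "i");
  return re.test(content);
}

console.log(hasWholeWord("Say HeLlO to everyone", "hello")); // true
console.log(hasWholeWord("Othello is a play", "hello")); // false
```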

Put your content into an array, splitting it into words if it's a piece of text. Array.includes will then match whole words only (the comparison is exact, so it is case-sensitive).
So either use [content].includes(word) if content is a single word, or content.split(' ').includes(word) if content is multiple words.
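Because includes compares exactly, both case and attached punctuation matter with the array approach. A minimal sketch that normalises both first; containsWord is a hypothetical name, and splitting on non-word characters is an assumption about what counts as a word:

```javascript
function containsWord(content, word) {
  // Lowercase everything, split on runs of non-word characters,
  // and drop the empty strings that splitting can leave at the ends.
  const words = content.toLowerCase().split(/\W+/).filter(Boolean);
  return words.includes(word.toLowerCase());
}

console.log(containsWord("Well, HeLlO there!", "hello")); // true
console.log(containsWord("Othello", "hello")); // false
```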

Regex to match string in a sentence

I am trying to find a strictly declared string in a sentence, the thread says:
Find the position of the string "ten" within a sentence, without using the exact string directly (this can be avoided in many ways using just a bit of RegEx). Print as many spaces as there were characters in the original sentence before the aforementioned string appeared, and then the string itself in lowercase.
I've gotten this far:
let words = 'A ton of tunas weighs more than ten kilograms.'
function findTheNumber(){
let regex=/t[a-z]*en/gi;
let output = words.match(regex)
console.log(words)
console.log(output)
}
console.log(findTheNumber())
The result should be:
input = A ton of tunas weighs more than ten kilograms.
output = ten(ENTER)

You could try a regex replacement approach, with the help of a callback function:
var input = "A ton of tunas weighs more than ten kilograms.";
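// With no capture groups in the pattern, the callback's first argument is the matched word.
var output = input.replace(/\w+/g, function (match) {
  if (!match.match(/^[t][e][n]$/)) {
    // Not the target word: blank it out, one space per character.
    return match.replace(/\w/g, " ");
  }
  return match;
});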
console.log(input);
console.log(output);
The above logic matches every word in the input sentence, and then selectively replaces every word which is not ten with an equal number of spaces.

You can use
let text = 'A ton of tunas weighs more than ten kilograms.'
function findTheNumber(words){
console.log( words.replace(/\b(t[e]n)\b|[^.]/g, (x,y) => y ?? " ") )
}
findTheNumber(text)
The \b(t[e]n)\b is basically ten whole word searching pattern.
The \b(t[e]n)\b|[^.] regex will match and capture ten into Group 1 and will match any char but . (as you need to keep it at the end). If Group 1 matches, it is kept (ten remains in the output), else the char matched is replaced with a space.
Depending on what chars you want to keep, you may adjust the [^.] pattern. For example, to keep all non-word chars and blank out only word chars, use \w in place of [^.].
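The task statement can also be read literally: count the characters before the match, print that many spaces, then the match in lowercase. A sketch of that reading, using the match index; indentMatch is a hypothetical name, and returning an empty string when nothing matches is an assumption:

```javascript
const sentence = "A ton of tunas weighs more than ten kilograms.";

function indentMatch(input) {
  // t[e]n avoids writing the literal word; \b keeps it a whole word.
  const m = /\bt[e]n\b/i.exec(input);
  if (m === null) return "";
  // m.index is the number of characters before the match.
  return " ".repeat(m.index) + m[0].toLowerCase();
}

console.log(sentence);
console.log(indentMatch(sentence));
```

Unlike the replace-based answers, this version drops everything after the match, including the trailing period.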

Regex expression that matches a specific pattern but not matches another pattern

I need to put italic tags around words that are wrapped in underscores, like the formatting options here on Stack Overflow.
I can easily do this with the regular expression /_(.*?)_/gi. But I don't want to put those tags inside email addresses, URLs, etc. So I need a regex that matches the italic pattern but does not match inside a URL or email pattern.
let urlExp = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gi;
let boldExp = /\*(.*?)\*/gi;
let italicExp = /\_(.*?)\_/gi;
let bulletedExp = /^\-\s(.*)/gm;
let emailExp = /([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6})/gi;
let modifiedText = text
.replace(urlExp, "<a href='$1' target='_blank'>$1</a>")
.replace(emailExp, '$1')
.replace(boldExp, "<strong>$1</strong>")
.replace(italicExp, "<i>$1</i>")
.replace(bulletedExp, "• $1");
return modifiedText;
Here is the code that I am working on. The issue is that the bold, italic and bullet patterns are also applied inside URLs and emails; I need to skip those two things.

You can use word boundaries (\b) to determine that we're at the end of a word or words with the underscores, not in the middle.
See regex101 for the example, using this regex:
\b_(.*?)_\b

Just check if there are spaces or beginning or end of string:
let italicExp = /(^|\s)_(.*?)_(\s|$)/gi;
let modifiedText = text.replace(italicExp, "$1<i>$2</i>$3")
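Another way to keep URLs and emails untouched is to match them in the same regex as the italic pattern and return them unchanged from a replacement callback. The sketch below uses deliberately simplified URL and email patterns (an assumption, not the ones from the question), and italicize is a hypothetical name:

```javascript
// Alternatives are tried left to right at each position:
// 1) a URL, 2) an e-mail address, 3) _italic text_ (captured in group 1).
const tokenExp = /https?:\/\/\S+|[\w.%+-]+@[\w.-]+\.[a-z]{2,}|_(.+?)_/gi;

function italicize(text) {
  return text.replace(tokenExp, (match, italic) =>
    // `italic` is undefined unless the third alternative matched,
    // so URLs and e-mails are returned exactly as they were.
    italic !== undefined ? `<i>${italic}</i>` : match
  );
}

console.log(
  italicize("See _this_ at http://a.com/my_file_name or mail john_doe_x@mail.com")
);
```

Since a URL or email is consumed as one match, the underscores inside it are never seen by the italic alternative.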

regex lookbehind in javascript

I am trying to match some words in text.
working example (what i want) regex101:
regex = /(?<![a-z])word/g
text = word 1word !word aword
Only the first three words are matched, which is what I want to achieve.
But lookbehind does not work in JavaScript :(
So now I'm trying this (regex101):
regex = /(\b|\B)word/g
text = word 1word !word aword
But then all the words match, whereas the word may not be preceded by another letter, only by a digit or a special character.
If I use only \b, the 1word won't match, and if I use only \B, the !word won't match.
Edit
The output should be ["word","word","word"]
The 1 and ! must not be included in the match, nor in another group, because I want to use it with JavaScript's .replace(regex, function(match){}), which should not loop over the 1 and !.
The code i use it for
for(var i = 0; i < elements.length; i++){
text = elements[i].innerHTML;
textnew = text.replace(regexp,function(match){
matched = getCrosslink(match)[0];
return "<a href='"+matched.url+"'>"+match+"</a>";
});
elements[i].innerHTML = textnew;
}

Capturing the leading character
It's difficult to know exactly what you want without seeing more output examples, but what about matching the word either at a word boundary or after a non-letter? Like this, for example:
(\bword|[^a-zA-Z]word)
Output: ['word', '1word', '!word']
Here is a working example
Capturing only the "word"
If you only want the "word" part to be captured you can use the following and fetch the 2nd capture group:
(\b|[^a-zA-Z])(word)
Output: ['word', 'word', 'word']
Here is a working example
With replace()
You can use specific capture groups when defining the replace value, so this will work for you (where "new" is the word you want to use):
var regex = /(\b|[^a-zA-Z])(word)/g;
var text = "word 1word !word aword";
text = text.replace(regex, "$1" + "new");
output: "new 1new !new aword"
Here is a working example
If you are using a dedicated function in replace, try this:
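textnew = text.replace(regexp, function (allMatch, match1, match2) {
  matched = getCrosslink(match2)[0];
  // Put match1 (the leading character, possibly empty) back in front of the link.
  return match1 + "<a href='" + matched.url + "'>" + match2 + "</a>";
});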
Here is a working example

You can use the following regex
([^a-zA-Z]|\b)(word)
Simply use replace like this:
var str = "word 1word !word aword";
str.replace(/([^a-zA-Z]|\b)(word)/g,"$1"+"<a>$2</a>");
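Lookbehind assertions were added to JavaScript in ES2018, so in engines that implement them the pattern from the question can be used as written, with no extra capture group. A sketch assuming such an engine; the href value is a placeholder standing in for whatever getCrosslink would return:

```javascript
const lookbehindExp = /(?<![a-z])word/g;
const input = "word 1word !word aword";

// Only the word itself is matched; the preceding 1 or ! is never part of it.
console.log(input.match(lookbehindExp)); // ["word", "word", "word"]

// The callback therefore receives just "word" each time.
const linked = input.replace(
  lookbehindExp,
  (match) => `<a href='#'>${match}</a>`
);
console.log(linked);
```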

Replace words of text area

I have made a javascript function to replace some words with other words in a text area, but it doesn't work. I have made this:
function wordCheck() {
var text = document.getElementById("eC").value;
var newText = text.replace(/hello/g, '<b>hello</b>');
document.getElementById("eC").innerText = newText;
}
When I alert the variable newText, the console says that the variable doesn't exist.
Can anyone help me?
Edit:
Now it replaces the words, but it shows the literal text <b>hello</b>, whereas I want the word to appear in bold. Is there a solution?

Update:
In response to your edit, about wanting to see the word "hello" show up in bold: the short answer is that it can't be done. Not in a simple textarea, at least. You're probably looking for something more like an online WYSIWYG editor, or at least an RTE (Rich Text Editor). There are a couple of them out there, like tinyMCE, for example, which is a decent WYSIWYG editor. A list of RTEs and HTML editors can be found here.
First off: As others have already pointed out: a textarea element's contents is available through its value property, not the innerText. You get the contents alright, but you're trying to update it through the wrong property: use value in both cases.
If you want to replace all occurrences of a string/word/substring, you'll have to resort to using a regular expression, using the g modifier. I'd also recommend making the matching case-insensitive, to replace "hello", "Hello" and "HELLO" all the same:
var txtArea = document.querySelector('#eC');
txtArea.value = txtArea.value.replace(/(hello)/gi, '<b>$1</b>');
As you can see: I captured the match, and used it in the replacement string, to preserve the caps the user might have used.
But wait, there's more:
What if, for some reason, the input already contains <b>Hello</b>, or contains a word containing the string "hello" like "The company is called hellonearth?" Enter conditional matches (aka lookaround assertions) and word boundaries:
txtArea.value = txtArea.value.replace(/(?<!>)\b(hello)\b(?!<)/gi, '<b>$1</b>');
How it works:
(?<!>): only match the rest if it isn't preceded by a > char (be more specific if you want to, and use (?<!<b>)). This is called a negative lookbehind; it needs an engine with ES2018 lookbehind support
\b: a word boundary, to make sure we're not matching part of a word
(hello): match and capture the string literal, provided (as explained above) it is not preceded by a > and there is a word boundary
(?!<): a negative lookahead: the match must not be followed by a <, so a word that already has a closing </b> after it is skipped; you can replace this with the more specific (?!<\/b>)
/gi: modifiers, or flags, that affect the entire pattern: g for global (meaning this pattern will be applied to the entire string, not just a single match). The i tells the regex engine the pattern is case-insensitive, ie: h matches both the upper and lowercase character.
The replacement string <b>$1</b>: when the replacement string contains $n substrings, where n is a number, they are treated as backreferences. A regex can group matches into various parts, each group has a number, starting with 1, depending on how many groups you have. We're only grouping one part of the pattern, but suppose we wrote:
'foobar hello foobar'.replace(/(hel)(lo)/g, '<b>$1-$2</b>');
The output would be "foobar <b>hel-lo</b> foobar", because we've split the match up into 2 parts, and added a dash in the replacement string.
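To see the pieces working together outside the DOM, here is a small sketch on plain strings. boldHello is a hypothetical name; the pattern uses a negative lookbehind, (?<!>), to test the character before the match, which assumes an engine with ES2018 lookbehind support.

```javascript
function boldHello(s) {
  // (?<!>) and (?!<) skip occurrences that already sit inside a tag pair;
  // \b keeps "hellonearth" from matching; $1 preserves the original case.
  return s.replace(/(?<!>)\b(hello)\b(?!<)/gi, "<b>$1</b>");
}

const once = boldHello("Hello, hello and <b>HELLO</b> from hellonearth");
console.log(once);

// Running it again leaves the text unchanged, so nothing is wrapped twice.
console.log(boldHello(once) === once);
```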
I think I'll leave the introduction to RegExp at that... even though we've only scratched the surface, I think it's quite clear now just how powerful regex's can be. Put some time and effort into learning more about this fantastic tool, it is well worth it.

If it is a <textarea>, you need to use the .value property.
document.getElementById("eC").value = newText;
And, as Barmar mentioned, replace() with a string pattern replaces only the first occurrence. To replace every occurrence, you need a simple regex. Note that I removed the quotes; the /g flag means global replace.
var newText = text.replace(/hello/g, '<b>hello</b>');
But if you want to really bold your text, you need to use content editable div, not text area:
<div id="eC" contenteditable></div>
So then you need to access innerHTML:
function wordCheck() {
var text = document.getElementById("eC").innerHTML;
var newText = text.replace(/hello/g, '<b>hello</b>');
newText = newText.replace(/<b><b>/g,"<b>");//These two lines are there to prevent <b><b>hello</b></b>
newText = newText.replace(/<\/b><\/b>/g,"</b>");
document.getElementById("eC").innerHTML = newText;
}

We Keep Coding

JavaScript is the programming language of the Web.
