Saving words from text into an array - javascript

I am trying to do an interesting task and currently have no idea how to do it.
I have a wiki page (ex: https://en.wikipedia.org/wiki/Moldova ) and I want to save each word from this page into an array. Further I will need to parse this array to extract some specific words.
Can someone give me a hint on how I can save words from a text into an array?
And how can I solve these problems:
-For each word, remove punctuation such as ,.()"' etc.
-If the word is an HTML tag, don't store it.
Thank you.

Use the split() method: it splits a string into an array of substrings and returns the new array. Read more about it here.
var text="your text";
var punctRE = /[\u2000-\u206F\u2E00-\u2E7F\\'!"#$%&()*+,\-.\/:;<=>?@\[\]^_`{|}~]/g;
text = text.replace(punctRE, ''); // Strip all punctuation; replace() returns a new string, so keep the result.
var myArray=text.split(" "); // Pass a single space as the separator.
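The answer above does not cover the HTML tag part of the question. A minimal sketch that handles both steps is below. The name extractWords is made up for illustration, and the tag-stripping regex is only a rough approximation (it is not an HTML parser), so treat it as a starting point rather than a definitive solution.

```javascript
// Hypothetical helper: turn an HTML string into an array of plain words.
function extractWords(html) {
  // Replace anything that looks like a tag with a space (rough; not a real HTML parser).
  var text = html.replace(/<[^>]*>/g, " ");
  // Strip ASCII punctuation plus the Unicode punctuation blocks used in the answer above.
  text = text.replace(/[\u2000-\u206F\u2E00-\u2E7F\\'!"#$%&()*+,\-.\/:;<=>?@\[\]^_`{|}~]/g, "");
  // Split on any run of whitespace and drop empty strings.
  return text.split(/\s+/).filter(function (word) { return word.length > 0; });
}

var words = extractWords('<p>Moldova, officially the "Republic of Moldova" (see <a href="#">map</a>).</p>');
console.log(words);
```

Splitting on /\s+/ instead of a single space avoids empty entries when the text contains line breaks or repeated spaces.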

Related

Find and Replace all occurrences of a phrase in a json string using capturing groups

I have a stringified JSON which looks like this:
...
"message":null,"elementId":["xyz1","l9ie","xyz1"]}}]}], "startIndex":"1",
"transitionTime":"3","sourceId":"xyz1","isLocked":false,"autoplay":false
,"mutevideo":false,"loopvideo":false,"soundonhover":false,"videoCntrlVisibility":0,
...,"elementId":["dgff","xyz1","jkh9"]}}]}]
... it goes on.
The part I need to work on is the value of the elementId key. (The 2nd key in the first line, and the last key).
This key is present in multiple places in the JSON string. The value of this key is an array containing 4-character ids.
I need to replace one of these ids with a new one.
The kernel of the idea is something like:
var elemId = 'xyz1' // for instance
var regex = new RegExp(elemId, 'g');
var newString = jsonString.replace(regex, newRandomId);
jsonString = newString;
There are a couple of problems with this approach. The regex will match the id anywhere in the JSON. I need a regex which only matches it inside the elementId array; and nowhere else.
I'm trying to use a capturing group to match just the occurrences I need, but I can't quite crack it. I have:
/.*elementId":\[".*(xyz1).*"\]}}]/
But this doesn't match the 1st occurrence of 'xyz1' in the array.
So, firstly, I need a regex which can match all the 'xyz1's inside elementId; but nowhere else. The sequence of square and curly brackets after elementId ends doesn't change anywhere in the string, if that helps.
Secondly, even if I have a capturing group that works, string.replace doesn't act as expected. Instead of replacing just the match inside the capturing group, it replaces the whole match.
So, my second requirement is replacing only the captured groups, not the whole match.
What I need is a piece of js code which will replace my 'xyz1's where needed and return the following string (assuming the newRandomId is 'abcd'):
"message":null,"elementId":["abcd","l9ie","abcd"]}}]}], "startIndex":"1",
"transitionTime":"3","sourceId":"xyz1","isLocked":false,"autoplay":false
,"mutevideo":false,"loopvideo":false,"soundonhover":false,"videoCntrlVisibility":0,
...,"elementId":["dgff","abcd","jkh9"]}}]}]
Note that the value of 'sourceId' is unaffected.
EDIT: I have to work with the JSON string. I can't parse it and work with the object, since I don't know all the places the old id might be in the object, and looping through it multiple times (for multiple elements) would be time-consuming.
Assuming you can't just parse and change the JS object, you could use 2 regexes: one to extract the array and another to change the desired ids inside it:
var output = input.replace(/("elementId"\s*:\s*\[)((?:".{4}",?)*)(\])/g, function(_,start,content,end){
return start + content.replace(/"xyz1"/g, '"rand"') + end;
});
The arguments _, start, content, end are produced as a result of the regex (documentation here):
_ is the whole matched string (from "elementId":[ to ]). I chose this name because it's an old convention for arguments you don't use
start is the first group ("elementId":[)
content is the second captured group, that is, the internal part of the array
end is the third group, ]
Using the groups instead of hardcoding the start and end parts in the returned string serves two purposes:
avoid duplication (DRY principle)
make it possible to have variable strings (for example, in my regex I accept optional spaces after the :)
var input = document.getElementById("input").innerHTML.trim();
var output = input.replace(/("elementId":\s*\[)((?:".{4}",?)*)(\])/g, function(_,start,content,end){
return start + content.replace(/"xyz1"/g, '"rand"') + end;
});
document.getElementById("output").innerHTML = output;
Input:
<pre id=input>
"message":null,"elementId":["xyz1","l9ie","xyz1"]}}]}], "startIndex":"1",
"transitionTime":"3","sourceId":"xyz1","isLocked":false,"autoplay":false
,"mutevideo":false,"loopvideo":false,"soundonhover":false,"videoCntrlVisibility":0,
...,"elementId":["dgff","xyz1","jkh9"]}}]}]
</pre>
Output:
<pre id=output>
</pre>
Notes:
It would be easy to do the whole operation in one regex if there weren't repetitions of the searched id within one array. But the present structure makes it easy to handle several ids to replace at once.
I use non-capturing groups (?:...) in order to unclutter the arguments passed to the external replacing callback.
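If the old and new ids are only known at runtime, the same two-regex idea can be wrapped in a function. This is just a sketch: replaceElementId and escapeRegExp are names invented here, and it assumes that no id contains a ] character, so the array body can be matched with [^\]]*.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: replace oldId with newId, but only inside "elementId" arrays.
function replaceElementId(jsonString, oldId, newId) {
  // Group 1: the key and opening bracket, group 2: the array body, group 3: the closing bracket.
  var arrayRE = /("elementId"\s*:\s*\[)([^\]]*)(\])/g;
  var idRE = new RegExp('"' + escapeRegExp(oldId) + '"', "g");
  return jsonString.replace(arrayRE, function (_, start, content, end) {
    // A replacer function keeps any "$" in newId from being read as a replacement pattern.
    return start + content.replace(idRE, function () { return '"' + newId + '"'; }) + end;
  });
}

var sample = '[{"elementId":["xyz1","l9ie","xyz1"],"sourceId":"xyz1"},{"elementId":["dgff","xyz1","jkh9"]}]';
var replaced = replaceElementId(sample, "xyz1", "abcd");
console.log(replaced);
```

The sourceId value is left alone because it sits outside any elementId array.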

Return an array of strings after filtering #tags

I was wondering if someone could provide me with a JavaScript RegEx statement to filter out the text out of all the #tags in the input box.
Scenario: I have a user input text box where users can enter multiple #tags. What I would like to do is have all the texts filtered out and stored in an array after removing the special characters and save it to the database by looping over the array.
Example: Input- #tag1, #tag2, #tag3...
Output: An array of [tag1, tag2, tag3...]
Thanks in advance..
Use simply:
/(#\w+)+/gmi
And in your list of regex matches, you'll have all of the tags in an array. This expression only supports letters, numbers and underscores - simply adjust the \w if you want to extend or restrict the set of characters.
Here's a regex101 to play around with: https://regex101.com/r/pJ8vA4/2
The javascript would look something like:
var string = '#tag1, #tag2, #tag3 some other stuff #tag4';
var tags = string.match(/(#\w+)+/gmi);
tags = tags.map(function(tag) { return tag.replace('#', '') }); // strip the leading '#' from each match
console.log(tags);
This is a place to ask questions, not for someone to write your code for you. But I will still answer.
(\s?#[a-zA-Z0-9]+,?)+
-separated by an optional white-space and optional commas
-make sure to trim the white space off the beginning and end of the returned values (while looping)
-also remove the hash tag (1st character after trimming)
-link for your example https://regex101.com/r/rT2aC5/1
edit: also does not include special characters. Let me know if you need a special modification and I will do it real quick for you :)
var input = "#tag1, #tag2, #tag3";
var regex = new RegExp(/#(\w+),?\s?/gi);
var match = null;
var results = [];
while (match = regex.exec(input)){
results.push(match[1]);
}
This will give you a results array that has: ["tag1", "tag2", "tag3"]
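For what it's worth, in environments that support String.prototype.matchAll (ES2020), the exec loop can be written more compactly. The function name extractTags is invented for this sketch, and it keeps the same \w+ definition of a tag as the answers above.

```javascript
// Hypothetical helper: collect tag names (without the "#") from free-form input.
function extractTags(input) {
  // matchAll requires the g flag and yields one match object per tag;
  // index 1 of each match is the capture group, i.e. the tag without "#".
  return Array.from(input.matchAll(/#(\w+)/g), function (m) { return m[1]; });
}

var tagList = extractTags("#tag1, #tag2, #tag3 some other stuff #tag4");
console.log(tagList);
```

When the input contains no tags, this returns an empty array rather than null, which makes the later loop over the array safe.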

Parse string regex for known keys but leave separator

Ok, So I hit a little bit of a snag trying to make a regex.
Essentially, I want a string like:
error=some=new item user=max dateFrom=2013-01-15T05:00:00.000Z dateTo=2013-01-16T05:00:00.000Z
to be parsed to read
error=some=new item
user=max
dateFrom=2013-01-15T05:00:00.000Z
dateTo=2013-01-16T05:00:00.000Z
So I want it to pull known keywords, and ignore other strings that have =.
My current regex looks like this:
(error|user|dateFrom|dateTo|timeFrom|timeTo|hang)\=[\w\s\f\-\:]+(?![(error|user|dateFrom|dateTo|timeFrom|timeTo|hang)\=])
So I'm using known keywords, supplied dynamically, so I can list them as being known.
How could I write it to include this requirement?
You could use a replace like so:
var input = "error=some=new item user=max dateFrom=2013-01-15T05:00:00.000Z dateTo=2013-01-16T05:00:00.000Z";
var result = input.replace(/\s*\b((?:error|user|dateFrom|dateTo|timeFrom|timeTo|hang)=)/g, "\n$1");
result = result.replace(/^\r?\n/, ""); // remove the first line
Result:
error=some=new item
user=max
dateFrom=2013-01-15T05:00:00.000Z
dateTo=2013-01-16T05:00:00.000Z
Another way to tokenize the string:
var tokens = inputString.split(/ (?=[^= ]+=)/);
The regex looks for a space that is followed by a sequence of non-space, non-equal-sign characters ending in =, and splits at those spaces.
Result:
["error=some=new item", "user=max", "dateFrom=2013-01-15T05:00:00.000Z", "dateTo=2013-01-16T05:00:00.000Z"]
Using the technique above and adapting the regex from your question:
var tokens = inputString.split(/(?=\b(?:error|user|dateFrom|dateTo|timeFrom|timeTo|hang)=)/);
This will correctly split the input that Qtax pointed out in the comments: "error=user=max foo=bar"
["error=", "user=max foo=bar"]
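Since the question mentions that the keyword list is dynamic, here is a sketch that builds the split regex from an array and turns the tokens into an object. KNOWN_KEYS and parseKnownKeys are names made up for this example, and it assumes the keys contain no regex metacharacters.

```javascript
// Assumed list of known keys; in practice this would be supplied dynamically.
var KNOWN_KEYS = ["error", "user", "dateFrom", "dateTo", "timeFrom", "timeTo", "hang"];

// Hypothetical helper: split on known "key=" prefixes and return a key/value object.
function parseKnownKeys(input) {
  // Consume optional whitespace, then look ahead for a known key followed by "=".
  var splitRE = new RegExp("\\s*(?=\\b(?:" + KNOWN_KEYS.join("|") + ")=)");
  var result = {};
  input.split(splitRE).forEach(function (token) {
    var eq = token.indexOf("=");
    if (eq === -1) { return; } // skip text that has no "key=" prefix
    // Everything before the first "=" is the key; the rest (which may contain "=") is the value.
    result[token.slice(0, eq)] = token.slice(eq + 1).trim();
  });
  return result;
}

var parsed = parseKnownKeys("error=some=new item user=max dateFrom=2013-01-15T05:00:00.000Z dateTo=2013-01-16T05:00:00.000Z");
console.log(parsed);
```

Putting \s* in front of the lookahead removes the separating space from each token, so the values do not end with a trailing blank.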

Extract text from HTML with Javascript regex

I am trying to parse a webpage and to get the number reference after <li>YM#. For example I need to get 1234-234234 in a variable from the HTML that contains
<li>YM# 1234-234234 </li>
Many thanks for your help someone!
Rich
Currently, your regex only matches if there is a single digit before the dash and a single digit after it. This will let you match one or more digits in each place instead (the \s* allows for the space after YM# in your sample):
/YM#\s*[0-9]+-[0-9]+/g
Then, you also need to capture it, so we use a capture group:
/YM#\s*([0-9]+-[0-9]+)/g
Then we need to refer to the capture group, so we use the following code instead of String.match:
var regex = /YM#\s*([0-9]+-[0-9]+)/g;
var match = regex.exec(text); // null if the text contains no reference
var id = match ? match[1] : null;
// match[0]: match of the entire regex
// after that, each of the groups gets a number
(?!<li>YM#\s)([\d-]+)
http://regexr.com?30ng5
This will match the numbers.
Try this:
(<li>[^#<>]*?# *)([\d\-]+)\b
and get the result in $2.
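Putting the capture-group approach from these answers into a small function might look like the sketch below. extractYmReference is a name invented here, and the pattern assumes the reference always has the digits-dash-digits shape shown in the question.

```javascript
// Hypothetical helper: return the reference that follows "<li>YM#", or null if there is none.
function extractYmReference(html) {
  // \s* tolerates optional whitespace after <li> and after YM#.
  var match = /<li>\s*YM#\s*(\d+-\d+)/.exec(html);
  return match ? match[1] : null;
}

var reference = extractYmReference("<ul><li>YM# 1234-234234 </li></ul>");
console.log(reference);
```

Checking for null before reading match[1] avoids a TypeError on pages that don't contain the item.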

regular expression (javascript) How to match anything beween two tags any number of times

I'm trying to find all occurrences of items in HTML page that are in between <nobr> and </nobr> tags.
EDIT:(nobr is an example. I need to find content between random strings, not always tags)
I tried this
var match = /<nobr>(.*?)<\/nobr>/img.exec(document.documentElement.innerHTML);
alert (match);
But it gives only one occurrence. + it appears twice, once with the <nobr></nobr> tags and once without them. I need only the version without the tags.
you need to do it in a loop
var match, re = /<nobr>(.*?)<\/nobr>/img;
while((match = re.exec(document.documentElement.innerHTML)) !== null){
alert(match[1]);
}
use the DOM
var nobrs = document.getElementsByTagName("nobr")
and you can then loop through all nobrs and extract the innerHTML or apply any other action on them.
(Since I can't comment on Rafael's correct answer...)
exec is doing what it is supposed to do - finding the first match, returning the result in the match object, and setting you up for the next exec call. The match object contains (at index 0) the whole of the string matched by the whole of the regex. In subsequent slots are the bits of the string matched by the parenthesized subgroups. So match[1] contains the bit of the string matched by "(.*?)" in your example.
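You can use the following. Keep the regex in a variable: a regex literal inside the loop condition creates a new object on every iteration, so lastIndex never advances and the loop never ends.
var match, re = /<nobr>(.*?)<\/nobr>/img;
while (match = re.exec("foo <nobr> hello </nobr> bar <nobr> world </nobr> foobar"))
alert (match[1]);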
If the strings you're using aren't xml elements, and you're sticking with regexes the return value you're getting can be explained by the bracketing. .exec returns the whole matching string followed by the contents of the bracketed expressions.
If your doc contains:
This is out.
Bzz. This is in. unBzz.
then
/Bzz.(.*?)unBzz./img.exec(document.documentElement.innerHTML)
Will give you 'Bzz. This is in. unBzz.' in element 0 of the returned array and 'This is in.' in element 1. Trying to display the whole array gives both as a comma separated list because that's what JavaScript does to try to display it.
So
alert(match[1]);
is what you're after.
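Because the question's EDIT says the delimiters are arbitrary strings and not always tags, a general version could escape the delimiters before building the regex. This is only a sketch: extractBetween and escapeRegExp are invented names, and it assumes both delimiters are non-empty.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: return every piece of text found between open and close.
function extractBetween(text, open, close) {
  // [\s\S]*? matches lazily across line breaks, unlike a plain dot.
  var re = new RegExp(escapeRegExp(open) + "([\\s\\S]*?)" + escapeRegExp(close), "gi");
  var results = [];
  var match;
  // The g flag makes exec resume from re.lastIndex on each call.
  while ((match = re.exec(text)) !== null) {
    results.push(match[1]); // group 1 only, without the delimiters
  }
  return results;
}

var found = extractBetween("foo <nobr> hello </nobr> bar <nobr> world </nobr> foobar", "<nobr>", "</nobr>");
console.log(found);
```

Escaping matters for delimiters such as "Bzz." where an unescaped dot would match any character.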
It takes two steps, but you could do it like this:
var match = document.documentElement.innerHTML.match(/<nobr>(.*?)<\/nobr>/img);
alert(match); // includes '<nobr>'
var match_length = match.length;
for (var i = 0; i < match_length; i++)
{
var match2 = match[i].match(/<nobr>(.*?)<\/nobr>/im); // same regex without the g option
alert(match2[1]);
}
