Partial String Match - Dynamic Strings - javascript

For certain dynamic strings like:
covid-19 testing status upto may 05,2021
covid-19 testing status upto may 04,2021
covid-19 testing status upto may 01,2021
....
covid-19 testing status upto {{date}}
and others like:
Jack and Jones are friends
Jack and JC are friends
Jack and Irani are friends
.....
Jack and {{friend-name}} are friends
I want to match the incoming string like:
covid-19 testing status upto may 01,2021
with
covid-19 testing status upto {{date}}
and if there is a match, I want to extract the value of date.
Similarly, for an incoming string like
Jack and JC are friends
I want to match with
Jack and {{friend-name}} are friends
and extract JC or the friend-name. How could I do this?
I am trying to create a setup where dynamic strings like these can be merged into one pattern. There could be thousands of incoming strings that I want to match against the existing patterns.
INCOMING_STRINGS -------EXISTING-PATTERNS----->
[
covid-19 testing status upto {{date}},
Jack and {{friend-name}} are friends,
....
] ---> FIND THE PATTERN AND EXTRACT THE DYNAMIC VALUE
EDIT
It is not guaranteed that the pattern will always exist in the incoming strings.

It's very easy to use regex for the second of your examples. If you use a capturing group for the "friend name" part you can extract that with ease:
const re = /Jack and ([a-zA-Z]+) are friends/
const inputs = ["Jack and Jones are friends",
"Jack and JC are friends",
"Jack and Irani are friends",
"Bob and John are friends"] // last one won't match
for(let i=0;i<inputs.length;i++){
const match = inputs[i].match(re);
if(match)
console.log("friend=",match[1]);
else
console.log("No match for the string:", inputs[i])
}
The first example is slightly harder, but only because the regex is more difficult to write. Assuming the format is always "short month name, 2-digit day, comma, 4-digit year", it is doable:
const re = /covid-19 testing status upto ((jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec) (0?[1-9]|[12][0-9]|3[01]),\d{4})/
const inputs = ["covid-19 testing status upto may 05,2021",
"covid-19 testing status upto may 04,2021",
"covid-19 testing status upto may 01,2021",
"covid-19 testing status upto 01/01/2020"] // wrong date format
for(let i=0;i<inputs.length;i++){
const match = inputs[i].match(re);
if(match)
console.log("date=",match[1]);
else
console.log("No match for the string:", inputs[i])
}
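Both patterns can also be kept in one list and tried in turn, which is closer to matching many incoming strings against a set of existing patterns. This is only a minimal sketch: the names `patterns` and `extract` are hypothetical, and the regexes are anchored with `^` and `$` on the assumption that the whole incoming string must match.

```javascript
// Hypothetical pattern table: one entry per known template.
const patterns = [
  { key: "friend-name", re: /^Jack and ([a-zA-Z]+) are friends$/ },
  {
    key: "date",
    re: /^covid-19 testing status upto ((?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec) (?:0?[1-9]|[12][0-9]|3[01]),\d{4})$/,
  },
];

// Returns { key, value } for the first pattern that matches, or null.
function extract(input) {
  for (const { key, re } of patterns) {
    const match = input.match(re);
    if (match) return { key, value: match[1] };
  }
  return null;
}

console.log(extract("Jack and JC are friends"));
console.log(extract("covid-19 testing status upto may 01,2021"));
console.log(extract("Bob and John are friends")); // no pattern matches
```

Returning null for unmatched input covers the case from the EDIT, where a pattern is not guaranteed to exist.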

It's fairly unclear to me what your actual inputs and outputs should be. Here's an attempt that guesses at that. With inputs like
[{
sample: 'covid-19 testing status upto may 05,2021',
extract: 'may 05,2021',
propName: 'date'
}, {
sample: 'Jack and Jones are friends',
extract: 'Jones',
propName: 'friend-name'
}]
we generate a function which can be used like this:
mySubs ('Jack and William are friends')
//=> {"friend-name": "William"}
or
mySubs ('covid-19 testing status upto apr 30,2021')
//=> {"date": "apr 30,2021"}
or
mySubs ('Jack and Jessica are friends who dicsussed covid-19 testing status upto apr 27,2021')
//=> {"date": "apr 27,2021", "friend-name": "Jessica"}
and which would yield an empty object if nothing matched.
We do this by dynamically generating regular expressions for our samples, ones which will capture the substitutions made:
const regEscape = (s) =>
s .replace (/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
const makeTester = ({sample, extract, propName}) => ({
regex: new RegExp (
regEscape (sample .slice (0, sample .indexOf (extract))) +
'(.+)' +
regEscape (sample .slice (sample .indexOf (extract) + extract .length))
),
propName
})
const substitutes = (configs, testers = configs .map (makeTester)) => (sentence) =>
Object.assign ({}, ...testers .map (({regex, propName}) => {
const match = sentence .match (regex)
return (match)
? {[propName]: match[1]}
: {}
}))
const configs = [{
sample: 'covid-19 testing status upto may 05,2021',
extract: 'may 05,2021',
propName: 'date'
}, {
sample: 'Jack and Jones are friends',
extract: 'Jones',
propName: 'friend-name'
}]
const mySubs = substitutes (configs)
console .log (mySubs ('Jack and William are friends'))
console .log (mySubs ('covid-19 testing status upto apr 30,2021'))
console .log (mySubs ('Jack and Jessica are friends who dicsussed covid-19 testing status upto apr 27,2021'))
console .log (mySubs ('Some random string that does not match'))
If you needed to also report what templates matched, you could add a name to each template, and then carry the results through the two main functions to give results like this:
{"covid": {"date": "apr 27,2021"}, "friends": {"friend-name": "Jessica"}}
It's only slightly more complex:
const regEscape = (s) =>
s .replace (/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
const makeTester = ({name, sample, extract, propName}) => ({
regex: new RegExp (
regEscape (sample .slice (0, sample .indexOf (extract))) +
'(.+)' +
regEscape (sample .slice (sample .indexOf (extract) + extract .length))
),
propName,
name
})
const substitutes = (configs, testers = configs .map (makeTester)) => (sentence) =>
Object.assign ({}, ...testers .map (({name, regex, propName}) => {
const match = sentence .match (regex)
return (match)
? {[name]: {[propName]: match[1]}}
: {}
}))
const configs = [{
name: 'covid',
sample: 'covid-19 testing status upto may 05,2021',
extract: 'may 05,2021',
propName: 'date'
}, {
name: 'friends',
sample: 'Jack and Jones are friends',
extract: 'Jones',
propName: 'friend-name'
}]
const mySubs = substitutes (configs)
console .log (mySubs ('Jack and William are friends'))
console .log (mySubs ('covid-19 testing status upto apr 30,2021'))
console .log (mySubs ('Jack and Jessica are friends who dicsussed covid-19 testing status upto apr 27,2021'))
console .log (mySubs ('Some random string that does not match'))
Either way, this has some limitations. It's hard to figure out what to do if the templates overlap in odd ways, etc. It's also possible that you want to match only complete sentences, and my examples of double-matching won't make sense. If so, you can just prepend a '^' and append a '$' to the string passed to new RegExp.
Again, this is a guess at your requirements. The important thing here is that you might be able to dynamically generate regexes to use.
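If the existing patterns are already stored with {{placeholder}} markers, as in the question, the regexes could also be generated directly from those templates. The following is a rough sketch under that assumption: `compileTemplate` and `matchTemplate` are hypothetical helper names, and the generated regex is anchored with '^' and '$' so that only complete strings match.

```javascript
// Same escaping idea as regEscape above.
const escapeRegex = (s) => s.replace(/[-\/\\^$*+?.()|[\]{}]/g, "\\$&");

// Hypothetical helper: turns "Jack and {{friend-name}} are friends"
// into an anchored regex plus the list of placeholder names.
function compileTemplate(template) {
  const names = [];
  const source = template
    .split(/\{\{([^{}]+)\}\}/) // odd indexes hold the placeholder names
    .map((part, i) => {
      if (i % 2 === 1) {
        names.push(part);
        return "(.+?)"; // one capture group per placeholder
      }
      return escapeRegex(part);
    })
    .join("");
  return { regex: new RegExp("^" + source + "$"), names };
}

// Returns an object of extracted values, or null when nothing matches.
function matchTemplate({ regex, names }, input) {
  const match = input.match(regex);
  if (!match) return null;
  return Object.fromEntries(names.map((name, i) => [name, match[i + 1]]));
}

const friends = compileTemplate("Jack and {{friend-name}} are friends");
console.log(matchTemplate(friends, "Jack and JC are friends"));
console.log(matchTemplate(friends, "Some random string"));
```

Capture groups are read by position rather than by name because a name like friend-name contains a hyphen, which is not allowed in a named capturing group.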

Related

NodeJS - Non-ascii strings. How to determine their encoding, how to extract characters and convert to ascii representation?

I am doing some web scraping with Node-JS and Puppeteer.
Some data is returned to me which looks like this:
{
"priceValue": "£20,000"
}
This string is clearly not an ascii string because of the first character.
How can I determine what encoding/representation, and character width, this string has?
I want to make a logic decision based on the first character. I want to extract the numerical value from this string and assign a "priceCurrency": "GBP" value to the returned object.
To do this, the most logical approach would be to take the first "character" and process it using an if statement.
How can I extract the first character from this string and then compare it to some value as part of an if statement?
How can I convert the remaining string contents to ascii?
Preface
This answer also takes the OP's (currently closed) follow-up question into account: "How to process price / currency strings in a database safe way?".
To work around the OP's ASCII-only limitation, each currency symbol is provided in its Unicode-escaped form, but only where necessary.
Approach / Solution
In addition to the solutions already provided in the other thread, here is another possible implementation. It uses a Map instance for better lookup performance, with the currency values serving as lookup keys.
Splitting the input into a currency part and a not yet fully validated number part is based on a regex which uses the Symbol category of Unicode property escapes, named capturing groups, and an alternation. The alternation supports both cost-value variants, which are ...
either a leading currency-value and a trailing number-value
or a leading number-value and a trailing currency-value.
The pattern is as follows ...
/^(?<currency_before>[A-Z]{3}|(?:(?:[A-Z]{2}\s*)?\p{S}))\s*(?<value_after>[\d.,]+)|(?<value_before>[\d.,]+)\s*(?<currency_after>[A-Z]{3}|(?:(?:[A-Z]{2}\s*)?\p{S}))$/u
... and gets described at the related test-page.
An earlier, stricter pattern which targets only the currency symbol looks like this ...
/(?<symbol_before>\p{S})\s*(?<value_after>[\d.,]+)|(?<value_before>[\d.,]+)\s*(?<symbol_after>\p{S})/u
... and gets described at its related test-page.
The potential number-value is allowed to be formatted either by a comma as thousands-separator and an optional dot as decimal-marker or by a dot as thousands-separator and an optional comma as decimal-marker (e.g. the German way). This is achieved by another capturing regex with the following pattern ...
/^(?:(?<integer>\d+$)|(?<dot_surrogate>(\d+)?[.,]\d+$)|(?<thousands_dot>\d{1,3}\.(?:\d{3}[.,])+\d+$)|(?<thousands_comma>\d{1,3},(?:\d{3}[.,])+\d+$))/
... where the description can be read at its related test-page.
Example Code
// - currency symbols provided in their escaped form when necessary.
const currencyToISOCodeLookup = new Map([
// - map entries, each of following tuple:
// [<currencyKey>, <isoCodeValue>]
['\u20ac', 'EUR'], ['EUR', 'EUR'],
['\u00a3', 'GBP'], ['GBP', 'GBP'],
['$', 'USD'], ['US$', 'USD'], ['USD', 'USD'],
['HK$', 'HKD'], ['HKD', 'HKD'],
]);
// - original code ... same as above ... no escaping.
/*const currencyToISOCodeLookup = new Map([
// - map entries, each of following tuple:
// [<currencyKey>, <isoCodeValue>]
['€', 'EUR'], ['EUR', 'EUR'],
['£', 'GBP'], ['GBP', 'GBP'],
['$', 'USD'], ['US$', 'USD'], ['USD', 'USD'],
['HK$', 'HKD'], ['HKD', 'HKD'],
]);*/
const regXCostPartials =
// see ... [https://regex101.com/r/yPhxbp/3]
/^(?<currency_before>[A-Z]{3}|(?:(?:[A-Z]{2}\s*)?\p{S}))\s*(?<value_after>[\d.,]+)|(?<value_before>[\d.,]+)\s*(?<currency_after>[A-Z]{3}|(?:(?:[A-Z]{2}\s*)?\p{S}))$/u;
const regXCostNumberFormats =
// see ... [https://regex101.com/r/yPhxbp/1]
/^(?:(?<integer>\d+$)|(?<dot_surrogate>(\d+)?[.,]\d+$)|(?<thousands_dot>\d{1,3}\.(?:\d{3}[.,])+\d+$)|(?<thousands_comma>\d{1,3},(?:\d{3}[.,])+\d+$))/;
function parseCostNumber(val) {
const {
integer,
dot_surrogate,
thousands_dot,
thousands_comma,
} = regXCostNumberFormats
.exec(
String(val).trim()
)
?.groups ?? {};
return Number(
thousands_dot?.replace((/\./g), '').replace((/,/), '.') ||
thousands_comma?.replace((/,/g), '') ||
dot_surrogate?.replace((/,/), '.') ||
integer
);
}
function parseCostValue(val) {
const {
currency_before,
currency_after,
value_before,
value_after,
} = regXCostPartials
.exec(
String(val).trim()
)
?.groups ?? {};
const value = parseCostNumber(value_after || value_before);
const code = currencyToISOCodeLookup
.get(
(currency_before || currency_after)?.replace((/\s+/), '')
);
// // here one could throw.
// if (!code || Number.isNaN(value)) {
// throw new EvalError(`Cost could not be parsed from "${ val }".`);
// }
return (!code || Number.isNaN(value))
? null
: { value, code };
}
const costList = [
'£.457,', // - fails.
'€ 2,000.0.0', // - fails.
'HK D 22', // - fails.
'22 HK D', // - fails.
'20EURO', // - fails.
'2 EURO', // - fails.
'€2,000.0.0', // - fails.
' £ .20', // - doesn't fail anymore from this point.
' £ .45 ',
'£ .456 ',
' .457£ ',
' .9999 £',
'0.001€',
'0,02 €',
' EUR 2',
' EUR30 ',
' 421EUR',
'635 EUR',
' €733,0000',
'819.1200 €',
' US$21,000.00 ',
'US $22,000.00 ',
' USD23,000.00 ',
' HK$ 24,000.00 ',
' HK $ 25,000.00 ',
'HKD 26,000.00 ',
'2.856,1124£',
'3,700.999 £',
'22,534.365£',
'€ 6.342.567,40',
'€ 8,263,157.22',
'€74,522,927.33',
];
console.log(
costList
.map(parseCostValue)
);
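For the narrower case asked about in the question (take the first character, compare it, then read the number), a much smaller sketch may be enough. JavaScript strings are sequences of UTF-16 code units, so "£" can be compared directly with `"\u00a3"`. The names `symbolToCode` and `parseSimplePrice` are hypothetical, and the code assumes the symbol always comes first and that "," is only a thousands separator.

```javascript
// Hypothetical lookup; extend it with whatever symbols the scraped site uses.
const symbolToCode = { "\u00a3": "GBP", "\u20ac": "EUR", "$": "USD" };

function parseSimplePrice(priceValue) {
  // Array.from iterates by code point, so a symbol outside the
  // Basic Multilingual Plane would still count as one character.
  const [first, ...rest] = Array.from(priceValue.trim());
  const priceCurrency = symbolToCode[first];
  if (!priceCurrency) return null;
  // Keep digits and the decimal point only (assumes "," is a thousands separator).
  const digits = rest.join("").replace(/[^\d.]/g, "");
  if (digits === "") return null;
  const amount = Number(digits);
  return Number.isNaN(amount) ? null : { priceCurrency, amount };
}

console.log(parseSimplePrice("\u00a320,000"));
```

Unlike the regex-based parser above, this sketch does not handle trailing symbols, ISO codes, or the German number format.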

trying to get the most used word using regex

I am trying to get the 10 most frequent words in the paragraph below, and I need to use a regular expression.
let paragraph = `I love teaching. If you do not love teaching what else can you love. I love Python if you do not love something which can give you all the capabilities to develop an application what else can you love.`;
I want an output like this:
[{word:'love', count:6},
{word:'you', count:5},
{word:'can', count:3},
{word:'what', count:2},
{word:'teaching', count:2},
{word:'not', count:2},
{word:'else', count:2},
{word:'do', count:2},
{word:'I', count:2},
{word:'which', count:1},
{word:'to', count:1},
{word:'the', count:1},
{word:'something', count:1},
{word:'if', count:1},
{word:'give', count:1},
{word:'develop',count:1},
{word:'capabilities',count:1},
{word:'application', count:1},
{word:'an',count:1},
{word:'all',count:1},
{word:'Python',count:1},
{word:'If',count:1}]
This is a solution without regexp, but maybe it is also worth looking at?
const paragraph = `I love teaching. If you do not love teaching what else can you love. I love Python if you do not love something which can give you all the capabilities to develop an application what else can you love.`;
let res=Object.entries(
paragraph.toLowerCase()
.split(/[ .,;-]+/)
.reduce((a,c)=>(a[c]=(a[c]||0)+1,a), {})
).map(([k,v])=>({word:k,count:v})).sort((a,b)=>b.count-a.count)
console.log(res.slice(0,10)) // only get the 10 most frequent words
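Since the question asks for a regular expression, the same counting idea can be combined with String.prototype.match. This is a minimal sketch, assuming words are runs of \w characters and that counting stays case-sensitive (so "If" and "if" remain separate, as in the expected output); `topWords` is a hypothetical name.

```javascript
const text = `I love teaching. If you do not love teaching what else can you love. I love Python if you do not love something which can give you all the capabilities to develop an application what else can you love.`;

function topWords(input, limit = 10) {
  const counts = new Map();
  // /\b\w+\b/g returns every whole word; punctuation is never part of a match.
  for (const word of input.match(/\b\w+\b/g) ?? []) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return [...counts]
    .map(([word, count]) => ({ word, count }))
    .sort((a, b) => b.count - a.count) // highest count first
    .slice(0, limit);
}

console.log(topWords(text));
```

Because the regex never matches punctuation, no empty string ends up in the counts when the text ends with a full stop.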
I have something a bit messy, but it uses regex and displays the 10 most frequently occurring words, which is what you asked for.
Test it and let me know if it works for you.
let paragraph = "I love teaching. If you do not love teaching what else can you love. I love Python if you do not love something which can give you all the capabilities to develop an application what else can you love.";
//remove periods, because teaching and teaching. will appear as different results set
paragraph = paragraph.split(".").join("");
//results array where results will be stored
var results = []
//separate each word from the paragraph
paragraph.split(" ").forEach((word) => {
//\b word boundaries make sure only whole words are counted, e.g. "an" is not counted inside "can"
const wordCount = paragraph.match(new RegExp("\\b" + word + "\\b","g")).length
//concatenate the word to its occurrence count, e.g. "I : 2" means I has appeared 2 times
const res = word + " : " + wordCount;
//check if the word has been added to results
if(!results.includes(res)){
//if not, push
results.push(res)
}
})
function sortResultsByOccurences(resArray) {
//sort the results in place, from highest to lowest occurrence
resArray.sort(function(a, b) {
//replace(/\D/g, "") strips everything that is not a digit, leaving only the count to compare
//the 10 passed to parseInt is the radix, i.e. we parse decimal numbers
return(parseInt(b.replace(/\D/g, ""), 10) -
parseInt(a.replace(/\D/g, ""), 10));
});
return(resArray);
}
//reassign results as sorted
results = sortResultsByOccurences(results);
for(let i = 0; i < 10; i++){//for loop is used to display top 10
console.log(results[i])
}
To get all words in a sentence, start with this regular expression:
/(\w+)(?=\s)/g
Applied to your input string, it returns every word that is followed by whitespace, so a word that ends with a full stop (.), such as "love.", is not matched.
So, in this case we have to extend the lookahead:
/(\w+)(?=(\s|\.))/g
Similarly, add any other special characters (, ; ? ...) that can follow a word:
paragraph.match(/(\w+)(?=(\s|\.|\,|\;|\?))/gi)
This is your solution (please add other special characters if required).
let paragraph = `I love teaching. If you do not love teaching what else can you love. I love Python if you do not love something which can give you all the capabilities to develop an application what else can you love.`;
let objArr = [];
[...new Set(paragraph.match(/(\w+)(?=(\s|\.|\,|\;|\?))/gi))].forEach(ele => {
objArr.push({
'word': ele,
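// \b stops "an" from also being counted inside "can"; no "i" flag, so "If" and "if" stay separate
'count': paragraph.match(new RegExp('\\b' + ele + '(?=(\\s|\\.|\\,|\\;|\\?))', 'g'))?.length
})
});
objArr.sort((x,y) => y.count - x.count);
console.log(objArr.slice(0, 10)); // the 10 most frequent words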

splitting an string list with proper aligning the string elements

Example: Company 1,company ltd 2,company, Inc.,company Nine nine, ltd,company ew
Here is an example of the string. I want to split it so that Company 1 is treated as one company and company, Inc. as one as well. With the logic below, however, company, Inc. is split into two companies. How can I resolve this? For strings like company, Inc. I want to get a single element only.
const companies = company.split(",");
Here the string can be anything; this is just an example, and the names can be arbitrary. So I am looking for generic logic which works for any string having the same structure.
Note: $ stands in for the comma (,) at each separation point. I used it only to make clear where I need to split the string.
Object:
Example 1
{
_id: 5de4debcccea611e4d14d4d5
companies: One Bros. Inc. & Might Bros. Dist. Corp.$Pages, Inc.$Google Inc. Search$Aphabet Inc. tech.
}
Example 2
{
_id: 5de4debccc333611e4d14d4f5
companies: Google Comp. Inc.$Google Comp. Inc. Estd.$Tree, Ltd.$Tree, Ltd.
}
First I split on 'ompany' rather than 'company', because you have one instance of 'Company' with a capital C -- see the output of the first console log within a comment below.
Then I put things back together using reduce. map is not the right choice here, because the result needs one element fewer than the number of fragments I generated. For the same reason, the first thing I do inside my reduce is ensure I do not look beyond the end of the array.
Then I split each fragment on ',' and pop off the last element, which gives either "C" or "c", and put it back together with "ompany". Then I strip the trailing ',c' from the next fragment and append the result to the company. Finally I add the entire result to the array I'm generating with reduce. See comment results at bottom. Also here it is on repl.it: https://repl.it/@dexygen/splitOnCompanyStringLiteral
This is a fairly concise way to do this, but again, if you can do anything to improve your data, you won't have to use such unnecessarily complicated code.
const companiesStr = "Company 1,company ltd 2,company, Inc.,company Nine nine, ltd,company ew";
const companySuffixFragments = companiesStr.split("ompany");
console.log(companySuffixFragments);
/*
[ 'C', ' 1,c', ' ltd 2,c', ', Inc.,c', ' Nine nine, ltd,c', ' ew' ]
*/
const companiesArr = companySuffixFragments.reduce((companies, fragment, index, origArr) => {
if (index < companySuffixFragments.length - 1) {
let company = fragment.split(',').pop() + 'ompany'
company = company + origArr[index + 1].replace(/,c$/, '');
companies.push(company);
}
return companies
}, []);
console.log(companiesArr);
/*
[ 'Company 1',
'company ltd 2',
'company, Inc.',
'company Nine nine, ltd',
'company ew' ]
*/
First replace the , inside company names with some other symbol (I am using & here), then split the string on , and restore the comma afterwards.
var str= 'Company 1,company ltd 2,company, Inc.,company Nine nine, ltd,company ew';
str = str.replace(', Inc.','& Inc.');
/*str = str.replace(', ltd','& ltd');*/
console.log(str.split(',').map((e)=>{return e.replace('&',',').trim()}));
Try the solution below.
var str = ["company 1","company ltd 2","company", "Inc.","company Nine nine", "ltd","company ews"];
var str2 = str.toString();
var str3 = str2.split("company");
function myFunction(item, index, arr) {
  if (item != "") {
    // replace the leftover commas with spaces and restore the "Company" prefix
    let var2 = item.replace(/,/g, " ");
    arr[index] = "Company" + var2;
  }
}
str3.forEach(myFunction);
Output:
str3
(6) ["", "Company 1 ", "Company ltd 2 ", "Company Inc. ", "Company Nine nine ltd ", "Company ews"]
Then remove the first element of the array.
As has been commented I'd try to get a more clean String so that you don't have to write "strange" code to get what you need.
If you can't do that right now this code should solve your problem:
let string = 'Company 1,company ltd 2,company, Inc.,company Nine nine, ltd,company ew';
let array = string.split(',');
const filterFnc = (array) => {
  let newArr = [];
  for (let i = 0; i < array.length; i++) {
    if (array[i].toLowerCase().indexOf('company') !== -1) {
      // this fragment starts a new company name
      newArr.push(array[i]);
    } else {
      // this fragment (e.g. " Inc.") belongs to the previous company name
      newArr.splice(newArr.length - 1, 1, `${newArr[newArr.length - 1]},${array[i]}`);
    }
  }
  return newArr;
};
let filteredArray = filterFnc(array);
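If every entry really does start with the word "company" (in any case), as in the example string, the split itself can be restricted with a lookahead so that commas inside a name are left alone. This is only a sketch under that assumption; it would not help for arbitrary names such as those in Example 1 and Example 2 of the question.

```javascript
const companiesStr = "Company 1,company ltd 2,company, Inc.,company Nine nine, ltd,company ew";

// Split only at commas that are directly followed by "company" (any case).
// The lookahead (?=...) checks the following text without consuming it.
const companies = companiesStr.split(/,(?=company)/i);

console.log(companies);
```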

How to match a verb in any tense in Compromise.js

The rather excellent compromise.js offers, among other things, a match function.
I'm struggling to get it to work on variants of a verb:
var nlp = require('compromise');
var sentences = [
'I am discharging you',
'I have discharged you',
'I will discharge him',
'I discharged you',
'Monkey'
];
let doc = nlp(sentences.join('. '));
console.log(doc.match('discharge').sentences().out('text'));
/* Output:
discharge
*/
Above only matches 1 sentence out of an expected 4.
How can I get it to match all 4 sentences shown above that contain a conjugate of the word 'discharge'?
Running the following does correctly find the conjugations of the verb 'discharge':
doc.verbs().conjugate()
/* Output:
[ { PastTense: 'discharged',
PresentTense: 'discharges',
Infinitive: 'discharge',
Gerund: 'discharging',
Actor: 'discharger',
FutureTense: 'will discharge' },
{ PastTense: 'had',
PresentTense: 'has',
Infinitive: 'have',
Gerund: 'having',
Actor: 'haver',
Participle: 'had',
FutureTense: 'will have' },
{ PastTense: 'discharged',
PresentTense: 'discharges',
Infinitive: 'discharge',
Gerund: 'discharging',
Actor: 'discharger',
FutureTense: 'will discharge' },
{ PastTense: 'discharged',
PresentTense: 'discharges',
Infinitive: 'discharge',
Gerund: 'discharging',
Actor: 'discharger',
FutureTense: 'will discharge' } ]
*/
The goal of .match() is to provide a quick way to describe any grammatical pattern, or match condition, using a human-readable, and mostly-reasonable style. Ref
You can use a regex pattern in match, and you don't need sentences():
// nlp is available as a global, provided by the script tag below
var sentences = ['I am discharging you','I have discharged you','I will discharge him','I discharged you','Monkey'];
let doc = nlp(sentences.join('. '));
console.log(doc.match('/discharg(ing|e|ed)/').out('text'));
// to capture all verbs
console.log(doc.match('#verb').out('array'));
<script src="https://unpkg.com/compromise@latest/builds/compromise.min.js"></script>
Early versions of compromise tried to store a 'root' conjugation for every verb for this purpose, but it became too slow on large texts.
Perhaps the best way to do this is to conjugate the terms in the document to a known tense, then look for it.
let doc = nlp('i discharged and was discharging')
doc.verbs().toInfinitive()
doc.match('discharge').length
// 2
https://runkit.com/spencermountain/5d080c35d95eb800198fcc78
cheers

convert number into number and quantifier

How would I convert something like 1200000 to £1.2m but also convert 2675000 to £2.675m?
I can get the second one to work, but the first one comes out as £12m rather than £1.2m.
I have the number in a variable, so:
`£${salePrice.toString().replace(/0+$/g, '').replace(/\B(?=(\d{3})+(?!\d))/g, '.')}m`
How would I change the second replace to make this work? I guess that is the one causing the issue.
As long as these pass:
1200000
1220000
1222000
1222200
1222220
1222222
1020000
1022200
and so on and so forth; all of them need to pass.
You have Number.prototype.toFixed() option available.
const data = [
2675000,
1200000,
1220000,
1222000,
1222200,
1222220,
1222222,
1020000,
1022200
];
const formatted = data.map(x=> (x/1000000).toFixed(3).replace(/0+$/g, '')); // ["2.675", "1.2", "1.22", "1.222", "1.222", "1.222", "1.222", "1.02", "1.022"]
I haven't included the part with the currency, because you had that figured out already. Shilly's answer is really beautiful. I'm simply proposing another solution, which is a bit shorter.
You divide them by the precision you need. I would advise keeping numbers as numbers for as long as possible, since strings used as numbers follow text rules instead of math rules, so you'd have to parse them back to numbers to do anything meaningful apart from formatting the output.
const data = [
2675000,
1200000,
1220000,
1222000,
1222200,
1222220,
1222222,
1020000,
1022200
];
const format_currency = ( prefix, value, precision, suffix ) => `${ prefix }${ value / precision }${ suffix }`;
const million = {
symbol: 'm',
value: 1000000
};
const pounds = '£';
const results = data.map( entry => format_currency( pounds, entry, million.value, million.symbol ));
console.log( results );
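Both ideas can be combined into one helper that picks the unit from the size of the number. This is only a sketch: the `UNITS` table and `formatAmount` are hypothetical names, the 'k' and 'bn' suffixes are assumptions that go beyond the question, and the result is capped at three decimal places.

```javascript
// Hypothetical unit table, largest first.
const UNITS = [
  { value: 1e9, symbol: 'bn' },
  { value: 1e6, symbol: 'm' },
  { value: 1e3, symbol: 'k' },
];

function formatAmount(amount, prefix = '£') {
  // Pick the largest unit that the amount reaches.
  const unit = UNITS.find((u) => Math.abs(amount) >= u.value);
  if (!unit) return `${prefix}${amount}`;
  // toFixed(3) caps the decimals at 3; Number() then drops trailing zeros
  // and also avoids a dangling "1." for whole millions.
  const scaled = Number((amount / unit.value).toFixed(3));
  return `${prefix}${scaled}${unit.symbol}`;
}

console.log([1200000, 2675000, 1020000, 1000000].map((n) => formatAmount(n)));
```

The number stays a number until the last step, and only the final template literal turns it into text.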
