Text Scraping in JavaScript

I have a dynamic text which looks something like this
my_text = "address ae fae daq ad, 1231 asdas landline 213121233 -123 mobile 513121233 cell (132) -142-3127
email sdasdas#gmail.com , sdasd as#yahoo.com - ewqas#gmail.com"
The text starts with 'address'. As soon as we see 'address', we need to scrape everything from there until 'landline', 'mobile' or 'cell' appears. From there on, we want to scrape all the phone text (without altering the spaces in between): we start from the first occurrence of 'landline', 'mobile' or 'cell' and stop as soon as 'email' appears.
Finally we scrape the email part (without altering spaces in between)
'landline'/'mobile'/'cell' can appear in any order and sometimes some may not appear.
For example, the text could have looked like this as well.
my_text = "address ae fae daq ad, 1231 asdas
cell (132) -142-3127 landline 213121233 -123
email sdasdas#gmail.com , sdasd as#yahoo.com - ewqas#gmail.com"
There's a little more engineering that needs to be done to form arrays of subtext contained in address, phones and email text.
Subtexts of addresses are always separated with commas (,).
Subtexts of emails can be separated with commas (,) or hyphens (-).
My output should be a JSON dictionary which looks something like this:
resultant_dict = {
addresses: [{
address: "ae fae daq ad"
}, {
address: "1231 asdas"
}],
phones: [{
number: "213121233 -123",
kind: "landline"
}, {
number: "513121233",
kind: "mobile"
}, {
number: "(132) -142-3127",
kind: "cell"
}],
emails: [{
email: "sdasdas#gmail.com",
connector: ""
}, {
email: "sdasd as#yahoo.com",
connector: ","
}, {
email: "ewqas#gmail.com",
connector: "-"
}]
}
I am trying to achieve this thing using regular expressions or any other way in JavaScript. I can't figure out how to write this as I am a novice programmer.

Your requirements are a bit twisted... Plural map keys, the section name repeated as a key in each item... Moreover, what about a dedicated array for each "kind" of phone? We can get the expected result for sure, but it seems pretty useless at first glance. Anyway, here is a starting point:
var str = 'address ae fae daq ad, 1231 asdas landline 213121233 -123 mobile 513121233 cell (132) -142-3127 email sdasdas#gmail.com , sdasd as#yahoo.com - ewqas#gmail.com';
// find sections
var s = 'address|landline|mobile|cell|email';
var reSections = new RegExp('(' + s + ').*?(?=' + s + '|$)', 'g');
var slices = str.match(reSections);
document.body.innerHTML += (
'<b>Step 1 - Find sections</b>' +
'<pre>' + JSON.stringify(slices, 0, 2) + '</pre>\n'
);
// make a map
var map = {
address: [],
phone: [],
email: []
};
var reTrim = /^\s+|\s+$/g;
var reSanitize = /\s+(-|,)\s+/g;
var reSection = /^(\w+)(.*)$/;
slices.forEach(function (section) {
var m = section.match(reSection);
var category = 'email address'.indexOf(m[1]) !== -1 ? m[1] : 'phone';
var values = m[2].replace(reSanitize, ',').split(',');
map[category] = map[category].concat(values.map(function (value) {
return { kind: m[1], value: value.replace(reTrim, '') };
}));
});
document.body.innerHTML += (
'<b>Step 2 - Make a map</b>' +
'<pre>' + JSON.stringify(map, 0, 2) + '</pre>\n'
);

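A slightly hackish solution, but it works.
Try this:
var mymap = {};
var a = str;
var keys = ["address", "landline", "mobile", "cell", "email"];
// mark the first occurrence of each keyword with "##"
keys.forEach(function (key) { a = a.replace(key, "##" + key); });
console.log(a);
// split on the marker; the first word of each chunk is the keyword
a.split("##").forEach(function (chunk) {
if (!chunk) return; // skip the empty chunk before the first marker
var x = chunk.split(" ");
mymap[x[0]] = x.slice(1).join(" ").trim();
});
console.log(mymap);
mymap will contain all the fields you are looking for. You can then process it to create JSON in your format.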
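For reference, here is a minimal sketch that goes all the way to the structure asked for in the question. The function name parseContact is hypothetical, and it assumes that a hyphen only separates emails when it has whitespace on both sides (so a hyphen inside an address is left alone) and that the keywords never occur inside the values.

```javascript
function parseContact(text) {
  // Collapse line breaks so the lazy match below can span the whole text.
  const flat = text.replace(/\s*\n\s*/g, ' ').trim();
  const keywords = 'address|landline|mobile|cell|email';
  // Each match is a keyword plus everything up to the next keyword (or the end).
  const sectionRe = new RegExp(
    '\\b(' + keywords + ')\\b(.*?)(?=\\b(?:' + keywords + ')\\b|$)', 'g');
  const result = { addresses: [], phones: [], emails: [] };

  for (const [, kind, body] of flat.matchAll(sectionRe)) {
    if (kind === 'address') {
      body.split(',').forEach((part) => {
        result.addresses.push({ address: part.trim() });
      });
    } else if (kind === 'email') {
      // The capturing group makes split() keep the separators in the array:
      // [email, separator, email, separator, email, ...]
      const parts = body.trim().split(/\s*(,|(?<=\s)-(?=\s))\s*/);
      for (let i = 0; i < parts.length; i += 2) {
        result.emails.push({ email: parts[i], connector: i === 0 ? '' : parts[i - 1] });
      }
    } else {
      // landline, mobile or cell: keep the inner spacing, trim the ends
      result.phones.push({ number: body.trim(), kind: kind });
    }
  }
  return result;
}

const contactText = 'address ae fae daq ad, 1231 asdas landline 213121233 -123 ' +
  'mobile 513121233 cell (132) -142-3127\n' +
  'email sdasdas#gmail.com , sdasd as#yahoo.com - ewqas#gmail.com';
const contact = parseContact(contactText);
console.log(JSON.stringify(contact, null, 2));
```

Because the sections are found by keyword, the order of landline, mobile and cell does not matter, and a missing one is simply skipped.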

Related

splitting an string list with proper aligning the string elements

Example string: Company 1,company ltd 2,company, Inc.,company Nine nine, ltd,company ew
I want to split it so that Company 1 counts as one company and company, Inc. also counts as one. With the logic below, however, company, Inc. is treated as two companies. How can I resolve this? For strings like company, Inc. I want a single element.
const companies = company.split(",");
The string can be anything; this is just an example, and the names can be any names. So I am looking for generic logic that works for any string with the same structure.
Note: in the examples below, $ stands for a comma (,) that acts as a separation point. I use it only to make clear where the string should be split.
Object:
Example 1
{
_id: 5de4debcccea611e4d14d4d5
companies: One Bros. Inc. & Might Bros. Dist. Corp.$Pages, Inc.$Google Inc. Search$Aphabet Inc. tech.
}
Example 2
{
_id: 5de4debccc333611e4d14d4f5
companies: Google Comp. Inc.$Google Comp. Inc. Estd.$Tree, Ltd.$Tree, Ltd.
}
First I split on 'ompany' rather than 'company', because you have one instance of 'Company' with a capital C -- see the output of the first console log within a comment below.
Then I put things back together using reduce. map is not the right choice here, because the result needs one fewer element than the number of fragments I generated. For the same reason, the first thing I do inside the reduce callback is make sure I do not look beyond the end of the array.
Then I split each fragment and pop off the last element, which just puts either "C" or "c" back together with "ompany". Then I replace any trailing ',c' from the next fragment with an empty string, and add the result to the company. Finally I add the entire result to the array I'm generating with reduce. See comment results at bottom. Also here it is on repl.it: https://repl.it/#dexygen/splitOnCompanyStringLiteral
This is a fairly concise way to do this but again if you can do anything to improve your data, you won't have to use such unnecessarily complicated code.
const companiesStr = "Company 1,company ltd 2,company, Inc.,company Nine nine, ltd,company ew";
const companySuffixFragments = companiesStr.split("ompany");
console.log(companySuffixFragments);
/*
[ 'C', ' 1,c', ' ltd 2,c', ', Inc.,c', ' Nine nine, ltd,c', ' ew' ]
*/
const companiesArr = companySuffixFragments.reduce((companies, fragment, index, origArr) => {
if (index < companySuffixFragments.length - 1) {
let company = fragment.split(',').pop() + 'ompany'
company = company + origArr[index + 1].replace(/,c$/, '');
companies.push(company);
}
return companies
}, []);
console.log(companiesArr);
/*
[ 'Company 1',
'company ltd 2',
'company, Inc.',
'company Nine nine, ltd',
'company ew' ]
*/
First replace the commas that belong to a name with another symbol (I am using & here), then split the string on , and restore the commas:
var str= 'Company 1,company ltd 2,company, Inc.,company Nine nine, ltd,company ew';
str = str.replace(', Inc.','& Inc.');
/*str = str.replace(', ltd','& ltd');*/
console.log(str.split(',').map((e)=>{return e.replace('&',',').trim()}));
Try the solution below.
var str = ["company 1","company ltd 2","company", "Inc.","company Nine nine", "ltd","company ews"];
var str2 =str.toString()
var str3 = str2.split("company")
function myFunction(item, index, arr) {
if (item != "") {
let var2 = item.replace(/,/g, " ");
var2 = "Company" + var2;
arr[index] = var2;
}
}
str3.forEach(myFunction)
Output:
str3
(6) ["", "Company 1 ", "Company ltd 2 ", "Company Inc. ", "Company Nine nine ltd ", "Company ews"]
And remove the first element of the array.
As has been commented I'd try to get a more clean String so that you don't have to write "strange" code to get what you need.
If you can't do that right now this code should solve your problem:
let string = 'Company 1,company ltd 2,company, Inc.,company Nine nine, ltd,company ew';
let array = string.split(',');
const filterFnc = (array) => {
let newArr = [],
i = 0;
for(i = 0; i < array.length; i++) {
if(newArr.length === 0 || array[i].toLowerCase().indexOf('company') !== -1) {
newArr.push(array[i]);
} else {
// no "company" in this piece, so it belongs to the previous entry: re-attach it with its comma
newArr[newArr.length - 1] += ',' + array[i];
}
}
return newArr;
};
let filteredArray = filterFnc(array);
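A shorter variant of the same idea is to split only at commas that are directly followed by the word "company", using a lookahead. This is a sketch under the same assumption the answers above make (every entry starts with "company" in some letter case); it is not a generic solution for arbitrary names. The variable names are made up for the example.

```javascript
const companyList = 'Company 1,company ltd 2,company, Inc.,company Nine nine, ltd,company ew';

// The lookahead (?=company) checks what follows the comma without consuming it,
// so the word "company" stays in the next element. The i flag ignores case.
const companyNames = companyList.split(/,(?=company)/i);

console.log(companyNames);
// [ 'Company 1', 'company ltd 2', 'company, Inc.', 'company Nine nine, ltd', 'company ew' ]
```

Commas followed by a space, as in ", Inc." or ", ltd", are not split points here, which is why those names stay in one piece.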

How do I add <b> (for bolding font) to part of text inside string interpolation for sns message?

I want to take a string in this format: I am a level ${level} coder, where ${level} will be some value passed in. But I want only specific words in this sentence bolded. So let's say in this example I want "level" and "coder" bolded. How do I achieve this?
Current Behavior:
Even if I do <b> or <strong> inside `` the tags just get converted to string. It doesn't actually bold the text for me.
Update: This is exactly what I am doing with aws sns. But I want to achieve this with string interpolation.
let snsData = {
Message: < strong > "This is an automated message" < /strong> + '\n' +
"You have successfully uploaded the following:" + '\n'
`File name: ${snsFileName}\n
Number of lines: ${numberOfLines}\n
If there are any issues, please contact XXX for assistance.`,
Subject: 'Successfully Uploaded to XX',
TopicArn: 'XXXXX'
};
Addendum: this is entirely specific to your case, and I suggest familiarizing yourself with how string interpolation and objects work. For the sake of learning, here is an example. Cheers:
const $ = function(id) { return document.getElementById(id) },
level = 'expert',
str = `I am a level <strong>${level} coder</strong>`,
snsFileName = 'testFileNameBlah',
numberOfLines = 99,
snsData = {
Message: '<strong>This is an automated message</strong><br/>' +
'You have successfully uploaded the following:<br/>' +
`File name: <strong>${snsFileName}</strong><br/>
Number of lines: <strong>${numberOfLines}</strong><br/>` +
'If there are any issues, please contact XXX for assistance.<br/>',
Subject: 'Successfully Uploaded to XX',
TopicArn: 'XXXXX'
};
$('blah').innerHTML = str + '<hr>';
$('fixme').innerHTML = snsData.Message + snsData.Subject;
<span id="blah"></span>
<h2>Addendum</h2>
<p id="fixme"></p>
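To bold only individual words inside the interpolated string, one option is a small helper that wraps a value in tags. This is only a sketch: bold is a hypothetical helper name, and the result is still a plain string, so the words appear bold only where the string is rendered as HTML (for example through innerHTML as above). Anything that treats the message as plain text will show the tags literally.

```javascript
// Hypothetical helper: wraps any value in <strong> tags.
const bold = (text) => `<strong>${text}</strong>`;

const coderLevel = 'expert';
// Only the words passed through bold() get tags; the interpolated level stays plain.
const sentence = `I am a ${bold('level')} ${coderLevel} ${bold('coder')}`;

console.log(sentence);
// I am a <strong>level</strong> expert <strong>coder</strong>
```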

Filter data out of long string (vcard)

I'm scanning data from a vcard QR-code. The string I receive always looks something like this:
BEGIN:VCARD
VERSION:2.1
N:Lastname;Firstname
FN:Firstname Lastname
ORG:Lol Group
TITLE:Project Engineer
TEL;WORK:+32 (0)11 12 13 14
ADR;WORK:Industrielaan 1;2250 Olen;Belgium
EMAIL:link.com
URL:http://www.link.com
END:VCARD
I need some data to automatically fill in the form (I'm doing this in jQuery). I need the firstname, lastname, organisation and telephone number.
So I need the data after N, ORG and TEL. But I'm really stuck on how I could do this the best way. Any experience with this and maybe some tips for me?
UPDATE:
The data varies at times. These are the possibilities:
OPTION 1
BEGIN:VCARD
VERSION:3.0
N:lname;fname;;;
FN:fname lname
TITLE:Project manager
EMAIL;type=INTERNET;type=WORK:s.demesqdqs.be
TEL;type=WORK:+3812788105
END:VCARD
OPTION 2
BEGIN:VCARDFN:Barend VercauterenTEL:+32(0)9 329 93 06EMAIL:Barend.Vercauterenëesc.beURL:http://www.esc.beN:Vercauteren;BarendADR:Grote Steenweg 39;9840;De PinteORG:ESC bvbaROLE:sales consultantVERSION:3.0END:VCARD
OPTION 3
BEGIN:VCARDVERSION:2.1N:Deblieck;Tommy;;DhrFN:Tommy DeblieckTITLE:ZaakvoerderORG:QBMT bvbaADR:;;Kleine Pathoekweg 44;Brugge;West-Vlaanderen;8000;Belgi≠A0171TEL;WORK;PREF:+32 479302972TEL;CELL:+32 479302972EMAIL:tdëqbmt.beURL:www.qbis.beEND:VCARD
As you can see, the fields are sometimes run together with no line breaks between them.
My code for receiving the correct data with option 1:
var fname = /FN:(.*)/g;
var org = /ORG:(.*)/g;
var tel = /TEL;[^:]*:(.*)/g;
var fullname, firstname, morg, mtel;
fullname = fname.exec(qr_data);
fullname = fullname[1];
var array = fullname.split(' ');
// the first word is the first name; everything after it is the last name
firstname = array.shift();
var lastname = array.join(' ');
morg = org.exec(qr_data);
mtel = tel.exec(qr_data);
if(firstname)
{
$("#firstname").val(firstname);
}
if(lastname)
{
$("#name").val(lastname);
}
if(morg)
{
$("#company").val(morg[1]);
}
if(mtel)
{
$("#number").val(mtel[1]);
}
But how can I get these data with the other 2 options?
Use regular expressions to extract the data.
For name = /FN:(.*)/g
For organization = /ORG:(.*)/g
For telephone = /TEL;[^:]*:(.*)/g
Check out this fiddle.
var fname = /FN:(.*)/g;
var org = /ORG:(.*)/g;
var tel = /TEL;[^:]*:(.*)/g;
var str = 'BEGIN:VCARD\nVERSION:2.1\nN:Lastname;Firstname\nFN:Firstname Lastname\nORG:Lol Group\nTITLE:Project Engineer\nTEL;WORK:+32 (0)11 12 13 14\nADR;WORK:Industrielaan 1;2250 Olen;Belgium\nEMAIL:link.com\nURL:http://www.link.com\nEND:VCARD';
var mname, morg, mtel;
mname = fname.exec(str);
morg = org.exec(str);
mtel = tel.exec(str);
alert(mname[1]);
alert(morg[1]);
alert(mtel[1]);
In order to parse a vCard correctly, you cannot rely on a single regular expression. There are vCard parsers that you can leverage.
Here is an example of using Nilclass vCardJS:
VCF.parse(input, function(vcard) {
// this function is called with a VCard instance.
// If the input contains more than one vCard, it is called multiple times.
console.log("Names: ", JSON.stringify(vcard.n)); // Names
console.log("Org: ", JSON.stringify(vcard.org)); // Org
console.log("Tel: ", JSON.stringify(vcard.tel)); // Tel
});
Here are all defined fields:
VCard.allKeys = [
'fn', 'n', 'nickname', 'photo', 'bday', 'anniversary', 'gender',
'tel', 'email', 'impp', 'lang', 'tz', 'geo', 'title', 'role', 'logo',
'org', 'member', 'related', 'categories', 'note', 'prodid', 'rev',
'sound', 'uid'
];
UPDATE:
Here is a regex that you might try. However, it might not be complete, and you will have to adjust it as you get more different field names in the vCard:
(begin|end|version|cell|adr|nickname|photo|bday|anniversary|gender|tel|email|impp|lang|tz|geo|title|role|logo|org|member|related|categories|note|prodid|rev|sound|uid|fn|n):(.*?)(?=(?:begin|end|version|cell|adr|nickname|photo|bday|anniversary|gender|tel|email|impp|lang|tz|geo|title|role|logo|org|member|related|categories|note|prodid|rev|sound|uid|fn|n):|\n|$)
The first capturing group will contain a field name and the second will contain the field value. Again, you'd be safer with a dedicated parser.
var re = /(begin|end|version|cell|adr|nickname|photo|bday|anniversary|gender|tel|email|impp|lang|tz|geo|title|role|logo|org|member|related|categories|note|prodid|rev|sound|uid|fn|n):(.*?)(?=(?:begin|end|version|cell|adr|nickname|photo|bday|anniversary|gender|tel|email|impp|lang|tz|geo|title|role|logo|org|member|related|categories|note|prodid|rev|sound|uid|fn|n):|\n|$)/gi;
var str = 'BEGIN:VCARDVERSION:2.1N:Deblieck;Tommy;;DhrFN:Tommy DeblieckTITLE:ZaakvoerderORG:QBMT bvbaADR:;;Kleine Pathoekweg 44;Brugge;West-Vlaanderen;8000;Belgi≠A0171TEL;WORK;PREF:+32 479302972TEL;CELL:+32 479302972EMAIL:tdëqbmt.beURL:www.qbis.beEND:VCARD';
var m;
while ((m = re.exec(str)) !== null) {
if (m.index === re.lastIndex) {
re.lastIndex++;
}
if (m[1].toLowerCase() === "n") {
document.write("Names: " + m[2] + "<br/>");
}
else if (m[1].toLowerCase() === "org") {
document.write("Org: " + m[2] + "<br/>");
}
else if (m[1].toLowerCase().indexOf("tel") === 0 ||
m[1].toLowerCase().indexOf("cell") === 0) {
document.write("Tel.: " + m[2] + "<br/>");
}
}
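When each property sits on its own line (the first sample and Option 1), a small line-based reader avoids listing every field name in a regex. The sketch below is not a full vCard parser: readVCardLines is a hypothetical name, parameters such as ;type=WORK are simply dropped, and it does not handle the run-together Options 2 and 3.

```javascript
function readVCardLines(text) {
  // No prototype, so property names can never clash with built-in keys.
  const fields = Object.create(null);
  text.split(/\r?\n/).forEach((line) => {
    const colon = line.indexOf(':');
    if (colon === -1) return; // not a "NAME:value" line
    // "TEL;type=WORK" -> "TEL": keep only the part before the first semicolon
    const name = line.slice(0, colon).split(';')[0].trim().toUpperCase();
    const value = line.slice(colon + 1).trim();
    // A property such as TEL can occur several times, so collect values in arrays.
    (fields[name] = fields[name] || []).push(value);
  });
  return fields;
}

const vcardText = [
  'BEGIN:VCARD',
  'VERSION:3.0',
  'N:lname;fname;;;',
  'FN:fname lname',
  'TITLE:Project manager',
  'TEL;type=WORK:+3812788105',
  'END:VCARD'
].join('\n');

const card = readVCardLines(vcardText);
// N holds "last;first;..." so the two leading parts are the names.
const [lastName, firstName] = card.N[0].split(';');
console.log(firstName, lastName, card.TEL[0], card.ORG);
```

Fields missing from the card (ORG in this sample) come back as undefined, so the form-filling code still needs the same if checks as in the question.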

how to parse & format content of text into object

As the title says, I need to extract content from a long text with certain fields.
I have this text as below
Name: David Jones
Office Address: 148 Hulala Street Date: 24/11/2013
Agent No: 1234,
Address: 259 Yolo Road Start Date: 22/11/2013 Due Date: 29/11/2013
Type: Human Properties: None Ago: 29
And I have these labels for specific fields in the text
Name, Office Address, Date, Agent No, Address, Type, Properties, Age
And the result I want to get is
Name: 'David Jones',
Office Address: '148 Hulala Street',
Date: '24/11/2013',
Agent No: '1234',
Address: '259 Yolo Road',
Type: 'Human',
Properties: 'None',
Age: ''
that has completely parsed the content with each field. Important thing to note here is the original text can possibly have typo (E.g., Ago instead of Age) and extra fields that do not exist in the list of labels (E.g., Start Date and Due Date do not exist in the label list). So the code will ignore any un-matching text and try to find only matching result.
I tried to resolve this by going through loops for each line, check if a line contains the field, and see if the line also contains more fields.
Currently I have the following code.
structure = ['Name','Office Address','Date','Agent No','Address','Type','Properties','Age'];
obj = {};
for (i = 0; i < textLines.length; i++) {
matchingFields = [];
for (j = 0; j < structure.length; j++) {
if (textLines[i].indexOf(structure[j] + ':') !== -1) {
if (matchingFields.length === 0 && textLines[i].indexOf(structure[j] + ':') === 0) {
matchingFields.push(structure[j]);
structure.splice(structure.indexOf(structure[j--]), 1);
} else if (textLines[i].indexOf(structure[j] + ':') > textLines[i].indexOf(matchingFields[matchingFields.length-1])) {
matchingFields.push(structure[j]);
structure.splice(structure.indexOf(structure[j--]), 1);
}
}
} // end of the first j loop
for (j = 0; j < matchingFields.length; j++) {
if (j !== matchingFields.length-1) {
obj[matchingFields[j]] = textLines[i].slice(textLines[i].indexOf(matchingFields[j]) + matchingFields[j].length, textLines[i].indexOf(matchingFields[j+1]));
} else {
obj[matchingFields[j]] = textLines[i].slice(textLines[i].indexOf(matchingFields[j]) + matchingFields[j].length);
}
obj[matchingFields[j]] = obj[matchingFields[j]].replace(':', '');
if (obj[matchingFields[j]].indexOf(' ') === 0) {
obj[matchingFields[j]] = obj[matchingFields[j]].replace(' ', '');
}
if (obj[matchingFields[j]].charAt(obj[matchingFields[j]].length-1) === ' ') {
obj[matchingFields[j]] = obj[matchingFields[j]].slice(0, obj[matchingFields[j]].length-1);
}
}
}
In some cases it works fine, but when both 'Office Address:' and 'Address:' are present, the value for 'Office Address:' ends up in 'Address:'. Besides, the code looks messy and ugly, and it feels like brute forcing.
I guess there should be a better way, for example using regular expressions or something similar, but without any external library.
If you have any ideas, I would appreciate you sharing them.
Assuming the properties are separated by newline characters, you can create an object mapping each attribute to its value using:
var str = "Name: David Jones\nOffice Address: 148 Hulala Street\nDate: 24/11/2013\nAgent No: 1234,\nAddress: 259 Yolo Road\nType: Human\nProperties: None\nAge: 29";
var output = {};
str.split(/\n/).forEach(function(item){
// capture the label before the colon and the value after it
var match = item.match(/([A-Za-z\s]*):\s([A-Za-z0-9\s\/]*)/);
// skip lines that do not look like "Label: value"
if (match) output[match[1]] = match[2];
});
console.log(output)
This may help (a is assumed to hold the text):
> a.substring(a.indexOf("Name"), a.indexOf("Office Address")).split(":")
["Name", " David Jones "]
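The 'Office Address' versus 'Address' mix-up from the question can be avoided by building one regular expression from the label list with the longest labels first, so the longer label wins wherever both could match. The following is a rough sketch with a hypothetical function name, parseLabelled. It only recognises the known labels: unknown or misspelled labels such as 'Start Date' or 'Ago' are not detected and stay inside the neighbouring value, so the sample omits 'Start Date' and 'Due Date'.

```javascript
const fieldLabels = ['Name', 'Office Address', 'Date', 'Agent No', 'Address', 'Type', 'Properties', 'Age'];

function parseLabelled(text, labels) {
  // Longest labels first, so "Office Address" is tried before "Address".
  const alternation = labels
    .slice()
    .sort((a, b) => b.length - a.length)
    .map((label) => label.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')) // escape regex characters
    .join('|');
  const labelRe = new RegExp('\\b(' + alternation + '):', 'g');

  const result = {};
  labels.forEach((label) => { result[label] = ''; });

  text.split(/\r?\n/).forEach((line) => {
    const hits = Array.from(line.matchAll(labelRe));
    hits.forEach((hit, i) => {
      // A value runs from the end of its label to the next known label or the line end.
      const start = hit.index + hit[0].length;
      const end = i + 1 < hits.length ? hits[i + 1].index : line.length;
      const value = line.slice(start, end).trim().replace(/,$/, '');
      // Keep the first occurrence of each label.
      if (result[hit[1]] === '') result[hit[1]] = value;
    });
  });
  return result;
}

const labelledText = [
  'Name: David Jones',
  'Office Address: 148 Hulala Street Date: 24/11/2013',
  'Agent No: 1234,',
  'Address: 259 Yolo Road',
  'Type: Human Properties: None Ago: 29'
].join('\n');

const parsed = parseLabelled(labelledText, fieldLabels);
console.log(parsed);
```

With this input, Age stays empty because of the 'Ago' typo, and the unrecognised 'Ago: 29' remains part of the Properties value.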

Address String Split in javascript

Ok folks I have bombed around for a few days trying to find a good solution for this one.
What I have is two possible address formats.
28 Main St Somecity, NY 12345-6789
or
Main St Somecity, Ny 12345-6789
What I need to do is split both strings into an array structured as such
address[0] = HouseNumber
address[1] = Street
address[2] = City
address[3] = State
address[4] = ZipCode
My major problem is how to account for the lack of a house number without having the whole array shift the data up one.
address[0] = Street
address[1] = City
address[2] = State
address[3] = ZipCode
[Edit]
For those who are wondering, this is what I am doing at the moment (cleaner version).
place = response.Placemark[0];
point = new GLatLng(place.Point.coordinates[1],place.Point.coordinates[0]);
FCmap.setCenter(point,12);
var a = place.address.split(',');
var e = a[2].split(" ");
var x = a[0].split(" ");
var hn = x.filter(function(item,index){
return index == 0;
});
var st = x.filter(function(item,index){
return index != 0;
});
var street = '';
st.each(function(item,index){street += item + ' ';});
results[0] = new Hash({
FullAddie: place.address,
HouseNum: hn[0],
Dir: '',
Street: street,
City: a[1],
State: e[1],
ZipCode: e[2],
GPoint: new GMarker(point),
Lat: place.Point.coordinates[1],
Lng: place.Point.coordinates[0]
});
// End Address Splitting
Reverse the string, do the split and then reverse each item.
Update: From the snippet you posted, it seems to me that you get the address from a Google GClientGeocoder Placemark. If that is correct, why are you getting the unstructured address (Placemark.address) instead of the structured one (Placemark.AddressDetails)? This would make your life easier, as you would have to try and parse only the ThoroughfareName, which is the street level part of the address, instead of having to parse everything else as well.
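function get_address (addr_str) {
// groups: house number (optional), street, city, state, zip code
var m = /^(\d*)\s*([-\s\w.]+\s(?:St|Rd|Ave)\.?)\s+([-\s\w\.]+),\s*(\w+)\s+([-\d]+)$/i.exec(addr_str);
if (!m) return null; // the address did not match the expected format
var retval = m.slice(1);
// drop the empty house-number slot when there is no house number
if (!retval[0]) retval = retval.slice(1);
return retval;
}
This assumes every street name ends with St, Rd or Ave.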
var address = /[0-9]/.test(string.charAt(0)) ? string.split(" ") : [" "].concat(string.split(" "));
This is not particularly robust, but it accounts for the two enumerated cases and is concise at only one line.
I've got a similar problem I'm trying to solve. It seems that if you look for the first space to the right of the house number, you can separate the house number from the street name.
Here in Boston you can have a house number that includes a letter! In addition, I've seen house numbers that include "1/2". Luckily, the 1/2 is preceded by a hyphen, so there aren't any embedded spaces in the house number. I don't know if that's a standard or if I'm just getting lucky.
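If the goal is an array whose positions never shift, one option is to make the house number an optional capture group and fill its slot with an empty string when it is missing. This is a sketch under loud assumptions: splitAddress is a hypothetical name, the city is a single word directly before the comma, the state is two letters, and the ZIP code is five digits with an optional four-digit extension. House numbers with letters or fractions, as mentioned above, are not covered.

```javascript
function splitAddress(addressLine) {
  // Groups: optional house number, street, city (one word), state, ZIP code.
  const re = /^(?:(\d+)\s+)?(.+)\s+(\S+),\s*([A-Za-z]{2})\s+(\d{5}(?:-\d{4})?)$/;
  const m = re.exec(addressLine.trim());
  if (!m) return null; // the line does not match the assumed format
  // Always five slots: an empty string stands in for a missing house number.
  return [m[1] || '', m[2], m[3], m[4], m[5]];
}

const withNumber = splitAddress('28 Main St Somecity, NY 12345-6789');
const withoutNumber = splitAddress('Main St Somecity, Ny 12345-6789');

console.log(withNumber);    // [ '28', 'Main St', 'Somecity', 'NY', '12345-6789' ]
console.log(withoutNumber); // [ '', 'Main St', 'Somecity', 'Ny', '12345-6789' ]
```

Street, city, state and ZIP code then sit at indexes 1 to 4 in both cases.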
