I'm working on a scraper for The List as a JS project, and my regex-fu could be better than it is.
Given a data structure like
<a name="may_21"><b>Wed May 21</b></a>
<ul>
<li><b>Ace of Spades, Sacramento</b> Christina Perri, Birdy a/a $20 7pm **
...
</ul>
I've written the following to leverage cheerio to grab a date, venue, and list of bands:
request(url, (error, response, html)->
  if(!error)
    $ = cheerio.load(html)
    concert = { bands : {}, location : {venue: "", address : ""}, date: {date: "", time: ""}}
    calendar = {}
    dates = []
    #grab dates
    $('body > ul > li > a').each(->
      data = $(this)
      $dates = data.children().first()
      dates.push($dates.text())
    )
    #build concerts
    for date in dates
      $("a:contains('" + date + "')").siblings().each(->
        $venue = $(this).children().find("b")
        $bands = $venue.siblings("a")
        $time = $venue.parent()#.match()
      )
)
As you can see, I'm having trouble figuring out how to grab the time from the above structure.
Typically, that is going to be a bit of plain text at the end of a li that corresponds to a specific show, so that for something like
Bottom of the Hill, S.F. Matt Pond PA, Lighthouse And The Whaler, Kyle M. Terrizzi a/a $14/$16 8pm/9pm **
I would be looking to grab the "8pm/9pm" text out of
<li><b>Bottom of the Hill, S.F.</b> Matt Pond PA, Lighthouse And The Whaler, Kyle M. Terrizzi a/a $14/$16 8pm/9pm **
Sometimes it will be in the form of "8pm", sometimes "8pm/9pm", and sometimes it won't be there at all.
What's the best way to structure a regex to grab this data?
Don't run a regex over the full raw HTML (general advice).
Instead, load the HTML into a temporary container div (or a documentFragment, though that needs a few basic custom getter shims).
Then loop through the known structure, discarding everything you don't need (like anchors), and finally loop through the remaining containers to grab your final data with a much simpler regex, for example /(\d+[ap]m\/?){1,2}$/i.
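As a rough illustration of the "much simpler regex" step, here is a minimal sketch in plain JavaScript. The names extractShowTime and TIME_PATTERN are made up for this example, and the pattern is only an assumption about the formats quoted in the question ("8pm", "8pm/9pm", or nothing). It is deliberately not anchored to the end of the string, because the sample listings end in " **".

```javascript
// Hypothetical helper: pull a show time such as "8pm", "7:30pm" or "8pm/9pm"
// out of the plain text of one listing. Returns null when no time is present.
const TIME_PATTERN = /\b\d{1,2}(?::\d{2})?[ap]m(?:\/\d{1,2}(?::\d{2})?[ap]m)?/i;

function extractShowTime(listingText) {
  const match = listingText.match(TIME_PATTERN);
  return match ? match[0] : null;
}

// Sample listing texts taken from the question, plus one without a time.
const samples = [
  "Christina Perri, Birdy a/a $20 7pm **",
  "Matt Pond PA, Lighthouse And The Whaler, Kyle M. Terrizzi a/a $14/$16 8pm/9pm **",
  "Some Band a/a $10 **",
];

for (const text of samples) {
  console.log(extractShowTime(text));
}
```

Running the pattern against the text of each li, after the b and a children have been removed, keeps venue and band names out of the match.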
PS, a word from a scraper: you often only know your final routine once you have fully and successfully completed your scrape (just as you usually find lost stuff in the last place you look).
As Tomalak commented, pitfall number one is data that doesn't match what you anticipate. Research your expected data formats!
EDIT:
Extra advice: add as much error checking as you can. Try to turn every flaw you find during testing into a check. You need all the help you can get once you start scraping massive amounts of data.
Consider a chunking approach: if a check fails, you don't need to start over from the beginning of the data. Instead, add the extra check or fix and continue your scrape from where it stopped.
Otherwise, just testing and debugging your scraper might look like DoS traffic.
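One possible reading of the chunking advice, sketched in plain JavaScript: stop at the first item that fails a check, report where it failed, and resume from that index once the parser has been fixed. The function scrapeFrom and both parsers are hypothetical names for this illustration, not part of any library.

```javascript
// Hypothetical resumable loop: parse items starting at startIndex and stop at
// the first failure, reporting its index so a later run can resume there.
function scrapeFrom(items, startIndex, parse) {
  const parsed = [];
  for (let i = startIndex; i < items.length; i++) {
    try {
      parsed.push(parse(items[i]));
    } catch (err) {
      return { parsed, failedAt: i, reason: err.message };
    }
  }
  return { parsed, failedAt: -1, reason: null };
}

// First attempt: a strict parser that throws on anything unexpected.
function strictTime(text) {
  const match = text.match(/^\d+[ap]m(\/\d+[ap]m)?$/i);
  if (!match) throw new Error("unexpected time format: " + text);
  return match[0];
}

// After inspecting the failure, a more tolerant parser is used to continue.
function lenientTime(text) {
  const match = text.match(/^\d+[ap]m(\/\d+[ap]m)?$/i);
  return match ? match[0] : null;
}

const items = ["7pm", "8pm/9pm", "noon", "10pm"];
const firstRun = scrapeFrom(items, 0, strictTime);
const secondRun = scrapeFrom(items, firstRun.failedAt, lenientTime);
console.log(firstRun, secondRun);
```

The results already collected in the first run are kept, so the pages behind them do not have to be requested again.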
Got this working; here's the code that I ended up using:
fs = require('fs')
request = require('request')
cheerio = require('cheerio')
crypto = require("crypto")
url = 'http://www.foopee.com/punk/the-list/by-date.0.html'
getConcertItem = (text, regex)->
  return text.match(regex)?.toString().replace(/,/g, '').trim()
request(url, (error, response, html)->
  if(!error)
    $ = cheerio.load(html)
    #print(html)
    calendar = {}
    $dates = $('body > ul > li')
    #dates
    $dates.each(->
      date = $(this).find("a").first().text()
      $concerts = $(this).children("ul").children()
      $concerts.each( ->
        #todo: use the import-style ID generator
        ID = parseInt(crypto.randomBytes(4).toString('hex'), 16)
        concert = {bands : [], location : {venue: "", address : ""}, date: {date: "", time: ""}, cost: "", allAges: false}
        $venue = $(this).find("b")
        concert.location.venue = $venue.text()
        concertText = $venue.parent().clone().children().remove().end().text()
        timeRegex = /(\d?:?\d+[ap]m\/?\s?\w*\s?)/g
        concert.date.date = date
        concert.date.time = getConcertItem(concertText, timeRegex)
        costRegex = /(\$\d+[\/-]?)/g
        concert.cost = getConcertItem(concertText, costRegex)
        allAgesRegex = /(a\/a)/g
        if getConcertItem(concertText, allAgesRegex)
          concert.allAges = true
        $bands = $venue.siblings()
        bands = []
        $bands.each( ->
          band = $(this).text()
          bands.push(band)
        )
        concert.bands = bands
        calendar[ID] = concert
      )
    )
)
I am trying to get a message log from Azure Application Insights like this:
az monitor app-insights query --app [app id] --analytics-query [condition like specific message id]
Then I get a message like this:
"message": [
"Receiving message: {"type":"CTL","traceId":"f0d11b3dbf27b8fc57ac0e40c4ed9e48","spanId":"a5508acb0926fb1a","id":{"global":"GLkELDUjcRpP4srUt9yngY","caller":null,"local":"GLkELDUisjnGrSK5wKybht"},"eventVersion":"format version","timeStamp":"2021-10-01T14:55:59.8168722+07:00","eventMetadata":{"deleteTimeStamp":null,"ttlSeconds":null,"isFcra":null,"isDppa":true,"isCCPA":true,"globalProductId":null,"globalSubProductId":null,"mbsiProductId":null},"eventBody":{"sys":"otel","msg":"Testing Centralized Event Publisher with App1 (using logback)","app":{"name":"otel","service":"postHouse","status":"status name","method":"POST","protocol":"HTTP","resp_time_ms":"250","status_code":"4"},}}"
] }
So I would like to apply a regular expression to this message to extract only the part from {"type..... to "status_code":"4"},}} and convert it to JSON.
I have code like this in my .js file:
Then('extract json from {string}', function(message){
message = getVal(message, this);
const getmess = message.match(/{(.*)}/g);
const messJson = JSON.parse(getmess);
console.log(messJson);
})
But it doesn't work for me; I get:
SyntaxError: Unexpected token \ in JSON at position 1
How can I do this in my JavaScript code? Thank you so much for your help.
Try this, but keep in mind that the regex is tied to the output syntax shown above. If the wrapper structure of the output changes, this regex might stop working.
// Text from app
const STDOUT = `
"message": [ "Receiving message: {"type":"CTL","traceId":"f0d11b3dbf27b8fc57ac0e40c4ed9e48","spanId":"a5508acb0926fb1a","id":{"global":"GLkELDUjcRpP4srUt9yngY","caller":null,"local":"GLkELDUisjnGrSK5wKybht"},"eventVersion":"format version","timeStamp":"2021-10-01T14:55:59.8168722+07:00","eventMetadata":{"deleteTimeStamp":null,"ttlSeconds":null,"isFcra":null,"isDppa":true,"isCCPA":true,"globalProductId":null,"globalSubProductId":null,"mbsiProductId":null},"eventBody":{"sys":"otel","msg":"Testing Centralized Event Publisher with App1 (using logback)","app":{"name":"otel","service":"postHouse","status":"status name","method":"POST","protocol":"HTTP","resp_time_ms":"250","status_code":"4"},}}"
] }
`;
// Match JSON part string
let JSONstr = /.*\[\s*\"Receiving message:\s*(.*?)\s*\"\s*]\s*}\s*$/.exec(STDOUT)[1];
// Remove trailing comma(s)
JSONstr = JSONstr.replace(/^(.*\")([^\"]+)$/, (s, m1, m2) => `${m1}${m2.replace(/\,/, "")}`);
// Convert to object
const JSONobj = JSON.parse(JSONstr);
// Result
console.log(JSONobj);
Try this one:
/.*?({"type":.*?,"status_code":"\d+"\})/
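When used in JavaScript, the part covered by the parentheses is capture group 1. Note that for the sample message the capture ends at the brace that closes "app", so the two outer closing braces have to be appended before parsing:
const messJson = JSON.parse(message.match(/.*?({"type":.*?,"status_code":"\d+"\})/)[1] + "}}");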
Reference here: https://regexr.com/66mf2
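If the wrapper around the payload is likely to change, a less regex-dependent option is to scan for the matching closing brace. The sketch below is only an illustration: extractLoggedJson is a made-up name, the "Receiving message:" marker is taken from the sample output, and the trailing-comma cleanup is a blunt assumption that would also alter a string value containing ",}".

```javascript
// Hypothetical helper: find the first {...} block after a marker by counting
// braces (ignoring braces inside string literals), then parse it as JSON.
function extractLoggedJson(logText, marker = "Receiving message:") {
  const markerAt = logText.indexOf(marker);
  const start = logText.indexOf("{", markerAt === -1 ? 0 : markerAt);
  if (start === -1) return null;

  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < logText.length; i++) {
    const ch = logText[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
    } else if (ch === '"') {
      inString = true;
    } else if (ch === "{") {
      depth++;
    } else if (ch === "}") {
      depth--;
      if (depth === 0) {
        // Drop commas that sit directly before a closing brace or bracket,
        // such as the `},}}` at the end of the sample message.
        const candidate = logText.slice(start, i + 1).replace(/,\s*([}\]])/g, "$1");
        return JSON.parse(candidate); // throws if the text is still not valid JSON
      }
    }
  }
  return null; // braces never balanced
}

// Shortened version of the sample output from the question.
const sampleOutput =
  '"message": [ "Receiving message: {"type":"CTL","id":{"global":"GLkELDUjcRpP4srUt9yngY","caller":null},' +
  '"eventBody":{"sys":"otel","app":{"name":"otel","status_code":"4"},}}" ] }';

console.log(extractLoggedJson(sampleOutput));
```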
I am trying to replace numbers in an array, but I can't work out how to target only the one piece of data I actually need to change.
Here is an example to describe it more precisely.
Imagine my data array looks like this:
["data", "phone numbers", "address"]
I can change numbers with the following script, but my first problem is that it makes no distinction between the columns in which it finds a number, for example "phone numbers" versus "address" (I am not using the address at the moment, but if I included a ZIP code there it would really be a problem).
My second and current problem is that, within "phone numbers" itself, a number may appear several times, while I'd like to affect only the first block of the data, say to add or remove the country code (or even replace it with its country flag), which I normally have in the form "+1 0000000000" or "+54 0000000000".
So if a number is located in the EU, for example, the script becomes useless: Spain uses "+34" and France "+33", and it would never succeed because it recognizes only "+3" for both.
I've found someone else who faced this problem and seems to have solved it by wrapping the values in word boundaries, like "\b"constant"\b", but either I have the syntax wrong or it does not apply to my case. Others suggest using forEach or Array.prototype.every, which I failed to understand how to apply here.
If you have other ideas, I'm open to trying them!
function phoneUPDATES(val)
{
var i= 0;
var array3 = val.value.split("\n");
for ( i = 0; i < array3.length; ++i) {
array3[i] = "+" + array3[i];
}
var arrayLINES = array3.join("\n");
const zero = "0";
const replaceZERO = "0";
const one = "1";
const replaceONE = "1";
const result0 = arrayLINES.replaceAll(zero, replaceZERO);
const result1 = result0.replaceAll(one, replaceONE);
const result2 = result1.replaceAll(two, replaceTWO);
const result3 = result2.replaceAll(thre, replaceTHREE);
const result4 = result3.replaceAll(four, replaceFOUR);
const result5 = result4.replaceAll(five, replaceFIVE);
const result6 = result5.replaceAll(six, replaceSIX);
const result7 = result6.replaceAll(seven, replaceSEVEN);
const result8 = result7.replaceAll(eight, replaceEIGHT);
const result9 = result8.replaceAll(nine, replaceNINE);
const result10 = result9.replaceAll(ten, replaceTEN);
const result11 = result10.replaceAll(eleven, replaceELEVEN);
Why not use a regex replace? You could use something like /(\+\d+ )/g, which finds a + followed by one or more digits followed by a space, and then strip out the match:
const phoneNumbers = ["+1 0000000000", "+54 9876543210"]
console.log(phoneNumbers.map((num) => num.replaceAll(/(\+\d+ )/g, '')))
If you only need to target the second element in an array, I'd imagine your data looks like this:
const data = [["data", "+1 1234567890, +1 5555555555", "address"], ["data", "+11 111111111, +23 23232323", "address"]];
console.log(data.map((el) => {
  el[1] = el[1].replaceAll(/(\+\d+ )/g, '');
  return el;
}))
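To replace the country code rather than strip it, and to keep "+33" and "+34" apart, one option is to capture the whole run of digits after the leading + and look it up as a unit. This is a minimal sketch: COUNTRY_LABELS, relabelCountryCode and relabelPhoneColumn are hypothetical names, the label table is invented for the example, and each phone cell is assumed to start with a single "+code " prefix.

```javascript
// Hypothetical lookup table: dialling code -> replacement label.
const COUNTRY_LABELS = { "1": "US", "33": "FR", "34": "ES", "54": "AR" };

// Replace only the leading "+<digits> " block. The full code is captured, so
// "+34" can never be mistaken for "+3" or "+33".
function relabelCountryCode(phone) {
  return phone.replace(/^\+(\d+) /, (whole, code) =>
    Object.prototype.hasOwnProperty.call(COUNTRY_LABELS, code)
      ? COUNTRY_LABELS[code] + " "
      : whole // unknown code: leave the number untouched
  );
}

// Apply the replacement to one column only (index 1 = "phone numbers"),
// so digits in the address column are never touched.
function relabelPhoneColumn(rows, phoneIndex = 1) {
  return rows.map((row) =>
    row.map((cell, i) => (i === phoneIndex ? relabelCountryCode(cell) : cell))
  );
}

const rows = [
  ["data", "+34 600000000", "Calle 34, 33001"],
  ["data", "+33 123456789", "address"],
];
console.log(relabelPhoneColumn(rows));
```

Because the pattern is anchored with ^ and has no g flag, later occurrences of the same digits in the cell are left alone.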
OK, this is almost cheating: I hadn't thought of it before, and it doesn't actually solve the problem but just works around it.
If I call the replacements in decreasing order, the problem doesn't show up, because the replacements involving higher numbers are matched before the smaller ones.
But if someone can suggest a complete, proper solution, it is welcome.
I've been trying for hours to make the following Google Apps Script work. What it needs to do, is send emails (from an html-template) to anyone that:
has a complete Event Schedule (which is complete once they have been assigned to at least 4 events, counted in column Q);
has NOT been sent an email earlier (which is tracked in column R).
The script keeps track of errors in column S, i.e. if there's no email address provided.
It appears it only works:
if I comment out
data = data.filter(function(r){ return r[17] == true & r[16] > 3});
or if I comment out
ws.getRange("S3:S" + ws.getLastRow()).setValues(errors);
ws.getRange("R3:R" + ws.getLastRow()).setValues(mailSucces);
How can I get this script to work properly?
A copy of the Google Sheet I'm referring to is this one:
https://docs.google.com/spreadsheets/d/1sbOlvLVVfiQMWxNZmtCLuizci2cQB9Kfd8tYz64gjP0/edit?usp=sharing
This is my code so far:
function SendEmail(){
  var voornaam = 3;
  var achternaam = 4;
  var email = 5;
  var event1 = 9;
  var event2 = 10;
  var event3 = 11;
  var event4 = 12;
  var event5 = 13;
  var event6 = 14;
  var event7 = 15;
  var emailTemp = HtmlService.createTemplateFromFile("email");
  var ws = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Events Day 1");
  var datum = ws.getRange(1,3).getValue();
  var spreker = ws.getRange(1,6).getValue();
  var data = ws.getRange("A3:R" + ws.getLastRow()).getValues();
  data = data.filter(function(r){ return r[17] == false && r[16] > 3}); //Either this needs to be commented out...
  let errors = [];
  let mailSucces = [];
  data.forEach(function(row){
    try{
      emailTemp.voornaam = row[voornaam];
      emailTemp.email = row[email];
      emailTemp.datum = datum;
      emailTemp.spreker = spreker;
      emailTemp.event1 = row[event1];
      emailTemp.event2 = row[event2];
      emailTemp.event3 = row[event3];
      emailTemp.event4 = row[event4];
      emailTemp.event5 = row[event5];
      emailTemp.event6 = row[event6];
      emailTemp.event7 = row[event7];
      var htmlMessage = emailTemp.evaluate().getContent();
      GmailApp.sendEmail(
        row[email],
        "Here you go! Your personal schedule for the event of " + datum,
        "Your emailprogramm doesn't support html.",
        {
          name: "Event Organisation Team", htmlBody: htmlMessage, replyTo: "info@fakeemail.com"
        });
      errors.push([""]);
      mailSucces.push(["TRUE"]);
    }
    catch(err){
      errors.push(["Error: no message sent."]);
      mailSucces.push(["False"]);
    }
  }); //close forEach
  ws.getRange("S3:S" + ws.getLastRow()).setValues(errors); //or this and the next line need to be commented out.
  ws.getRange("R3:R" + ws.getLastRow()).setValues(mailSucces);
}
Edit: I have been trying and thinking and trying, but I still haven't found out how to make it work. I do now understand why it's not working; I just don't know how to fix it.
Let me elaborate on the problem a bit more:
The problem is that within the forEach loop I am working on a filtered variant of the data pulled from the spreadsheet with getValues. Therefore, writing data back with ws.getRange("R3:R" + ws.getLastRow()).setValues(mailSucces); results in mismatched checkmarks in the spreadsheet.
So somehow I need to keep track of the rows selected by the filter data = data.filter(function(r){ return r[17] == false & r[16] > 3}); in a variable, I guess?
Furthermore, I don't think it's wise to call setValue inside the loop because, from what I understand from my searching on the topic, that makes the script slow: every iteration makes an API call to write to the spreadsheet. Hence the errors.push and mailSucces.push, and my attempt to do a single setValues at the end, after the loop has finished.
Can someone help me solve this problem?
The problem is that the range you write to and the data you are writing have different sizes.
Try replacing:
ws.getRange("S3:S" + ws.getLastRow()).setValues(errors);
ws.getRange("R3:R" + ws.getLastRow()).setValues(mailSucces);
With:
ws.getRange(3, 19, errors.length, 1).setValues(errors);
ws.getRange(3, 18, mailSucces.length, 1).setValues(mailSucces);
You should use this variation of getRange
https://developers.google.com/apps-script/reference/spreadsheet/sheet#getrangerow,-column,-numrows,-numcolumns
Your data has a variable number of rows and a fixed number of columns (1). In the general case your data will be a matrix of X rows and Y columns, so you can make the range completely dynamic:
sheet.getRange(startRow, startColumn, data.length, data[0].length)
Just make sure data.length is > 0 before you do this, otherwise data[0].length will break.
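Sizing the range to the data avoids the size mismatch, but the filtered results are still written starting at row 3, which is the misalignment the question's edit describes. One possible way around that, sketched here as plain JavaScript so it can run outside Apps Script, is to skip the filter and build status columns that keep one entry per sheet row. The function buildStatusColumns and the sendFn callback are hypothetical; the column indexes 16 and 17 are the ones used in the question.

```javascript
// Hypothetical helper: rows is the 2D array returned by getValues() for
// "A3:R<last>", and sendFn(row) is whatever sends the mail (it may throw).
// Returns one status entry per original row, so the arrays line up with the sheet.
function buildStatusColumns(rows, sendFn) {
  const sent = rows.map((row) => [row[17] === true]);
  const errors = rows.map(() => [""]);

  rows.forEach((row, i) => {
    // Same condition as the filter in the question, using && for logical AND.
    if (row[17] === false && row[16] > 3) {
      try {
        sendFn(row);
        sent[i] = [true];
      } catch (err) {
        errors[i] = ["Error: " + err.message];
      }
    }
  });
  return { sent, errors };
}

// In Apps Script the two arrays could then be written back in one call each:
//   ws.getRange(3, 18, sent.length, 1).setValues(sent);
//   ws.getRange(3, 19, errors.length, 1).setValues(errors);
```

Rows that do not qualify keep their existing checkbox value, and each error lands next to the row that caused it.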
Edit:
I started writing a comment but it got too long. There are a couple of things that may go wrong when sending emails. The first thing I noticed is that you use & in the filter; in Apps Script, JavaScript and other C-like languages you should use && for logical AND. Now the email: you only detect a failure in the catch block, and at that point you don't know why the code broke; it could be anything. With GmailApp I recommend using createDraft while developing and replacing it with sendEmail for the final version once everything works. Both functions take exactly the same parameters, thank you Google devs ;-).
To find the exact problem, capture the error on failure and log it. err.stack should tell you pretty much everything:
catch(err){
Logger.log(err.stack); // Added
errors.push(["Error: no message sent."]);
mailSucces.push(["False"]);
}
Run the SendEmail function from the code editor and you should see a log entry for each pass through catch(err).
Test File
Sometimes my lists of emails include duplicate addresses for the same person. For example, Jane's addresses are both "jane.doe@email.com" and "doe.jane@email.com". Her variants also include replacing the "." with "-" or "_". At the moment, my duplicates script (kindly upgraded by @Jordan Running and Ed Nelson) takes care of 'strict' duplicates, yet cannot detect that "doe.jane@email.com" is a 'complicated' duplicate of "jane.doe@email.com". Is there a way to delete even these duplicates so that I do not email more than one of Jane's addresses? All of them point to the same inbox, so I need only include one of her addresses.
Here is my current code:
function removeDuplicates() {
  const startTime = new Date();
  const newData = [];
  const sheet = SpreadsheetApp.getActiveSheet();
  const data = sheet.getDataRange().getValues();
  const numRows = data.length;
  const seen = {};
  for (var i = 0, row, key; i < numRows && (row = data[i]); i++) {
    key = JSON.stringify(row);
    if (key in seen) {
      continue;
    }
    seen[key] = true;
    newData.push(row);
  }
  sheet.clearContents();
  sheet.getRange(1, 1, newData.length, newData[0].length).setValues(newData);
  // Show summary
  const secs = (new Date() - startTime) / 1000;
  SpreadsheetApp.getActiveSpreadsheet().toast(
    Utilities.formatString('Processed %d rows in %.2f seconds (%.1f rows/sec); %d deleted',
      numRows, secs, numRows / secs, numRows - newData.length),
    'Remove duplicates', -1);
}
Sample File
Fuzzy match test
Notes:
used without the @email.com part, because it distorts the result
use the custom function: =removeDuplicatesFuzzy(B2:B12,0.66)
0.66 is the fuzzy-match threshold (66% similarity).
the right column of the result (Column D) shows the matches found with > 0.66 similarity. A dash (-) means no matches were found ("unique" values)
Background
You may try this library:
https://github.com/Glench/fuzzyset.js
To install it, copy the code from here.
The usage is simple:
function similar_test(string1, string2)
{
  string1 = string1 || 'jane.doe@email.com';
  string2 = string2 || 'doe.jane@email.com';
  var a = FuzzySet();
  a.add(string1);
  var result = a.get(string2);
  Logger.log(result); // [[0.6666666666666667, jane.doe@email.com]]
  return result[0][0]; // 0.6666666666666667
}
There's also more info here: https://glench.github.io/fuzzyset.js/
Notes:
please search for more info; look for "javascript fuzzy string match". Here's a related question: "Javascript fuzzy search that makes sense". Note: the solution should work in Google Sheets (no ES6).
this algorithm is not as smart as a human; it compares strings character by character. A similar string such as don.jeans@email.com will be 84% similar to doe.jane@email.com, but a human sees that it is a completely different person.
Search for my Google Sheets add-on called Flookup. It should do what you want.
For your case, you can use this function:
ULIST(colArray, [threshold])
The parameter details are:
colArray: the column from which unique values are to be returned.
threshold: the minimum percentage similarity between the colArray values that are not unique.
Or you can simply use the Highlight duplicates or Remove duplicates from the add-on menu.
The key feature is that you can adjust the level of strictness by changing the percentage similarity.
Bonus: It will easily catch swaps like jane.doe@email.com / doe.jane@email.com
You can find out more at the official website.
I was appointed the task of making a process in which a PowerShell script needs to make a call to Canvas servers in order to get data out of it for other uses that are outside the scope of this question.
The first thing I did was research how the Canvas API actually works. I eventually found this post, which holds everything I think I need to know about the API. The API requires an HMAC SHA-256 hash.
I decided to reverse engineer the writer's hash-generating code in order to write the same script in PowerShell.
Here is my slightly edited code (Node.js):
var crypto = require('crypto')
var url = require('url')
var HMAC_ALG = 'sha256'
var apiAuth = module.exports = {
  buildMessage: function(secret, timestamp, uri) {
    var urlInfo = url.parse(uri, false);
    var query = urlInfo.query ? urlInfo.query.split('&').sort().join('&') : '';
    var parts = [ 'GET', urlInfo.host, '', '', urlInfo.pathname, query, timestamp, secret ]
    console.log(parts);
    return parts.join('\n');
  },
  buildHmacSig: function(secret, timestamp, reqOpts, message) {
    //var message = apiAuth.buildMessage(secret, timestamp, reqOpts);
    var hmac = crypto.createHmac(HMAC_ALG, new Buffer(secret))
    hmac.update(message)
    console.log(message);
    return hmac.digest('base64')
  }
}
Here are the parameters that I put in the node js application
var canvas = require('[filepath]/new_canvas');
var secret = 'mycrazysecret';
var today = new Date();
var timestamp= today.toUTCString();
var regOpts = 'mycrazymessage';
var message = canvas.buildMessage(secret, timestamp, regOpts)
var hmac = canvas.buildHmacSig(secret, timestamp, regOpts,message);
The final hash is this:
'Oexq8/ulAGxSIQXGDVqoXyqk5x+n9cMrc3avcTW9aZk='
Here is my PowerShell file:
function buffer
{
    param ($string)
    $c=@()
    Foreach ($element in $string.toCharArray()) {$c+= [System.Convert]::ToSByte($element)}
    return $c
}
$message = 'GET\n\n\n\nmycrazymessage\n\nFri, 18 Nov 2016 15:29:52 GMT\nmycrazysecret'
$secret = 'mycrazysecret'
$hmacsha = New-Object System.Security.Cryptography.HMACSHA256
$hmacsha.key = buffer -string $secret
$signature = $hmacsha.ComputeHash([Text.Encoding]::UTF8.GetBytes($message))
$signature = [Convert]::ToBase64String($signature)
echo $signature
The final result is 'pF92zam81wclnnb8csDsscsSYNQ7it9qLrcJkRTi5rM='
I do not know whether getting them to produce the same result is even possible, but my question is: why are they producing different results? (The keys are the same as well.)
In PowerShell, the escape character is the backtick ` rather than the backslash \.
For the parser to recognize the escape sequence, rather than a literal backtick followed by the letter n, use an expandable string (" instead of '):
$message = "GET`n`n`n`nmycrazymessage`n`nFri, 18 Nov 2016 15:29:52 GMT`nmycrazysecret"
Other than that, your HMAC signature procedure is correct (it correctly outputs Oexq8/ulAGxSIQXGDVqoXyqk5x+n9cMrc3avcTW9aZk= after changing the $message value)
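The effect can be reproduced on the Node.js side as well. The sketch below signs the message once with real newlines and once with the two-character sequence backslash + n that the single-quoted PowerShell string actually contained; the helper name sign is made up for the example. It only shows that the two inputs differ and therefore hash differently; it does not assert the specific digests quoted above.

```javascript
const crypto = require("crypto");

// Hypothetical helper mirroring buildHmacSig: HMAC-SHA256, base64 digest.
function sign(secret, message) {
  return crypto
    .createHmac("sha256", Buffer.from(secret))
    .update(message)
    .digest("base64");
}

const secret = "mycrazysecret";

// What Node.js hashed: parts joined with real newline characters.
const withNewlines =
  "GET\n\n\n\nmycrazymessage\n\nFri, 18 Nov 2016 15:29:52 GMT\nmycrazysecret";

// What the single-quoted PowerShell string held: a literal backslash and "n".
const withLiteralBackslashN = withNewlines.replace(/\n/g, "\\n");

console.log(withNewlines.length, withLiteralBackslashN.length);
console.log(sign(secret, withNewlines));
console.log(sign(secret, withLiteralBackslashN));
```

Each of the seven separators is one character in the first string and two in the second, so the HMAC inputs are different byte sequences even though the source text looks the same.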