I write this code to check if on my website is a text and if its not then it should send me a notification thru slack. When I run it on VSC it crushes after some time like maybe 15 min or something like that.
I want to make it nice to put it on a server and run it remotly but need to be sure that will not crash every so often. I want to use it to check some websites for changing information on them and if they will change or be gone then send me notification. Best bit it works but crashes and don't know why :(
Can someone maybe help or pinpoint what it can be the problem? It will be better that this tool can just see text instead of class but I don't know how to do that.
//Puppeteer library
const pt= require('puppeteer')
const axios = require('axios')
process.setMaxListeners(0);
async function getText(){
//launch browser in headless mode
const browser = await pt.launch()
//browser new page
const page = await browser.newPage()
//launch URL
await page.setDefaultNavigationTimeout(0);
//website
await page.goto('https://mieciusio.pl/kontakt.html')
//identify element
if (await page.$("[class='p-style btn-resize-mode label-bloc-2-style label-1-style']"))
console.log("found")
else //console.log("not found")
axios.post(' https://hooks.slack.com/services/MYUniqeID', {text: 'Its changed'})
}
setInterval(getText, 12000)
Try to find it online on YT but it's hard I was looking in a lot of tut's but can't find right one to work on finding text on website or not to crush because I don't know why crashes.
Im trying to set the binary path of a binary of chrome with selenium, in javascript language.
unfortunately, my knowledge in javascript is limited, and Im getting an error while trying to do so, in which I cannot solve, despite my efforts.
so without further ado, I will now share my problem, with the hope that someone with a better knowledge in javascript then me, will help me
some background:
Im triggering a function in the firebase could functions,
inside this function , I'm trying to create a selenium webdriver.
in order to do so:
I need to do those things:
chromedriver --> that work on a linux system(located inside the functions project folder)✅
chrome browser binary that is located on this machine ✅
3.then, I need to create a a chrome Options object.
a. adding an Argument so it will be headless.✅
b. setting it with a path to the chrome binary.❌
and at last, create a chrome driver with options, that I have created
currently, I'm at stage 3.b
the error that rise coming from my poor knowledge in javascript
this is the error :
TypeError: Assignment to constant variable.
here what's lead to this error
this is my code :
exports.initializedChromeDriver = functions.https.onRequest((request, response) => {
async function start_chrome_driver() {
try {
functions.logger.info('Hello logs!', {structuredData: true});
console.log("did enter the function")
const google_site = "https://www.gooogle.com";
const { WebDriver } = require('selenium-webdriver');
const {Builder, By} = require('selenium-webdriver');
console.log("will try to initialzed chrome");
let chrome = require('selenium-webdriver/chrome');
console.log("did initialzed chrome");
var chrome_options = new chrome.Options()
console.log("will try to set the chrome binary Path");
functions.logger.info('new chrome.Options()', {structuredData: true});
chrome_options = chrome_options.setChromeBinaryPath(path="/usr/bin/google-chrome");// <------- THIS IS THE LINE THAT RISE THE ERROR!
console.log("did setChromeBinaryPath");
chrome_options.addArguments("--headless");
let buillder = new Builder().forBrowser('chrome');
functions.logger.info(' did new Builder().forBrowser(chrome)', {structuredData: true});
const google_site = 'https://wwww.google.com'
await driver.get(google_site);
functions.logger.info('driver did open google site', {structuredData: true});
return "succ loading google"
}
catch (err) {
console.log('did catch')
console.error(err);
return "error loading google";
}
}
const p = start_chrome_driver().then((value,reject) => {
const dic = {};
dic['status'] = 200;
dic['data'] = {"message": value};
response.send(dic);
});
and here's the error that follows this code in the firebase functions logs:
I tried to change the chrome_options object into var/let, and looking for answers in the web , but after deploying again, and again, and again.. I feel like its time to get another perspective, any help will do.
You have an unnecessary assignment here (path=...)
chrome_options = chrome_options.setChromeBinaryPath(path="/usr/bin/google-chrome");
Just remove the assignment to path
chrome_options = chrome_options.setChromeBinaryPath("/usr/bin/google-chrome");
I'm currently building an app for web automation using AngleSharp. I've managed to log in to a website, but can't find the element I'm looking for using context.Active.QuerySelectorAll.
I understand that this is likely because some of the JavaScript hasn't run in the HTML I'm searching through, as per this link: Is the HTML shown via 'View Source' different from the HTML shown in (Firebug) developer tools?
How do I force AngleSharp to execute all of the JavaScript before I look for the specific element?
code:
var config = AngleSharp.Configuration.Default.WithDefaultLoader().WithCookies().WithJavaScript().WithCss();
var browsingContext = BrowsingContext.New(config);
await browsingContext.OpenAsync("https://users.premierleague.com/");
await browsingContext.Active.QuerySelector<IHtmlFormElement>("form[action='/accounts/login/']").SubmitAsync(new
{
login = "abc#gmail.com",
password = "password"
});
await browsingContext.OpenAsync("https://fantasy.premierleague.com/a/team/my/");
Everything works fine up until this point, and I can confirm that I am logged in. However, I then can't seem to get a value returned for the following:
var x = browsingContext.Active.QuerySelectorAll("*").Where(m => m.ClassName == "ismjs-link ism-link ism-link--more");
And I know that this element exists, as I've checked numerous times through the "Inspect" functionality available on Google Chrome.
What am I missing/ how do I get the JavaScript to run?
Thanks!
I am wondering how would I go abouts in detecting search crawlers? The reason I ask is because I want to suppress certain JavaScript calls if the user agent is a bot.
I have found an example of how to to detect a certain browser, but am unable to find examples of how to detect a search crawler:
/MSIE (\d+\.\d+);/.test(navigator.userAgent); //test for MSIE x.x
Example of search crawlers I want to block:
Google
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Googlebot/2.1 (+http://www.googlebot.com/bot.html)
Googlebot/2.1 (+http://www.google.com/bot.html)
Baidu
Baiduspider+(+http://www.baidu.com/search/spider_jp.html)
Baiduspider+(+http://www.baidu.com/search/spider.htm)
BaiDuSpider
This is the regex the ruby UA agent_orange library uses to test if a userAgent looks to be a bot. You can narrow it down for specific bots by referencing the bot userAgent list here:
/bot|crawler|spider|crawling/i
For example you have some object, util.browser, you can store what type of device a user is on:
util.browser = {
bot: /bot|googlebot|crawler|spider|robot|crawling/i.test(navigator.userAgent),
mobile: ...,
desktop: ...
}
Try this. It's based on the crawlers list on available on https://github.com/monperrus/crawler-user-agents
var botPattern = "(googlebot\/|bot|Googlebot-Mobile|Googlebot-Image|Google favicon|Mediapartners-Google|bingbot|slurp|java|wget|curl|Commons-HttpClient|Python-urllib|libwww|httpunit|nutch|phpcrawl|msnbot|jyxobot|FAST-WebCrawler|FAST Enterprise Crawler|biglotron|teoma|convera|seekbot|gigablast|exabot|ngbot|ia_archiver|GingerCrawler|webmon |httrack|webcrawler|grub.org|UsineNouvelleCrawler|antibot|netresearchserver|speedy|fluffy|bibnum.bnf|findlink|msrbot|panscient|yacybot|AISearchBot|IOI|ips-agent|tagoobot|MJ12bot|dotbot|woriobot|yanga|buzzbot|mlbot|yandexbot|purebot|Linguee Bot|Voyager|CyberPatrol|voilabot|baiduspider|citeseerxbot|spbot|twengabot|postrank|turnitinbot|scribdbot|page2rss|sitebot|linkdex|Adidxbot|blekkobot|ezooms|dotbot|Mail.RU_Bot|discobot|heritrix|findthatfile|europarchive.org|NerdByNature.Bot|sistrix crawler|ahrefsbot|Aboundex|domaincrawler|wbsearchbot|summify|ccbot|edisterbot|seznambot|ec2linkfinder|gslfbot|aihitbot|intelium_bot|facebookexternalhit|yeti|RetrevoPageAnalyzer|lb-spider|sogou|lssbot|careerbot|wotbox|wocbot|ichiro|DuckDuckBot|lssrocketcrawler|drupact|webcompanycrawler|acoonbot|openindexspider|gnam gnam spider|web-archive-net.com.bot|backlinkcrawler|coccoc|integromedb|content crawler spider|toplistbot|seokicks-robot|it2media-domain-crawler|ip-web-crawler.com|siteexplorer.info|elisabot|proximic|changedetection|blexbot|arabot|WeSEE:Search|niki-bot|CrystalSemanticsBot|rogerbot|360Spider|psbot|InterfaxScanBot|Lipperhey SEO Service|CC Metadata Scaper|g00g1e.net|GrapeshotCrawler|urlappendbot|brainobot|fr-crawler|binlar|SimpleCrawler|Livelapbot|Twitterbot|cXensebot|smtbot|bnf.fr_bot|A6-Indexer|ADmantX|Facebot|Twitterbot|OrangeBot|memorybot|AdvBot|MegaIndex|SemanticScholarBot|ltx71|nerdybot|xovibot|BUbiNG|Qwantify|archive.org_bot|Applebot|TweetmemeBot|crawler4j|findxbot|SemrushBot|yoozBot|lipperhey|y!j-asr|Domain Re-Animator Bot|AddThis)";
var re = new RegExp(botPattern, 'i');
var userAgent = navigator.userAgent;
if (re.test(userAgent)) {
console.log('the user agent is a crawler!');
}
The following regex will match the biggest search engines according to this post.
/bot|google|baidu|bing|msn|teoma|slurp|yandex/i
.test(navigator.userAgent)
The matches search engines are:
Baidu
Bingbot/MSN
DuckDuckGo (duckduckbot)
Google
Teoma
Yahoo!
Yandex
Additionally, I've added bot as a catchall for smaller crawlers/bots.
This might help to detect the robots user agents while also keeping things more organized:
Javascript
const detectRobot = (userAgent) => {
const robots = new RegExp([
/bot/,/spider/,/crawl/, // GENERAL TERMS
/APIs-Google/,/AdsBot/,/Googlebot/, // GOOGLE ROBOTS
/mediapartners/,/Google Favicon/,
/FeedFetcher/,/Google-Read-Aloud/,
/DuplexWeb-Google/,/googleweblight/,
/bing/,/yandex/,/baidu/,/duckduck/,/yahoo/, // OTHER ENGINES
/ecosia/,/ia_archiver/,
/facebook/,/instagram/,/pinterest/,/reddit/, // SOCIAL MEDIA
/slack/,/twitter/,/whatsapp/,/youtube/,
/semrush/, // OTHER
].map((r) => r.source).join("|"),"i"); // BUILD REGEXP + "i" FLAG
return robots.test(userAgent);
};
Typescript
const detectRobot = (userAgent: string): boolean => {
const robots = new RegExp(([
/bot/,/spider/,/crawl/, // GENERAL TERMS
/APIs-Google/,/AdsBot/,/Googlebot/, // GOOGLE ROBOTS
/mediapartners/,/Google Favicon/,
/FeedFetcher/,/Google-Read-Aloud/,
/DuplexWeb-Google/,/googleweblight/,
/bing/,/yandex/,/baidu/,/duckduck/,/yahoo/, // OTHER ENGINES
/ecosia/,/ia_archiver/,
/facebook/,/instagram/,/pinterest/,/reddit/, // SOCIAL MEDIA
/slack/,/twitter/,/whatsapp/,/youtube/,
/semrush/, // OTHER
] as RegExp[]).map((r) => r.source).join("|"),"i"); // BUILD REGEXP + "i" FLAG
return robots.test(userAgent);
};
Use on server:
const userAgent = req.get('user-agent');
const isRobot = detectRobot(userAgent);
Use on "client" / some phantom browser a bot might be using:
const userAgent = navigator.userAgent;
const isRobot = detectRobot(userAgent);
Overview of Google crawlers:
https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers
isTrusted property could help you.
The isTrusted read-only property of the Event interface is a Boolean
that is true when the event was generated by a user action, and false
when the event was created or modified by a script or dispatched via
EventTarget.dispatchEvent().
eg:
isCrawler() {
return event.isTrusted;
}
⚠ Note that IE isn't compatible.
Read more from doc: https://developer.mozilla.org/en-US/docs/Web/API/Event/isTrusted
People might light to check out the new navigator.webdriver property, which allows bots to inform you that they are bots:
https://developer.mozilla.org/en-US/docs/Web/API/Navigator/webdriver
The webdriver read-only property of the navigator interface indicates whether the user agent is controlled by automation.
It defines a standard way for co-operating user agents to inform the document that it is controlled by WebDriver, for example, so that alternate code paths can be triggered during automation.
It is supported by all major browsers and respected by major browser automation software like Puppeteer. Users of automation software can of course disable it, and so it should only be used to detect "good" bots.
I combined some of the above and removed some redundancy. I use this in .htaccess on a semi-private site:
(google|bot|crawl|spider|slurp|baidu|bing|msn|teoma|yandex|java|wget|curl|Commons-HttpClient|Python-urllib|libwww|httpunit|nutch|biglotron|convera|gigablast|archive|webmon|httrack|grub|netresearchserver|speedy|fluffy|bibnum|findlink|panscient|IOI|ips-agent|yanga|Voyager|CyberPatrol|postrank|page2rss|linkdex|ezooms|heritrix|findthatfile|Aboundex|summify|ec2linkfinder|facebook|slack|instagram|pinterest|reddit|twitter|whatsapp|yeti|RetrevoPageAnalyzer|sogou|wotbox|ichiro|drupact|coccoc|integromedb|siteexplorer|proximic|changedetection|WeSEE|scrape|scaper|g00g1e|binlar|indexer|MegaIndex|ltx71|BUbiNG|Qwantify|lipperhey|y!j-asr|AddThis)
The "test for MSIE x.x" example is just code for testing the userAgent against a Regular Expression. In your example the Regexp is the
/MSIE (\d+\.\d+);/
part. Just replace it with your own Regexp you want to test the user agent against. It would be something like
/Google|Baidu|Baiduspider/.test(navigator.userAgent)
where the vertical bar is the "or" operator to match the user agent against all of your mentioned robots. For more information about Regular Expression you can refer to this site since javascript uses perl-style RegExp.
I found this isbot package that has the built-in isbot() function. It seams to me that the package is properly maintained and that they keep everything up-to-date.
USAGE:
const isBot = require('isbot');
...
isBot(req.get('user-agent'));
Package: https://www.npmjs.com/package/isbot
I'm using the Firefox Addon SDK to build something that monitors and displays the HTTP traffic in the browser. Similar to HTTPFox or Live HTTP Headers. I am interested in identifying which tab in the browser (if any) generated the request
Using the observer-service I am monitoring for "http-on-examine-response" events. I have code like the following to identify the nsIDomWindow that generated the request:
const observer = require("observer-service"),
{Ci} = require("chrome");
function getTabFromChannel(channel) {
try {
var noteCB= channel.notificationCallbacks ? channel.notificationCallbacks : channel.loadGroup.notificationCallbacks;
if (!noteCB) { return null; }
var domWin = noteCB.getInterface(Ci.nsIDOMWindow);
return domWin.top;
} catch (e) {
dump(e + "\n");
return null;
}
}
function logHTTPTraffic(sub, data) {
sub.QueryInterface(Ci.nsIHttpChannel);
var ab = getTabFromChannel(sub);
console.log(tab);
}
observer.add("http-on-examine-response", logHTTPTraffic);
Mostly cribbed from the documentation for how to identify the browser that generated the request. Some is also taken from the Google PageSpeed Firefox addon.
Is there a recommended or preferred way to go from the nsIDOMWindow object domWin to a tab element in the SDK tabs module?
I've considered something hacky like scanning the tabs list for one with a URL that matches the URL for domWin, but then I have to worry about multiple tabs having the same URL.
You have to keep using the internal packages. From what I can tell, getTabForWindow() function in api-utils/lib/tabs/tab.js package does exactly what you want. Untested code:
var tabsLib = require("sdk/tabs/tab.js");
return tabsLib.getTabForWindow(domWin.top);
The API has changed since this was originally asked/answered...
It should now (as of 1.15) be:
return require("sdk/tabs/utils").getTabForWindow(domWin.top);
As of Addon SDK version 1.13 change:
var tabsLib = require("tabs/tab.js");
to
var tabsLib = require("sdk/tabs/helpers.js");
If anyone still cares about this:
Although the Addon SDK is being deprecated in support of the newer WebExtensions API, I want to point out that
var a_tab = require("sdk/tabs/utils").getTabForContentWindow(window)
returns a different 'tab' object than the one you would typically get by using
worker.tab in a PageMod.
For example, a_tab will not have the 'id' attribute, but would have linkedPanel property that's similar to the 'id' attribute.