UWP Webview get links from webpage into an array - javascript

I can only find outdated answers from before UWP on Windows 10. I know how to do it the old way, but that approach is giving me problems.
What I have so far is below. The problem seems to lie in the VB code, where the getElementsByTagName call does not return anything, although I've been told it should. If I change the script to return the inner HTML instead, the html variable is populated with the full page. So it is only the links themselves that I can't get.
Any help is appreciated! Thanks!
XAML
<Page
x:Class="webviewMessingAround.MainPage"
xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
xmlns:local="using:webviewMessingAround"
xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
mc:Ignorable="d">
<Grid Background="{ThemeResource ApplicationPageBackgroundThemeBrush}">
<Grid Background="{ThemeResource ApplicationPageBackgroundThemeBrush}">
<WebView x:Name="webview" Source="http://regsho.finra.org/regsho-December.html" DOMContentLoaded="WebView_DOMContentLoaded" />
<Button x:Name="button" HorizontalAlignment="Left" Margin="145,549,0,0" VerticalAlignment="Top">
<Button x:Name="button1" Click="button_Click" Content="Button" Height="58" Width="141"/>
</Button>
</Grid>
</Grid>
</Page>
VB Code
Private Async Sub webview_DOMContentLoaded(sender As WebView, args As WebViewDOMContentLoadedEventArgs) Handles webview.DOMContentLoaded
Dim html = Await webview.InvokeScriptAsync("eval", ({"document.getElementsByTagName('a');"}))
'Debug.WriteLine(html)
End Sub

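InvokeScriptAsync can only return the string result of the script invocation. document.getElementsByTagName('a') evaluates to an HTMLCollection, not a string, so it cannot be passed back as-is.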
Return value
When this method returns, the string result of the script invocation.
So if you want to get the links from a web page, you need to put all of them into a single string and return that. A C# example:
string html = await webview.InvokeScriptAsync("eval", new string[] { "[].map.call(document.getElementsByTagName('a'), function(node){ return node.href; }).join('||');" });
System.Diagnostics.Debug.WriteLine(html);
Here I use
[].map.call(document.getElementsByTagName('a'), function(node){ return node.href; }).join('||');
to put all the links into one string, separated by '||'. You may need to adapt this JavaScript to your own needs.
After this you can split the string into an array like:
var links = html.Split(new[] { "||" }, StringSplitOptions.RemoveEmptyEntries);
I used C# for the example, but the VB code should be similar.
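As a variation on the script above, the links could also be returned as a JSON string instead of a '||'-joined one, which avoids trouble if a URL ever contains the separator. This is only a sketch: collectLinks is a hypothetical helper name, the sample array stands in for document.getElementsByTagName('a'), and the host application would then need a JSON parser instead of Split.

```javascript
// Hypothetical helper: turn an array-like of anchor elements into a JSON string.
// InvokeScriptAsync can only hand back a string, so the array is serialized.
function collectLinks(anchors) {
  var hrefs = [].map.call(anchors, function (node) { return node.href; });
  // Drop anchors that have no href (for example, named anchors).
  hrefs = hrefs.filter(function (href) { return !!href; });
  return JSON.stringify(hrefs);
}

// Inside the WebView the argument would be document.getElementsByTagName('a').
// Plain objects are used here so the sketch runs outside a browser.
var sample = [
  { href: 'http://example.invalid/a' },
  { href: '' },
  { href: 'http://example.invalid/b' }
];
console.log(collectLinks(sample));
```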

Related

How can jQuery append() with backquote (``) miss/won't render a string variable?

I'm building several carousels on a webpage with jQuery, pulling all the information I need from YouTube with the YouTube Data API v3.
The design and the functions are done, but I'm struggling with one simple thing that I cannot understand.
I use append(``) with a template literal so that I can append all the HTML I need to the target element, inserting other information through variables in the ${var} notation.
Everything works fine EXCEPT for a single string variable, preview. It is not recognized as a variable, and in the final output it is rendered as a literal chunk of text.
Now some code.
This is the preparation for calling the function that loads everything:
jQuery(document).ready(function () {
var apikey = 'my-api-key';
var URL = 'https://www.googleapis.com/youtube/v3/playlistItems';
var playlists = {
1: 'PL549CFEF61BF98279',
2: 'PLX_IxBH-yGtonzSE2zyplhI2oky7FWvbE',
3: 'PL038B3F56D598DD61',
4: 'PLDDFDDD10E5584056',
5: 'PLD4F65416EB11640F',
}
loadVids(apikey, URL, playlists);
});
Next, loadVids calls getJSON() for every YouTube playlist and retrieves the data:
function loadVids(apikey, URL, playlists) {
for (const menuid in playlists) {
var options = { part: 'snippet', key: apikey, maxResults: 20, playlistId: playlists[menuid] }
jQuery.getJSON(URL, options, function (data) {
resultsLoop(data, menuid, apikey);
});
}
}
Then resultsLoop uses each() to put all the information inside some HTML that is appended to the webpage (I stripped all the original attributes to keep it readable).
function resultsLoop(data, menuid) {
jQuery.each(data.items, function () {
var alttext = this.snippet.title;
var title = alttext.substring(0, 57) + '…'
var vid = this.snippet.resourceId.videoId;
var preview = this.snippet.thumbnails.standard.url;
jQuery("#carousel-" + menuid + " ul")
.append(`
<li>
<article>
<div>
<a href="//www.youtube.com/watch?v=${vid}&fs=1&autoplay=0&rel=0">
<img alt="${alttext}" src="${preview}">
</a>
</div>
<div>
<h4>${title}</h4>
</div>
</article>
</li>
`);
});
}
At the end of it the <img> tag is
<img alt="some text" src="/$%7Bpreview%7D">
What I tried and checked:
changing the name of the variable
console logging it before and after append(), without issues
typeof says it's a normal string
the result is the same in every browser
I really don't understand what I'm doing wrong. Only preview doesn't work; all the other variables in the append() work properly.
Why not use string concatenation, as you already did for jQuery("#carousel-" + menuid + " ul")?
Example (please try this code for the append and check the result; I used single quotes rather than backquotes because backquotes are not accepted by the JS validation):
jQuery("#carousel-" + menuid + " ul").append('<li><article><div><img alt="'+alttext+'" src="'+preview+'"></div><div><h4>'+title+'</h4></div></article></li>');
Also remove all the whitespace from the appended string. I hope this is what you are looking for.
Just to let you know, all of the above was running on a Joomla page.
Taking all the code except the jQuery(document).ready(function(){...loadVids()...}) call and putting it in a .js file resolved everything.
I think there is some filter that won't let you inject external resources like https://i.ytimg.com/vi/lmuUD9_eDnY/sddefault.jpg into the page with inline JavaScript alone (and that's clever), but the filter doesn't apply if you include a .js file hosted on the website itself.
A mediocre workaround for mediocre JavaScript code. Thanks to Rory, whose comments gave me some insight.
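The rendered src="/$%7Bpreview%7D" suggests that the ${preview} placeholder was altered before the browser ever evaluated the template literal, which fits the filter theory above. If the script has to stay inline, a sketch along the lines of the concatenation answer avoids ${...} entirely. The names buildItem and escapeHtml are hypothetical, and escaping the values is an addition not present in the original code.

```javascript
// Escape characters that are special inside HTML text and attribute values.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: build one carousel <li> by plain concatenation,
// so the source contains no ${...} placeholders for a page filter to touch.
function buildItem(vid, alttext, title, preview) {
  return '<li><article><div>' +
    '<a href="//www.youtube.com/watch?v=' + encodeURIComponent(vid) +
    '&amp;fs=1&amp;autoplay=0&amp;rel=0">' +
    '<img alt="' + escapeHtml(alttext) + '" src="' + escapeHtml(preview) + '">' +
    '</a></div><div><h4>' + escapeHtml(title) + '</h4></div></article></li>';
}

console.log(buildItem('lmuUD9_eDnY', 'some text', 'some text…',
  'https://i.ytimg.com/vi/lmuUD9_eDnY/sddefault.jpg'));
```

Inside resultsLoop the call would then be jQuery("#carousel-" + menuid + " ul").append(buildItem(vid, alttext, title, preview)).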

How can I escape a string to ensure that it is a valid string literal in JS source?

I have a Qt application that embeds a web browser (QWebEngineView). I would like to call a javascript function with a string argument from the C++ application. The means of doing this is calling
page()->runJavaScript("setContent(\"hello\");");
This works in simple cases. However, if I try and load, say, a C++ source file and use that as the parameter of setContent, this will break, because I can't simply assemble the string like this:
auto js = QString("setContent(\"%1\");").arg(fileStr);
I tried the following:
fileStr = fileStr.replace('"', "\\\"");
fileStr = fileStr.replace("\n", "\\n");
But apparently this does not escape the string properly, because I get an error when I run the resulting JavaScript. How can I universally escape a long string with newlines and possible special characters so that I can construct a valid JS fragment like this?
So, after some research, I came across QWebChannel which is meant for bi-directional communication between the application and the hosted webpage. The imported qwebchannel.js in the examples can be found here. From there, this is what I did:
In C++:
auto channel = new QWebChannel(this);
page()->setWebChannel(channel);
channel->registerObject("doc", Doc);
In HTML/JS:
new QWebChannel(qt.webChannelTransport,
function(channel) {
var doc = channel.objects.doc; // this is "doc" from the registerObject call
editor.setValue(doc.text);
doc.textChanged.connect(updateText); // textChanged is a signal of the class of doc.
}
);
So, even though this does not directly answer the question, what is presented here can be used to achieve the same effect.
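For the literal question of escaping, one common approach is to JSON-encode the text: a JSON string is a double-quoted literal that JavaScript also accepts, with backslashes, quotes, and control characters already escaped. The replace() attempt in the question does not escape backslashes, for example, so a source file containing "\n" inside a string literal would still break. The sketch below shows the idea in JavaScript only; toJsStringLiteral is a hypothetical name, and the C++ side would need an equivalent JSON encoder, which is not shown here.

```javascript
// Hypothetical helper: turn arbitrary text into a JavaScript string literal.
function toJsStringLiteral(text) {
  return JSON.stringify(String(text))
    // U+2028/U+2029 are legal in JSON but were line terminators in older JS engines.
    .replace(/\u2028/g, '\\u2028')
    .replace(/\u2029/g, '\\u2029');
}

// Example input with quotes, a backslash, and newlines, like a C++ source file.
var fileStr = '#include <cstdio>\nint main() {\n  printf("a\\n");\n}\n';
var js = 'setContent(' + toJsStringLiteral(fileStr) + ');';

// Evaluate the generated fragment with a stand-in setContent to check the round trip.
var received;
new Function('setContent', js)(function (value) { received = value; });
console.log(received === fileStr);
```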

JavaScript Exception in HtmlUnit when clicking at google result page

I want to use HtmlUnit (v2.21) to get some search result pages from Google. This requires me to click on the "people also looked for" link when searching for a person (right side, see example link), which triggers some JavaScript and changes the content of the current page. But this gives me a JavaScript WrappedException (see below).
Clickable example link: https://www.google.de/search?ie=UTF-8&safe=off&q=nicki+minaj
Simple TestCase with errors:
String url = "https://www.google.de/search?ie=UTF-8&safe=off&q=nicki+minaj";
WebClient client = new WebClient(BrowserVersion.BEST_SUPPORTED);
HtmlPage page = client.getPage(url);
HtmlElement link = page.getFirstByXPath("//a[@class='_Zjg']");
HtmlPage newPage = link.click(); //throws exception
this.storeResultFile(newPage.asXml(), "test");
client.close();
Result:
net.sourceforge.htmlunit.corejs.javascript.WrappedException: Wrapped java.lang.NullPointerException
at net.sourceforge.htmlunit.corejs.javascript.Context.throwAsScriptRuntimeEx(Context.java:2053)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.doProcessPostponedActions(JavaScriptEngine.java:947)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.processPostponedActions(JavaScriptEngine.java:1012)
at com.gargoylesoftware.htmlunit.html.DomElement.click(DomElement.java:799)
at com.gargoylesoftware.htmlunit.html.DomElement.click(DomElement.java:742)
at com.gargoylesoftware.htmlunit.html.DomElement.click(DomElement.java:689)
I stored the xml of the "page" object and made sure that the XPath expression is valid and has results.
Anybody got any ideas?
It looks like the JavaScript engine (based on Rhino) is very easy to upset and gives up on some script issues where real browsers are still able to run the script.
I don't know if there is a mistake in Google's scripts, but these two lines solved it for me:
JavaScriptEngine engine = client.getJavaScriptEngine();
engine.holdPosponedActions();
Nevertheless, when running multiple HtmlUnit objects in multiple threads it is still possible to come across this error. This is more a workaround than a solution.

awesomium web scraping certain parts

I asked this earlier but I wanted to rephrase the question. I am trying to make a scraper for my project. I would like to have it display a certain part of a link. The only part of the link that changes is the number. This number is what I would like to scrape. The link looks like this:
<a href="/link/player.jsp?user=966354" target="_parent" "="">
As mentioned, I am trying to scrape only the 966354 part of the link. I have tried several ways to do this but can't figure it out. When I add
<a href="/link/player.jsp?user="
to the code below it breaks
List<string> player = new List<string>();
string html = webControl2.ExecuteJavascriptWithResult("document.getElementsByTagName('a')[0].innerHTML");
MatchCollection m1 = Regex.Matches(html, "<a href=\\s*(.+?)\\s*</a>", RegexOptions.Singleline);
foreach (Match m in m1)
{
string players = m.Groups[1].Value;
player.Add(players);
}
listBox.DataSource = player;
So I removed it. It then shows no errors until I run the program, at which point I get this error:
"An unhandled exception of type 'System.InvalidOperationException' occurred in Awesomium.Windows.Forms.dll"
So I tried this, and it somewhat works:
string html = webControl2.ExecuteJavascriptWithResult("document.getElementsByTagName('html')[0].innerHTML");
This code scrapes, but not the way I would like. Could someone lend a helping hand, please?
I would use HtmlAgilityPack (install it via NuGet) and XPath queries to parse HTML.
Something like this:
string html = webControl2.ExecuteJavascriptWithResult("document.getElementsByTagName('html')[0].innerHTML");
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(html);
var playerIds = new List<string>();
var playerNodes = htmlDoc.DocumentNode.SelectNodes("//a[contains(@href, '/link/profile-view.jsp?user=')]");
if (playerNodes != null)
{
foreach (var playerNode in playerNodes)
{
string href = playerNode.Attributes["href"].Value;
var parts = href.Split(new char[] { '=' }, StringSplitOptions.RemoveEmptyEntries);
if (parts.Length > 1)
{
playerIds.Add(parts[1]);
}
}
id.DataSource = playerIds;
}
You may also find these two simple helper classes useful: https://gist.github.com/AlexP11223/8286153
The first one contains extension methods for WebView/WebControl, and the second one has some static methods that generate JS code for retrieving elements (JSObject) by XPath and for getting the coordinates of a JSObject.
Using a sample html file such as below, I was unable to duplicate the exception.
<html>
test
</html>
However, the javascript
document.getElementsByTagName('a')[0].innerHTML
will return "test" in my example. What you probably want is
document.getElementsByTagName('a')[0].href
which will return the href portion.
The 'innerHTML' property will return everything between the start and end tags (such as <html> </html>). This is probably the reason you have better success when getting the 'html' element - you end up parsing the entire <a> </a> link.
FYI, as a test you can use your browser to test out the javascript output.
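Building on the href suggestion, the number could also be extracted on the JavaScript side before anything is returned to the host. A minimal sketch, assuming the links look like the one in the question; extractUserId and collectUserIds are hypothetical names, and how the result is passed back through ExecuteJavascriptWithResult is left out.

```javascript
// Hypothetical helper: pull the digits after "user=" out of an href.
function extractUserId(href) {
  var match = /[?&]user=(\d+)/.exec(href);
  return match ? match[1] : null; // null when the href has no user parameter
}

// Hypothetical helper: collect the ids from an array-like of anchor elements.
function collectUserIds(anchors) {
  var ids = [];
  for (var i = 0; i < anchors.length; i++) {
    var id = extractUserId(anchors[i].href || '');
    if (id !== null) {
      ids.push(id);
    }
  }
  return ids;
}

// In the page the argument would be document.getElementsByTagName('a').
console.log(collectUserIds([
  { href: '/link/player.jsp?user=966354' },
  { href: '/link/other.jsp' }
]));
```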

Get Element available under Dev Tools -> Resources -> Frames

I'm trying to do this with a Tampermonkey script, but I'm open to other approaches.
What I want to do is extract some data (data-video) from a specific <div>. However, this data is not present in the HTML code of the page; it is only visible in Dev Tools -> Resources, under Frames.
Does anyone know whether it's possible to get the information shown in DevTools, and how I can do that?
A comparison between the two pages can be found here: "Original HTML PAGE" and "HTML PAGE under DevTools"
In the first hyperlink the id=video-canvas element cannot be seen; in its place there is the <object type="application/x-shockwave-flash(...)
As you state in your question, the data you're looking for is available in DevTools under the "Resources" tab in the "Frames" folder. What you are looking at there is the source HTML, similar to View Source.
The code you want is what is getting replaced. It appears the site is using the JW Player plugin, which replaces the <div id="video-canvas"> with the appropriate HTML for the detected device / browser to play the video. All of the browsers on my Mac are forced to use Flash, even when it's disabled. On my iPhone, which can't play Flash, inspecting the page shows that it uses JW's own custom video element. It appears that it must be storing the file location in memory, since it is not in the generated markup.
I am able to work through the console in the dev tools and access their JS class. It appears I can call jwplayer._tracker, which has an object b. Object b has an object AlWv3iHmEeOzwBIxOUCPzg. This key seems to be the same each time I check across different browsers; if it is not, you can use the for loop in my first example below, trimmed down to .b, to get the correct value. Inside that object is e, and in e there is a key http://i.n.jwpltx.com/v1.... a really long string that appears to contain a URL, so it will need to be parsed.
So, to get that string, I ran
for ( var loc in jwplayer._tracker.b.AlWv3iHmEeOzwBIxOUCPzg.e){
loc
}
If we put that in a function that parses the string and returns a value:
function getSubURL(){
var initURL;
for ( var loc in jwplayer._tracker.b.AlWv3iHmEeOzwBIxOUCPzg.e){
initURL = loc;
}
//look for 'mp4:' this is in front of the file path
var start = initURL.indexOf("mp4%3A");
//look for the .mp4 for the end of the file name
var stop = initURL.indexOf(".mp4");
//grab the string between
//start+6 to remove characters used to find it
//and stop+4 to include characters used to find it
var subPath = (initURL.substring((start+6),(stop+4))).split("%2F").join("/");
return subPath;
}
//and run it
getSubURL();
it will return ciencia/astronomia/fimsol.mp4
You can run this from your console. I am unaware of how you can use it in Tampermonkey, but I think it gets you a lot closer to what you wanted.
This is the approach I used to solve my problem. I couldn't grab the code I wanted from Dev Tools, but I found a way to get the data from jwplayer with the function getPlaylistItem. This is how I get the URL filename of each video:
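The parsing step in getSubURL can also be sketched with decodeURIComponent instead of replacing %2F by hand. This is an assumption-laden variant: extractMp4Path is a hypothetical name, and the sample URL below is made up for illustration because the real tracker string is not shown in full.

```javascript
// Hypothetical helper: extract the path that follows "mp4:" in a percent-encoded URL.
function extractMp4Path(encodedUrl) {
  var decoded;
  try {
    decoded = decodeURIComponent(encodedUrl); // turns %3A into ":" and %2F into "/"
  } catch (e) {
    return null; // malformed percent-encoding
  }
  var start = decoded.indexOf('mp4:');
  // Search for the extension only after the marker, so an earlier ".mp4" is ignored.
  var stop = decoded.indexOf('.mp4', start);
  if (start === -1 || stop === -1) {
    return null; // marker or extension not found
  }
  // Skip the 4 characters of "mp4:" and keep the 4 characters of ".mp4".
  return decoded.substring(start + 4, stop + 4);
}

// Made-up sample string; the real value comes from jwplayer._tracker.
var sampleUrl = 'http://example.invalid/v1?f=mp4%3Aciencia%2Fastronomia%2Ffimsol.mp4&x=1';
console.log(extractMp4Path(sampleUrl));
```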
function getFilename(filename) {
var filename;
if(jwplayer().getPlaylistItem){
filename = jwplayer().getPlaylistItem()['file'];
}
else{
return filename;
}
filename = filename.substring(filename.indexOf("/mp4:") + 5);
return filename;
}
