I asked this earlier but I wanted to rephrase the question. I am trying to make a scraper for my project. I would like to have it display a certain part of a link. The only part of the link that changes is the number. This number is what I would like to scrape. The link looks like this:
<a href="/link/player.jsp?user=966354" target="_parent" "="">
As mentioned, I am trying to scrape only the 966354 part of the link. I have tried several ways to do this but can't figure it out. When I add
<a href="/link/player.jsp?user="
to the code below, it breaks:
List<string> player = new List<string>();
string html = webControl2.ExecuteJavascriptWithResult("document.getElementsByTagName('a')[0].innerHTML");
MatchCollection m1 = Regex.Matches(html, "<a href=\\s*(.+?)\\s*</a>", RegexOptions.Singleline);
foreach (Match m in m1)
{
    string players = m.Groups[1].Value;
    player.Add(players);
}
listBox.DataSource = player;
So I removed it; it shows no errors until I run the program, at which point I get this error:
"An unhandled exception of type 'System.InvalidOperationException' occurred in Awesomium.Windows.Forms.dll"
So I tried this and it somewhat works:
string html = webControl2.ExecuteJavascriptWithResult("document.getElementsByTagName('html')[0].innerHTML");
This code scrapes, but not in the way I would like. Could someone lend a helping hand, please?
I would use HtmlAgilityPack (install it via NuGet) and XPath queries to parse HTML.
Something like this:
string html = webControl2.ExecuteJavascriptWithResult("document.getElementsByTagName('html')[0].innerHTML");
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(html);
var playerIds = new List<string>();
var playerNodes = htmlDoc.DocumentNode.SelectNodes("//a[contains(@href, '/link/player.jsp?user=')]");
if (playerNodes != null)
{
    foreach (var playerNode in playerNodes)
    {
        string href = playerNode.Attributes["href"].Value;
        var parts = href.Split(new char[] { '=' }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length > 1)
        {
            playerIds.Add(parts[1]);
        }
    }

    id.DataSource = playerIds;
}
Also you may find these two simple helper classes useful: https://gist.github.com/AlexP11223/8286153
The first one contains extension methods for WebView/WebControl, and the second one has some static methods to generate JS code for retrieving elements (JSObject) by XPath and getting the coordinates of a JSObject.
Using a sample HTML file such as the one below, I was unable to duplicate the exception.
<html>
<a href="/link/player.jsp?user=966354" target="_parent">test</a>
</html>
However, the javascript
document.getElementsByTagName('a')[0].innerHTML
will return "test" in my example. What you probably want is
document.getElementsByTagName('a')[0].href
which will return the href portion.
The 'innerHTML' property returns everything between the start and end tags (such as <html> and </html>). This is probably why you have better success when getting the 'html' element: you end up parsing the entire <a>...</a> link.
FYI, as a test, you can use your browser's console to try out the JavaScript and check its output.
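From there, extracting the numeric ID is straightforward. A minimal sketch building on the question's own code (assuming, as the question's snippet does, that the Awesomium result converts to a string):
string href = webControl2.ExecuteJavascriptWithResult(
    "document.getElementsByTagName('a')[0].href").ToString();
// Pull out the digits that follow "user=", e.g. 966354
Match idMatch = Regex.Match(href, @"user=(\d+)");
if (idMatch.Success)
{
    string playerId = idMatch.Groups[1].Value;
}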
Related
I'm building several carousels on a webpage with jQuery by calling all the information I need from YouTube with the Youtube Data API v3.
After doing the designing and the functions I'm struggling with one simple thing that I cannot understand.
I use append(``) so that I can append all the HTML that I need to the element that I want, also inserting other information via variables in the ${var} notation.
Everything works fine EXCEPT for a single string variable, preview. It's as if it is not recognized as a variable, and in the final output it is rendered as a literal string chunk.
Now some code.
This is the preparation for calling the function that loads everything:
jQuery(document).ready(function () {
    var apikey = 'my-api-key';
    var URL = 'https://www.googleapis.com/youtube/v3/playlistItems';
    var playlists = {
        1: 'PL549CFEF61BF98279',
        2: 'PLX_IxBH-yGtonzSE2zyplhI2oky7FWvbE',
        3: 'PL038B3F56D598DD61',
        4: 'PLDDFDDD10E5584056',
        5: 'PLD4F65416EB11640F',
    };
    loadVids(apikey, URL, playlists);
});
Next, loadVids calls getJSON() for every YouTube playlist and retrieves the data:
function loadVids(apikey, URL, playlists) {
    for (const menuid in playlists) {
        var options = { part: 'snippet', key: apikey, maxResults: 20, playlistId: playlists[menuid] };
        jQuery.getJSON(URL, options, function (data) {
            resultsLoop(data, menuid, apikey);
        });
    }
}
Then resultsLoop, using each(), puts all the information inside some HTML to be appended somewhere in the webpage (I stripped all the original attributes to keep it readable).
function resultsLoop(data, menuid) {
    jQuery.each(data.items, function () {
        var alttext = this.snippet.title;
        var title = alttext.substring(0, 57) + '…';
        var vid = this.snippet.resourceId.videoId;
        var preview = this.snippet.thumbnails.standard.url;
        jQuery("#carousel-" + menuid + " ul")
            .append(`
                <li>
                    <article>
                        <div>
                            <a href="//www.youtube.com/watch?v=${vid}&fs=1&autoplay=0&rel=0">
                                <img alt="${alttext}" src="${preview}">
                            </a>
                        </div>
                        <div>
                            <h4>${title}</h4>
                        </div>
                    </article>
                </li>
            `);
    });
}
At the end of it, the <img> tag is
<img alt="some text" src="/$%7Bpreview%7D">
I tried the following:
changing the name of the variable
console-logging it before and after append(), without issues
typeof says it's a normal string
it gives me the same result in every browser
I really don't understand what I'm doing wrong; only preview fails, while all the other variables in the append() work properly.
Why are you not using concatenation, as you already did for jQuery("#carousel-" + menuid + " ul")?
Example (please use this code for the append and check; I have used single quotes rather than backquotes, since backquotes were not accepted by my JS validation):
jQuery("#carousel-" + menuid + " ul").append('<li><article><div><img alt="'+alttext+'" src="'+preview+'"></div><div><h4>'+title+'</h4></div></article></li>');
Also remove all whitespace from the append string. I hope this is what you are looking for.
Just to let you know, all the above was working on a Joomla page.
Taking all the code apart from the jQuery(document).ready(function(){...loadVids()...}) call and putting it in a .js file resolved everything.
I think there is some filter that won't let you inject external resources like https://i.ytimg.com/vi/lmuUD9_eDnY/sddefault.jpg in the page with javascript alone (and that's clever), but the filter doesn't apply if you include a .js file within the website itself.
A mediocre workaround for a mediocre javascript code. Thanks to Rory in the comments that gave me some insight.
Preferably I would like to do this in the browser with JavaScript. I am already able to unzip the .docx file and read the XML files, but I can't seem to find a way to get a page count. I am hoping the property exists in the XML files and I just need to find it.
Edit: I wouldn't say it is a duplicate of "Is there a way to count doc, docx, pdf pages with only js (without Node.js)?" My question is specific to Word doc/docx files, and that question was never resolved.
Found a way to do this with docx4js. Here is a small sample parsing a file from an input element:
import docx4js from 'docx4js';

docx4js.load(file).then(doc => {
    const propsAppRaw = doc.parts['docProps/app.xml']._data.getContent();
    const propsApp = new TextDecoder('utf-8').decode(propsAppRaw);
    const match = propsApp.match(/<Pages>(\d+)<\/Pages>/);
    if (match && match[1]) {
        const count = Number(match[1]);
        console.log(count);
    }
});
In theory, the following property can return that information from the Word Open XML file, using the Open XML SDK:
int pageCount = int.Parse(document.ExtendedFilePropertiesPart.Properties.Pages.Text);
In practice, however, this isn't reliable. It might work, but then again, it might not; it all depends on 1) what Word managed to save in the file before it was closed and 2) what kind of editing may have been done on the closed file.
The only sure way to get a page number or a page count is to open the document in the Word application interface. The page count is calculated dynamically by Word during editing. When a document is closed, this information is static and not necessarily what it will be when the document is opened or printed.
See also https://github.com/OfficeDev/Open-XML-SDK/issues/22 for confirmation.
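If you do read the property, guard against it being absent. A defensive sketch (my assumption: a WordprocessingDocument named document, as in the snippet above):
// Pages may be missing or stale, so avoid a hard cast.
int? pageCount = null;
var pages = document.ExtendedFilePropertiesPart?.Properties?.Pages;
if (pages != null && int.TryParse(pages.Text, out var n))
{
    pageCount = n; // the value Word last saved, not necessarily current
}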
When you say "do this in the browser" I assume that you have a running webserver with LAMP or the equivalent. In PHP, there is a pretty useful option for .docx files. An example php function would be:
function number_pages_docx($filename)
{
    // A .docx file is a ZIP archive; the page count Word last saved
    // lives in docProps/app.xml.
    $docx = new ZipArchive();
    if ($docx->open($filename) === true)
    {
        if (($index = $docx->locateName('docProps/app.xml')) !== false)
        {
            $data = $docx->getFromIndex($index);
            $docx->close();
            $xml = new SimpleXMLElement($data);
            return $xml->Pages;
        }
        $docx->close();
    }
    return false;
}
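Hypothetical usage (my own example, assuming an example.docx sits next to the script):
$pages = number_pages_docx('example.docx');
echo ($pages !== false) ? "Pages: $pages" : "No page count found";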
I have a problem inserting HTML into a document using JavaScript.
Code that tries to insert the HTML into the document:
function loadTaskPage(taskId) {
    fetch("https://localhost:44321/api/Tasks/1")
        .then(function (response) {
            return response.text();
        }).then(function (data) {
            document.body.insertAdjacentHTML('beforeend', data);
        }).catch(function (error) {
            alert(error);
        });
}
I took this code from a tutorial; the tutorial's source code can be found at this link: https://github.com/bstavroulakis/progressive-web-apps/blob/master/car-deals/js/carService.js
If I open the link https://localhost:44321/api/Tasks/1 in a browser, I receive a normally styled web page, but when I try to insert it into the document, the HTML code gets escaped and nothing is displayed.
The inserted HTML looks like:
<div id="\"myModal\"" class="\"modal" fade\"="">...
The code above is a Bootstrap modal copied from code examples. As you can see, \" sequences appeared that escape the quotes.
I receive the HTML response from my ASP.NET Web API with the header Content-Type: text/html.
How should I insert this HTML code into the document using JavaScript?
How to insert html to document using javascript?
You can find that answer here:
You can use
document.getElementById("parentID").appendChild(/*..your content created using DOM methods..*/)
or
document.getElementById("parentID").innerHTML+= "new content"
As mentioned in the comments, this didn't seem to work and left the elements without style. This is because the escaping in the string being added to the innerHTML is off: there are too many ".
In the provided HTML example <div id="\"myModal\"" class="\"modal" fade\"="">... each attribute is surrounded by "\" ... \"" which means that if you were to look at the string of the attribute's value it would look something like '" ... "', which is what is causing the styles to not be added.
If you remove the extra " the HTML should be appended as expected:
<div id=\"myModal\" class=\"modal fade\">...
See this example showing what happens with the different escaping:
document.getElementById("foo").innerHTML += '<div class="\"bar\"">Hello World!</div>'; // Escaped with too many `"`
document.getElementById("foo").innerHTML += '<div class=\"bar\">Hello World!</div>'; // Properly escaped
.bar {
color: red;
}
<div id="foo">
</div>
The insertAdjacentHTML() method of the Element interface parses the specified text as HTML or XML and inserts the resulting nodes into the DOM tree at a specified position. It does not reparse the element it is being used on, and thus it does not corrupt the existing elements inside that element. This avoids the extra step of serialization, making it much faster than direct innerHTML manipulation.
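For instance, a minimal sketch (the container element here is hypothetical):
var container = document.getElementById('container'); // hypothetical element
// Parses the string and inserts the nodes without re-serializing existing children:
container.insertAdjacentHTML('afterbegin', '<p>inserted first</p>');
container.insertAdjacentHTML('beforeend', '<p>appended last</p>');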
OK, it looks like the problem was in the API service. For some reason, debug mode showed me the correct HTML being returned to the user. After a few changes to the API code, everything works as it should.
If someone is interested in how to return a view as a string from ASP.NET Web API so that it can be added to the HTML, all you need is to add a reference to RazorEngine and use the following code:
var response = new HttpResponseMessage(HttpStatusCode.OK);
var viewPath = HttpContext.Current.Server.MapPath(@"~/Views/Tasks/TaskDetails.cshtml");
var template = File.ReadAllText(viewPath);

var key = new NameOnlyTemplateKey("TaskDetails", ResolveType.Global, null);
if (!Engine.Razor.IsTemplateCached(key, null))
    Engine.Razor.AddTemplate(key, new LoadedTemplateSource(template));

StringBuilder sb = new StringBuilder();
StringWriter sw = new StringWriter(sb);
Engine.Razor.RunCompile(key, sw, null, model);

response.Content = new StringContent(sb.ToString());
response.Content.Headers.ContentType = new MediaTypeHeaderValue("text/html");
return response;
P.S. The code is not completely correct; it requires some optimization.
I want to use HtmlUnit (v2.21) to get some search result pages from Google. This requires me to click on the "people also looked for" link when searching for a person (right side, see example link), which triggers some JavaScript and changes the content of the current page. But this gives me a WrappedException from the JavaScript engine (see below).
Clickable example link: https://www.google.de/search?ie=UTF-8&safe=off&q=nicki+minaj
Simple TestCase with errors:
String url = "https://www.google.de/search?ie=UTF-8&safe=off&q=nicki+minaj";
WebClient client = new WebClient(BrowserVersion.BEST_SUPPORTED);
HtmlPage page = client.getPage(url);
HtmlElement link = page.getFirstByXPath("//a[@class='_Zjg']");
HtmlPage newPage = link.click(); //throws exception
this.storeResultFile(newPage.asXml(), "test");
client.close();
Result:
net.sourceforge.htmlunit.corejs.javascript.WrappedException: Wrapped java.lang.NullPointerException
at net.sourceforge.htmlunit.corejs.javascript.Context.throwAsScriptRuntimeEx(Context.java:2053)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.doProcessPostponedActions(JavaScriptEngine.java:947)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.processPostponedActions(JavaScriptEngine.java:1012)
at com.gargoylesoftware.htmlunit.html.DomElement.click(DomElement.java:799)
at com.gargoylesoftware.htmlunit.html.DomElement.click(DomElement.java:742)
at com.gargoylesoftware.htmlunit.html.DomElement.click(DomElement.java:689)
I stored the xml of the "page" object and made sure that the XPath expression is valid and has results.
Anybody got any ideas?
Looks like the JavaScript engine (based on Rhino) is very easy to upset and quits on some script issues where other browsers are still able to run the script.
I don't know if there is a mistake in Google's scripts, but these two lines solved it for me:
JavaScriptEngine engine = client.getJavaScriptEngine();
engine.holdPosponedActions();
Nevertheless, when running multiple HtmlUnit objects in multiple threads, it is still possible to run into this error. This is more a workaround than a solution.
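For context, here is the question's snippet with the workaround applied (a sketch; holdPosponedActions is the method's actual spelling in HtmlUnit's API):
WebClient client = new WebClient(BrowserVersion.BEST_SUPPORTED);
client.getJavaScriptEngine().holdPosponedActions(); // suppress the postponed JS actions that trigger the NPE
HtmlPage page = client.getPage(url);
HtmlElement link = page.getFirstByXPath("//a[@class='_Zjg']");
HtmlPage newPage = link.click(); // should no longer throw the WrappedException
client.close();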
var page = UrlFetchApp.fetch(contestURL);
var doc = XmlService.parse(page);
The above code gives a parse error when used; however, if I replace the XmlService class with the deprecated Xml class, with the lenient flag set, it parses the HTML properly.
var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);
The problem is mostly caused by the lack of CDATA in the JavaScript part of the HTML, and the parser complains with the following error:
The entity name must immediately follow the '&' in the entity reference.
Even if I remove all the <script>(.*?)</script> blocks using regex, it still complains because the <br> tags aren't closed.
Is there a clean way of parsing HTML into a DOM tree?
I ran into this exact same problem. I was able to circumvent it by first using the deprecated Xml.parse, since it still works, then selecting the body XmlElement, and then passing its XML string into the new XmlService.parse method:
var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);
var bodyHtml = doc.html.body.toXmlString();
doc = XmlService.parse(bodyHtml);
var root = doc.getRootElement();
Note: This solution may not work if the old Xml.parse is completely removed from Google Scripts.
In 2021, the best way to parse HTML on the .gs side that I know of is...
Click + next to Library
Enter 1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0
Click "Look up"
Click Add
Sample usage:
const contentText = UrlFetchApp.fetch('https://www.somesite.com/').getContentText();
const $ = Cheerio.load(contentText);
$('.some-class').first().text();
That's it -- this is probably the closest we'll get to doing jQuery-like DOM selection in GAS. The .first() is important or else you may extract more content than you expected (think of it as using querySelector() instead of querySelectorAll()).
Credit where credit is due: https://github.com/tani/cheeriogs
As of May 2020, you can now use the Cheerio library for Google Apps Script to do this.
Returns the content of Wikipedia's Main Page
const content = getContent_('https://en.wikipedia.org');
const $ = Cheerio.load(content);
Logger.log($('#mp-right').text());
Returns the content of the first paragraph <p> of Wikipedia's Main Page
const content = getContent_('https://en.wikipedia.org');
const $ = Cheerio.load(content);
Logger.log($('p').first().text());
To add to your project:
Select Resources - Libraries... in the Google Apps Script editor. Enter the project key 1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0 in the Add a library field, and click "Add". Select the highest version number, and click "Save".
I found that the best way to parse HTML in Google Apps Script is to avoid using XmlService.parse or Xml.parse; XmlService.parse doesn't work well with bad HTML code from certain websites.
Here is a basic example of how you can parse any website easily without using XmlService.parse or Xml.parse. In this example, I retrieve a list of presidents from "wikipedia.org/wiki/President_of_the_United_States" with plain JavaScript document.getElementsByTagName() and paste the values into my Google Spreadsheet.
1- Create a new Google Sheet;
2- Click the menu Tools > Script editor... to open a new tab with the code editor window and copy the following code into your Code.gs:
function onOpen() {
    var ui = SpreadsheetApp.getUi();
    ui.createMenu("Parse Menu")
        .addItem("Parse", "parserMenuItem")
        .addToUi();
}

function parserMenuItem() {
    var sideBar = HtmlService.createHtmlOutputFromFile("test");
    SpreadsheetApp.getUi().showSidebar(sideBar);
}

function getUrlData(url) {
    var doc = UrlFetchApp.fetch(url).getContentText();
    return doc;
}

function writeToSpreadSheet(data) {
    var ss = SpreadsheetApp.getActiveSpreadsheet();
    var sheet = ss.getSheets()[0];
    var row = 1;
    for (var i = 0; i < data.length; i++) {
        var x = data[i];
        var range = sheet.getRange(row, 1);
        range.setValue(x);
        row = row + 1;
    }
}
3- Add an HTML file to your Apps Script project. Open the Script Editor, choose File > New > Html File, and name it 'test'. Then copy the following code into your test.html:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
    <input id="mButon" type="button" value="Click here to get list" onclick="parse()">
    <div hidden id="mOutput"></div>
</body>
<script>
    window.onload = onOpen;

    function onOpen() {
        var url = "https://en.wikipedia.org/wiki/President_of_the_United_States";
        google.script.run.withSuccessHandler(writeHtmlOutput).getUrlData(url);
        document.getElementById("mButon").style.visibility = "visible";
    }

    function writeHtmlOutput(x) {
        document.getElementById('mOutput').innerHTML = x;
    }

    function parse() {
        var list = document.getElementsByTagName("area");
        var data = [];
        for (var i = 0; i < list.length; i++) {
            var x = list[i];
            data.push(x.getAttribute("title"));
        }
        google.script.run.writeToSpreadSheet(data);
    }
</script>
</html>
4- Save your .gs and .html files and go back to your spreadsheet. Reload the spreadsheet, click "Parse Menu" > "Parse", and then click "Click here to get list" in the sidebar.
Xml.parse() has an option to turn on lenient parsing, which helps when parsing HTML. Note that the Xml service is deprecated however, and the newer XmlService doesn't have this functionality.
For simple tasks such as grabbing one value from a webpage, you could use a regular expression. Regex is notoriously bad for parsing HTML, as there are all sorts of weird cases that can trip it up, but if you're confident about the HTML you're accessing, this can sometimes be the simplest way.
Here's an example that fetches the contents of the page's <title> tag:
var page = UrlFetchApp.fetch(contestURL);
var regExp = new RegExp("<title>(.*)</title>", "gi");
var result = regExp.exec(page.getContentText());
// [1] is the match group when using parenthesis in the pattern
var value = result ? result[1] : 'No title found';
I know it is not exactly what OP asked, but I found this question when I was looking for some html parsing options - so it might be useful for others as well.
There is an easy-to-use library for TEXT parsing. It's useful if you want to get only one piece of information from the HTML (XML) code.
EDIT 2021: The script library id is:
1Mc8BthYthXx6CoIz90-JiSzSafVnT6U3t0z_W3hLTAX5ek4w0G_EIrNw
It extracts the text between the given from and to markers, as in the example below:
function getData() {
    var url = "https://chrome.google.com/webstore/detail/signaturesatori-central-s/fejomcfhljndadjlojamaklegghjnjfn?hl=en";
    var fromText = '<span class="e-f-ih" title="';
    var toText = '">';

    var content = UrlFetchApp.fetch(url).getContentText();
    var scraped = Parser
        .data(content)
        .from(fromText)
        .to(toText)
        .build();
    Logger.log(scraped);
    return scraped;
}
If you are using the Cheerio library for Google Apps Script:
Source code
Library page (⭐ star it!)
Installation by library ID:
1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0
A function to get current emojis from unicode.org:
function getEmojis() {
    var t = new Date();
    var url = 'https://unicode.org/emoji/charts/full-emoji-list.html';
    var fetch = UrlFetchApp.fetch(url);
    var contentText = fetch.getContentText();
    //console.log(new Date() - t);

    // Cheerio
    var $ = Cheerio.load(contentText);
    var data = [];
    $("table > tbody > tr").each((index, element) => {
        var row = [];
        $(element).find("td").each((index, child) => {
            row.push($(child).text());
        });
        if (row.length > 0) {
            data.push(row);
        }
    });
    //console.log(data);
    //console.log(new Date() - t);

    // Result
    return data;
}
↑ The sample code shows how to parse the table and put it into a 2D array.
It may also be used as a custom function:
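For example (my own sketch; the getEmojis function above must live in the spreadsheet's bound script), entering the formula below in a cell runs the function and spills the returned 2D array into the sheet:
=getEmojis()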
Bonus
Parsing a site may be a time-consuming operation, and you may reach the quota limit.
Here's a test file with a full version of the script:
https://docs.google.com/spreadsheets/d/1iO7YjYWyfseQu_YCfRbGDPg7NskOgMu_iO1iGjr7KxY/edit#gid=93365395
↑ It uses CacheService to reduce the number of calls.
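A minimal sketch of that idea (the fetchWithCache name and using the URL as the cache key are my assumptions; note that CacheService values are capped at 100 KB and at most 6 hours):
function fetchWithCache(url) {
    var cache = CacheService.getScriptCache();
    var cached = cache.get(url); // the URL doubles as the cache key (hypothetical scheme)
    if (cached != null) {
        return cached;
    }
    var content = UrlFetchApp.fetch(url).getContentText();
    cache.put(url, content, 21600); // keep for 6 hours, the maximum allowed
    return content;
}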
Natively there's no way, unless you do what you already tried, which won't work if the HTML doesn't conform to the XML format.
There are two options
a) One is to use JavaScript's string functions. First locate your tag using string.indexOf() and then extract the data you want using string.substring(); see the sketch after this list.
b) The other option is to make use of the Xml Service.
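A minimal sketch of option (a), assuming you want the text of a known tag such as <title>:
function extractTitle(html) {
    var open = '<title>';
    var start = html.indexOf(open);
    var end = html.indexOf('</title>');
    if (start === -1 || end === -1) {
        return null; // tag not found
    }
    return html.substring(start + open.length, end);
}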
It's not possible to create an HTML DOM server-side in Apps Script. Using regular expressions is likely your best option, at least for simple parsing.