Problem:
I am parsing pages generated by JS using HtmlUnit.
I have to wait until all the JavaScript has loaded and then parse the page.
All these pages share the same JS scripts.
There is one problematic script that won't parse.
The problematic script does not affect html rendering.
What I want to do:
I want to detect the name of the problematic script.
Put this name on a blacklist.
And skip it in further parsing.
This is the code I use for JS loading:
private void waitForJs(WebClient client, HtmlPage page) throws Exception {
    int maxDelay = 1000;
    int attempts = 10;
    int i = client.waitForBackgroundJavaScript(maxDelay);
    while (i > 0 && attempts > 0) {
        i = client.waitForBackgroundJavaScript(maxDelay);
        if (i == 0) {
            break;
        }
        synchronized (page) {
            page.wait(500);
        }
        log("Waiting for JS (" + i + "), attempts: " + attempts, false);
        attempts--;
    }
}
I had to introduce the "attempts" variable so that the loop does not get stuck on the damaged script. Instead, I want to put every problematic script (those still pending in waitForJs) on a blacklist and skip loading them in the future. Is this possible?
The code below has an encoding issue: we have to use the correct charset when getting the bytes from the content string.
WebResponseData data = new WebResponseData(content.getBytes(response.getContentCharset()),
        response.getStatusCode(), response.getStatusMessage(), response.getResponseHeaders());
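Why the charset matters can be shown with the JDK alone. This is a minimal sketch, not HtmlUnit code; the class name CharsetRoundTrip is hypothetical. It encodes a string with one charset and decodes it with another, which is what happens when the bytes are produced with a charset that differs from the one the response declares.

```java
import java.nio.charset.Charset;

// Hypothetical helper: encodes text with one charset and decodes it with another.
final class CharsetRoundTrip {
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        // getBytes(Charset) makes the encoding explicit instead of relying on the platform default.
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }
}
```

When both charsets match, the text survives unchanged; when they differ, non-ASCII characters are typically corrupted.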
You can replace the content of the JavaScript with an empty string:
new WebConnectionWrapper(webClient) {
    public WebResponse getResponse(WebRequest request) throws IOException {
        WebResponse response = super.getResponse(request);
        if (request.getUrl().toExternalForm().contains("my_url")) {
            String content = response.getContentAsString();
            // change content
            content = "";
            WebResponseData data = new WebResponseData(content.getBytes(),
                    response.getStatusCode(), response.getStatusMessage(), response.getResponseHeaders());
            response = new WebResponse(data, request, response.getLoadTime());
        }
        return response;
    }
};
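The blacklist itself could be kept in a small helper that the getResponse override above consults instead of the hard-coded "my_url" check. This is a minimal sketch, assuming scripts are matched by URL substring; the class ScriptBlacklist and its methods are hypothetical and not part of HtmlUnit.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper (not an HtmlUnit class): remembers URL fragments of scripts to skip.
final class ScriptBlacklist {
    // Concurrent set, so the list can be shared between WebClient instances and threads.
    private final Set<String> fragments = ConcurrentHashMap.newKeySet();

    // Register a URL fragment, for example the file name of the script that never finishes.
    void add(String urlFragment) {
        if (urlFragment != null && !urlFragment.isEmpty()) {
            fragments.add(urlFragment);
        }
    }

    // True if the requested URL contains any registered fragment.
    boolean isBlacklisted(String url) {
        if (url == null) {
            return false;
        }
        for (String fragment : fragments) {
            if (url.contains(fragment)) {
                return true;
            }
        }
        return false;
    }
}
```

Inside the wrapper, the condition would then read something like blacklist.isBlacklisted(request.getUrl().toExternalForm()). How to detect which script is still pending is not covered by this sketch.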
I need to scrape some pages. The problem is that some of these pages use JavaScript to load part of their content and some do not, and there is no common tag or content that tells me whether the content has loaded. I also can't use a timer or a loop to wait and check whether the content changed. Currently I'm using the WebBrowser control to scrape and parse the content.
I'm already using the following code to check whether the page has completely loaded and whether its content has changed, but it does not work properly.
while (wb.ReadyState != System.Windows.Forms.WebBrowserReadyState.Complete)
{
System.Windows.Forms.Application.DoEvents();
}
Any idea how to tackle this? Thanks.
I hope the following code will help you.
Create a method that waits for a few seconds:
public void Wait(int sec)
{
    System.Windows.Forms.Timer timer1 = new System.Windows.Forms.Timer();
    if (sec == 0 || sec < 0) return;
    timer1.Interval = sec * 1000;
    timer1.Enabled = true;
    timer1.Start();
    timer1.Tick += (s, e) =>
    {
        timer1.Enabled = false;
        timer1.Stop();
    };
    while (timer1.Enabled)
    {
        Application.DoEvents();
    }
}
Put the following code in the DocumentCompleted event handler. It checks whether the element exists; if it is null, it waits 2 seconds and tries again, up to 30 times (about one minute). If the element still has not loaded, report that the page did not load (the code below throws an exception).
int cnt = 0;
HtmlElement htmlElement = WebBrowser1.Document.GetElementById("elementID");
do
{
    Wait(2);
    cnt++;
    htmlElement = WebBrowser1.Document.GetElementById("elementID");
    if (cnt > 30)
    {
        throw new Exception();
    }
} while (htmlElement == null);
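The same bounded-retry idea, written in Java to match the HtmlUnit code earlier on this page, can be sketched as a small reusable helper. The class name Poller and the method pollUntilPresent are hypothetical; the lookup stands in for something like GetElementById.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper: retries a lookup until it yields a non-null value or attempts run out.
final class Poller {
    static <T> Optional<T> pollUntilPresent(Supplier<T> lookup, int maxAttempts, long delayMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            T value = lookup.get();
            if (value != null) {
                return Optional.of(value);
            }
            // Sleep between attempts, but not after the last one.
            if (attempt < maxAttempts - 1) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and give up early.
                    Thread.currentThread().interrupt();
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }
}
```

Returning an empty Optional instead of throwing leaves it to the caller to decide how to report that the page did not load.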
If scraping using a browser works, then try using PuppeteerSharp, which is a "Headless Chrome .NET API".
You should be able to do the same thing entirely in C#.
I am trying to write an application in C# using CefSharp. My intention is to fetch all the links on the given page eg,
https://wixlabs---dropbox-folder.appspot.com/index?instance=lp5CbqBbK6JUFzCW2hXENEgT4Jn0Q-U1-lIAgEbjeio.eyJpbnN0YW5jZUlkIjoiYjNiNzk5YjktNjE5MS00ZDM0LTg3ZGQtYjY2MzI1NWEwMDNhIiwiYXBwRGVmSWQiOiIxNDkyNDg2NC01NmQ1LWI5NGItMDYwZi1jZDU3YmQxNmNjMjYiLCJzaWduRGF0ZSI6IjIwMTgtMDEtMjJUMTg6Mzk6MjkuNjAwWiIsInVpZCI6bnVsbCwidmVuZG9yUHJvZHVjdElkIjpudWxsLCJkZW1vTW9kZSI6ZmFsc2V9&target=_top&width=728&compId=comp-j6bjhny1&viewMode=viewer-seo
When I load the page and open the dev tools and execute
document.getElementsByTagName('a');
in the dev tools I get 374 results. Next I execute the following code from BrowserLoadingStateChanged:-
private async Task ProcessLinksAsync()
{
    var frame = browser.GetMainFrame();
    var response = await frame.EvaluateScriptAsync("(function() { return document.getElementsByTagName('a'); })();", null);
    ExpandoObject result = response.Result as ExpandoObject;
    Console.WriteLine("Result:" + result);//What do I do here?
}
I get an expando object which seems to contain nothing. I am saying this because I used a break point and inspected the object. I have gone through https://keyholesoftware.com/2019/02/11/create-your-own-web-bots-in-net-with-cefsharp/ , https://github.com/cefsharp/CefSharp/wiki/General-Usage#javascript-integration and the questions on SO but was unable to solve my problem.
Am I doing something wrong here?
My actual intention is to fetch the links and then navigate to them.
Thanks in advance.
EDIT:
I used the following script; in both the browser and the dev tools it returns 187 results, which is correct.
(function() {
    var links=document.getElementsByClassName('file-link');
    var linksArray = new Array();
    for (var i = 0; i < links.length; i++) {
        linksArray[i] = String(links[i].href);
    }
    return linksArray;
})();
But in my application I get a 0 length array.
EDIT-2:
I used the following code to get the DOM:-
public void OnContextCreated(IWebBrowser browserControl, IBrowser browser, IFrame frame)
{
    ContextCreated?.Invoke(this, frame);
    const string script = "document.addEventListener('DOMContentLoaded', function(){ alert(document.links.length); });";
    frame.ExecuteJavaScriptAsync(script);
}
The code worked for every other site I tried, except the URL mentioned above. Could anyone tell me what might be wrong? The DOM is loaded and fully accessible in the dev tools, so I guess something is missing in my code.
Thanks again.
You need to wait until the page has loaded. If the page also loads data using AJAX, you need to wait a bit longer for that data to load as well. Then you need to shape the result into a custom JavaScript object.
ChromiumWebBrowser browser;
protected override void OnLoad(EventArgs e)
{
    base.OnLoad(e);
    browser = new ChromiumWebBrowser(
        "https://google.com/"); // Tried with your URL.
    browser.LoadingStateChanged += Browser_LoadingStateChanged;
    browser.Dock = DockStyle.Fill;
    Controls.Add(browser);
}
private async void Browser_LoadingStateChanged(object sender,
    LoadingStateChangedEventArgs e)
{
    if (!e.IsLoading)
    {
        await Task.Delay(5000); //Just for pages which use ajax loading data
        var script = @"
            (function () {
                var data = document.getElementsByTagName('a');
                return Array.from(data, a => ({href:a.href, innerText:a.innerText}));
            })();";
        var result = await browser.EvaluateScriptAsync(script);
        var data = (IEnumerable<dynamic>)result.Result;
        MessageBox.Show(string.Join("\n", data.Select(x=>$"{x.href}").Distinct()));
    }
}
We have data as XML, and there are multiple XSL stylesheets for formatting it. This worked fine in IE until now.
Then we needed to display the same content as HTML in Chrome, so we found a server-side API (Java) to transform XML + XSL into HTML.
public static String convertXMLXSL(String xml, String xsl) throws SQLException {
    System.setProperty("javax.xml.transform.TransformerFactory", "org.apache.xalan.processor.TransformerFactoryImpl");
    TransformerFactory tFactory = TransformerFactory.newInstance();
    String html = "";
    try {
        try {
            StreamResult result = new StreamResult(new StringWriter());
            Transformer trans = tFactory.newTransformer(new StreamSource(new ByteArrayInputStream(xsl.getBytes("utf-8"))));
            trans.transform(new StreamSource(new ByteArrayInputStream(xml.getBytes("utf-8"))), result);
            html = result.getWriter().toString();
        } catch (TransformerException te) {
            te.printStackTrace();
        }
    } catch (Exception e) {
        AppendExceptionToLog(e);
    }
    return html;
}
But now, after some time, we see thread dumps with threads blocked in the transform method of javax.xml.transform.Transformer:
Sep 12, 2017 12:07:49 PM org.apache.catalina.valves.StuckThreadDetectionValve notifyStuckThreadDetected
WARNING: Thread "http-8080-12" (id=15800) has been active for 6,516 milliseconds (since 9/12/17 12:07 PM) to serve the same request for
and may be stuck (configured threshold for this StuckThreadDetectionValve is 5 seconds).
There is/are 3 thread(s) in total that are monitored by this Valve and may be stuck.
java.lang.Throwable
at org.apache.xpath.axes.AxesWalker.getNextNode(AxesWalker.java:333)
at org.apache.xpath.axes.AxesWalker.nextNode(AxesWalker.java:361)
at org.apache.xpath.axes.WalkingIterator.nextNode(WalkingIterator.java:192)
at org.apache.xpath.axes.NodeSequence.nextNode(NodeSequence.java:281)
at org.apache.xpath.axes.NodeSequence.runTo(NodeSequence.java:435)
at org.apache.xpath.axes.NodeSequence.setRoot(NodeSequence.java:218)
at org.apache.xpath.axes.LocPathIterator.execute(LocPathIterator.java:210)
at org.apache.xpath.XPath.execute(XPath.java:335)
at org.apache.xalan.templates.ElemVariable.getValue(ElemVariable.java:278)
at org.apache.xalan.templates.ElemVariable.execute(ElemVariable.java:246)
at org.apache.xalan.transformer.TransformerImpl.executeChildTemplates(TransformerImpl.java:2411)
at org.apache.xalan.templates.ElemLiteralResult.execute(ElemLiteralResult.java:1374)
at org.apache.xalan.transformer.TransformerImpl.executeChildTemplates(TransformerImpl.java:2411)
at org.apache.xalan.transformer.TransformerImpl.applyTemplateToNode(TransformerImpl.java:2281)
at org.apache.xalan.transformer.TransformerImpl.transformNode(TransformerImpl.java:1367)
at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImpl.java:709)
at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImpl.java:1284)
at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImpl.java:1262)
at Util.processXMLXSL(Util.java:3364)
Here, I wanted to know..
1) Are there any other known implementations that do the same on the server side?
2) Should I consider a client-side approach using Mozilla's XSLTProcessor?
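One point can be sketched independently of which XSLT engine is chosen: the method above parses and compiles the stylesheet on every call. The standard javax.xml.transform API allows a stylesheet to be compiled once into a Templates object, which is safe to share between threads, while each Transformer created from it is used by a single thread. The class name CompiledStylesheet is hypothetical, the sketch uses whatever TransformerFactory is configured, and whether this helps with the stuck threads in the trace above is an assumption that the source does not confirm.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical wrapper: compiles an XSL stylesheet once and reuses it for many documents.
final class CompiledStylesheet {
    private final Templates templates;

    CompiledStylesheet(String xsl) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            // Templates is the compiled, thread-safe form of the stylesheet.
            this.templates = factory.newTemplates(new StreamSource(new StringReader(xsl)));
        } catch (TransformerException e) {
            throw new IllegalStateException("Stylesheet could not be compiled", e);
        }
    }

    String apply(String xml) {
        try {
            StringWriter out = new StringWriter();
            // A fresh Transformer per call; Transformer instances are not thread-safe.
            templates.newTransformer().transform(
                    new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }
}
```

A CompiledStylesheet instance could be cached per stylesheet (for example in a map keyed by stylesheet name) and shared by all request threads.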
I've got a winforms app that has a ChromiumWebBrowser control and some basic windows controls. I want to be able to click a button, call javascript to get the value of a textbox in the browser, and copy the returned value to a textbox in the winforms app. Here is my code:
string script = "(function() {return document.getElementById('Email');})();";
string returnValue = "";
var task = browser.EvaluateScriptAsync(script, new { });
await task.ContinueWith(t =>
{
    if (!t.IsFaulted)
    {
        var response = t.Result;
        if (response.Success && response.Result != null)
        {
            returnValue = (string)response.Result;
        }
    }
});
txtTarget.Text = returnValue;
The result that comes back however is just "{ }". I've loaded the same web page in Chrome and executed the same javascript in the dev tools and I get the textbox value as expected.
The demo I looked at had sample code, simply "return 1+1;", and when I tried that I was getting the value "2" returned instead of "{ }". Interestingly, when I tried
string script = "(function() {return 'hello';})()";
I was still getting "{ }", almost as though this doesn't work with strings.
I've been scratching my head at this for a while and haven't been able to figure out how to solve this. Am I making a very basic syntax error or is there something more complicated going on?
So I think I've figured it out:
string script = "(function() {return document.getElementById('Email').value;})();";
string returnValue = "";
var task = browser.EvaluateScriptAsync(script);
await task.ContinueWith(t =>
{
    if (!t.IsFaulted)
    {
        var response = t.Result;
        if (response.Success && response.Result != null)
        {
            returnValue = response.Result.ToString();
        }
    }
});
txtTarget.Text = returnValue;
Removing the args object from EvaluateScriptAsync seemed to fix the issue. Not sure what the problem was - perhaps it was trying to run the javascript function with an empty args object when it shouldn't take any parameters?
Either way, it's resolved now.
public void SetElementValueById(ChromiumWebBrowser myCwb, string eltId, string setValue)
{
    string script = string.Format("(function() {{document.getElementById('{0}').value='{1}';}})()", eltId, setValue);
    myCwb.ExecuteScriptAsync(script);
}
public string GetElementValueById(ChromiumWebBrowser myCwb, string eltId)
{
    string script = string.Format("(function() {{return document.getElementById('{0}').value;}})();",
        eltId);
    JavascriptResponse jr = myCwb.EvaluateScriptAsync(script).Result;
    return jr.Result.ToString();
}
My javascript:
var params = {};
params.selectedCurrency = 'USD';
params.orderIdForTax = '500001';
var xhrArgs1 = {
    url : 'UpdateCurrencyCmd',
    handleAs : 'text',
    content : params,
    preventCache:false,
    load:function(data){
        alert('success!');
    },
    error: function(error){
        alert(error);
        //the alert says 'SyntaxError: syntax error'
    },
    timeout:100000
};
dojo.xhrPost(xhrArgs1);
I tried debugging with Firebug, and I do get the appropriate response (I think). Here it is:
/*
{
"orderIdForTax": ["500001"],
"selectedCurrency": ["USD"]
}
*/
The comment markers /* and */ are added automatically because the URL I'm hitting with xhrPost is actually a command class in IBM's WebSphere Commerce environment. Can anyone tell me what I am doing wrong here?
Server code
public void performExecute() throws ECException {
    try{
        super.performExecute();
        double taxTotal;
        System.out.println("Updating currency in UpdateCurrencyCmd...");
        GlobalizationContext cntxt = (GlobalizationContext) getCommandContext().getContext(GlobalizationContext.CONTEXT_NAME);
        if(requestProperties.containsKey("selectedCurrency"))
            selectedCurrency = requestProperties.getString("selectedCurrency");
        else
            selectedCurrency = cntxt.getCurrency();
        if(requestProperties.containsKey("orderIdForTax"))
            orderId = requestProperties.getString("orderIdForTax");
        OrderAccessBean orderBean = new OrderAccessBean();
        cntxt.setCurrency(selectedCurrency.toUpperCase());
        orderBean.setInitKey_orderId(orderId);
        orderBean.refreshCopyHelper();
        orderBean.setCurrency(selectedCurrency.toUpperCase());
        orderBean.commitCopyHelper();
        TypedProperty rspProp = new TypedProperty();
        rspProp.put(ECConstants.EC_VIEWTASKNAME, "AjaxActionResponse");
        setResponseProperties(rspProp);
    }catch(Exception e){
        System.out.println("Error: " + e.getMessage() );
    }
}
The problem was with my client side code, weirdly.
load:function(data){
    data = data.replace("/*", "");
    data = data.replace("*/", "");
    var obj = eval('(' + data + ')');
    alert('Success');
}
It's weird, but this worked. Lol.
I guess the problem is with the comment-filtering option of the handleAs setting.
The response should be comment-filtered as below.
See the AjaxActionResponse.jsp (WCS).
Available Handlers
There are several pre-defined contentHandlers available to use. The value represents the key in the handlers map.
text (default) - Simply returns the response text
json - Converts response text into a JSON object
xml - Returns a XML document
javascript - Evaluates the response text
json-comment-filtered - An (arguably unsafe) handler for preventing JavaScript hijacking
json-comment-optional - A handler which detects the presence of a filtered response and toggles between json or json-comment-filtered appropriately.
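To make the comment-filtered idea concrete: the server wraps the JSON in /* ... */, and the client removes that wrapper before parsing. The unwrapping step alone can be sketched in Java, the language of the server code above. The class name CommentFilter is hypothetical, and this is not how Dojo implements its handlers.

```java
// Hypothetical helper: strips one surrounding /* ... */ wrapper from a response body.
final class CommentFilter {
    static String unwrap(String body) {
        String trimmed = body.trim();
        // Only unwrap when the whole body is enclosed in a single comment.
        if (trimmed.length() >= 4 && trimmed.startsWith("/*") && trimmed.endsWith("*/")) {
            return trimmed.substring(2, trimmed.length() - 2).trim();
        }
        // Unfiltered responses pass through unchanged (apart from trimming).
        return trimmed;
    }
}
```

Unlike the two replace calls in the workaround above, this only touches markers at the very start and end of the body, so a "/*" inside a JSON string value is left alone.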
Examples