I want to build a web crawler/scraper web application using Angular. The idea is to make all HTTP requests from the client side. Using a headless browser could take away much of the pain of parsing HTML and evaluating JS code. Are there any headless JS-based browsers I can use?
I read about headless Chrome and Puppeteer. It turns out it can only be used from the command line for running tests, not as a typical library that can work with Angular. Or is there a way?
I do a lot of scraping myself. This is how I solved the problem:
Get the HTML content using http/fetch.
Create an iframe.
Provide the string HTML content to the iframe.
I am currently creating a solution using the same strategy above. Hope it helps.
Edit 1: Do let me know in case you need a working demo. I feel it's straightforward.
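A minimal sketch of the idea in TypeScript (the function name, the sandbox settings and the use of srcdoc are illustrative assumptions, and the fetch is subject to the target site's CORS policy):

// Fetch the page HTML on the client and hand the string content to an iframe.
async function loadIntoIframe(url: string): Promise<void> {
  const response = await fetch(url);           // 1. get the HTML content using fetch
  const html = await response.text();

  const iframe = document.createElement('iframe');  // 2. create an iframe
  iframe.sandbox.add('allow-same-origin');     // keep scripts disabled unless you trust the content
  iframe.srcdoc = html;                        // 3. provide the string HTML content
  document.body.appendChild(iframe);

  iframe.onload = () => {
    // The parsed DOM is now available for querying.
    const links = iframe.contentDocument?.querySelectorAll('a[href]');
    console.log(links?.length, 'links found');
  };
}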
We have a bunch of JavaScript code that uses UWP APIs and was written for a UWP JavaScript app (.jsproj). Now this app has been rewritten as a UWP C# app.
The UWP APIs in C#, JS and C++ are similar enough (see these examples for ApplicationData.LocalSettings), so migrating the code would not be that much of an effort, but it would still be work that has to be done.
Is there a way so that I do not have to rewrite all our JavaScript code in C# to be able to use it in our rewritten app? Can I somehow use the UWP JS APIs in a UWP C# app?
I was hoping I could use a simple WebView to somehow access the APIs (my thinking was that a UWP JS app is basically just a WebView), but in my testing I could not access them there.
Although I agree that the JavaScript code looks similar to the C# code, I'm afraid you cannot get what you want the way your post describes. The WebView control is lightweight, and I don't think it can include all the components required for your JS code to run.
If you've written code in a Windows Runtime Component before, as the doc Walkthrough: Creating a Simple Windows Runtime component and calling it from JavaScript describes, then you can reuse that Windows Runtime Component. But if you haven't done this, then I'm afraid you have to rewrite your code in C#.
By the way, I believe you will find the C# code easy to write, since there are more UWP C# demos than UWP JS demos.
This never got a proper answer, but the correct answer is to use the now-deprecated JSRT APIs. These can be found here, and an old blog post that provides some background on them can be found here.
Obviously this isn't as useful anymore with the deprecation of Spartan Edge, but it can still be helpful when there are no alternative options.
I'm attempting to change my command-line JS app to run in the browser instead, in an attempt to learn UI design. I was using Node.js to request an HTML page and extract some simple text from it, but require does not work in the browser. I know I can use Browserify, but is there a better, "go-to" option for this kind of task, aside from forcing myself to use Node.js?
Thanks
So I have just been getting into learning Node.js as part of learning how to build a web-scraping tool for a project I wanted to make.
I get all the content I need when I run the Node.js file directly from the terminal, but I want to know how to run that code from a website I am building, so the site can display the content I get from web scraping.
Any and all help is appreciated!
(Also, I am new to Stack Overflow, so if you need any more info I would be glad to provide it!)
Since Node.js runs on the server side, you need to call the Node.js server through Ajax and get the response back.
This website shows how to do web scraping in Node.js; once you have the data, pass it back to the browser as the response.
You may also check out Express.js, which gives you a "fast, unopinionated, minimalist web framework for Node.js".
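A minimal sketch of that setup in TypeScript, assuming an Express server that uses axios and cheerio for the scraping (the /scrape route and the title extraction are illustrative, not taken from the linked article):

// server.ts
import express from 'express';
import axios from 'axios';
import * as cheerio from 'cheerio';

const app = express();

// Scrape a page on the server and return the extracted data as JSON.
app.get('/scrape', async (req, res) => {
  const url = String(req.query.url);
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);
  res.json({ title: $('title').text() });
});

app.listen(3000, () => console.log('Listening on port 3000'));

// In the browser, make the Ajax call with fetch and display the response:
// fetch('/scrape?url=https://example.com')
//   .then(r => r.json())
//   .then(data => console.log(data.title));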
So you have a working Node application written in JavaScript. Perfect.
Now you want to run it in the browser. You can use Browserify for that: it packages all the Node.js modules into a single bundle and lets you use require from the browser.
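Roughly, the workflow looks like the sketch below; the file names and the cheerio module are illustrative assumptions, and requests made from the browser are still subject to the target site's CORS policy:

// main.js -- ordinary CommonJS code that Browserify can bundle for the browser
// Bundle with:   browserify main.js -o bundle.js
// Then include:  <script src="bundle.js"></script>
const cheerio = require('cheerio');

fetch('https://example.com')          // use fetch instead of Node's http module in the browser
  .then(res => res.text())
  .then(html => {
    const $ = cheerio.load(html);
    console.log($('title').text());   // extract some simple text from the page
  });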
Not exactly sure if this is what you are looking for, but you could look into a Cloud9 workspace (it's free) and use Express to render the HTML. It's pretty straightforward.
I think the Node.js tutorial at Tutorialspoint [http://www.tutorialspoint.com/nodejs/] is a good resource. Just a suggestion.
I am writing a spider with Scrapy; however, I have come across some websites that are rendered with JS, so urllib2.urlopen does not work. I found that I could open a browser with webbrowser.open_new(url), but I did not find a way to get the page source with webbrowser. Is there any way to do this with webbrowser, or is there another solution for dealing with JS sites that doesn't use webbrowser?
You can use a scraper built on the WebKit engine; several are available.
One of them is dryscrape.
Example:
import dryscrape
search_term = 'dryscrape'
# set up a web scraping session
sess = dryscrape.Session(base_url = 'http://google.com')
# we don't need images
sess.set_attribute('auto_load_images', False)
# visit homepage and search for a term
sess.visit('/')
q = sess.at_xpath('//*[@name="q"]')
q.set(search_term)
q.form().submit()
# extract all links
for link in sess.xpath('//a[@href]'):
    print(link['href'])
# save a screenshot of the web page
sess.render('google.png')
print "Screenshot written to 'google.png'"
See more info at:
https://github.com/niklasb/dryscrape
https://dryscrape.readthedocs.org/en/latest/index.html
If you need a full JS engine, there are a number of ways you can drive WebKit from Python. Until recently, this sort of thing was done with Selenium. Selenium drives an entire browser.
More recently there are newer and simpler ways to run a WebKit engine (which includes a JavaScript engine) from Python. See this SO question:
Headless Browser for Python (Javascript support REQUIRED!)
It references this blog post as an example: Scraping Javascript Webpages with Webkit. It seems to do more or less just what you need.
I've been trying to find an answer to the same problem for a few days now.
I suggest you try the Qt framework with WebKit.
There are two Python bindings: PyQt and PySide. You can use them directly if you want to create something more complex or if you want 100% control over your code.
For trivial stuff like executing JavaScript in a browser environment you can use Ghost.py. It has some sort of documentation and some problems when used from the command line, but otherwise it's just great.
If you need to process JavaScript you'll need to embed a JavaScript engine, which makes your spider much more complex, mainly because JavaScript almost always modifies the DOM based on timing or on actions taken by the user. This makes it extremely challenging to process JS in a crawler.
If you really need to process JavaScript in your spider you can have a look at the JavaScript engine by Mozilla: https://developer.mozilla.org/en/docs/SpiderMonkey
I'm writing a web crawler (web spider) that crawls all links on a website.
My application is a Win32 app, written in C# with .NET Framework 3.5.
Currently I'm using HttpWebRequest and HttpWebResponse to communicate with the web server.
I also built my own HTTP parser that can parse anything I want.
I can find all links like "href", "src", "action"... with the parser.
But I cannot solve one problem: simulating client-side script in the page (like JS and VBS).
For example, a link might look like:
a href = "javascript:buildLink(1)"
...where buildLink(parameter) is a JavaScript function that builds a custom link based on the parameter.
Please help me solve this problem. How can I simulate JavaScript in this app? I can parse the HTML source code and pull all the JavaScript code into another file, but how do I simulate a function from it?
Thanks.
Your only real option is to automate a browser. As other answers have said, you cannot reliably simulate browser javascript without having a complete DOM.
Fortunately there are ways to automate the browser; check out Selenium.
It has a C# API, so you can control the browser from C#.
Use your .NET web crawler code to crawl the site. Whenever you encounter an href="javascript:... link, handle the page containing the link in Selenium:
Use the Selenium API to tell the browser to load the page.
Use the Selenium API to find all links on the page.
This way, your spider only uses Selenium when necessary (pages without JavaScript links can be handled by the browser-less spider code you already have). And since this is an embarrassingly parallel workload, you could easily have multiple Selenium processes running at the same time (either on one computer or on several).
But remember that href="javascript: is hardly the only way a page can have dynamic links. The more common case is probably that an onload or $(document).ready() script manipulates the DOM and adds links that way.
To catch that case (and others), the spider will probably have to use Selenium for all pages that contain a <script> tag.
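The Selenium part of that flow is small. Below is a rough sketch of the load-the-page-then-collect-links sequence, written against the Node selenium-webdriver package rather than the C# bindings mentioned above (the browser choice and function name are illustrative):

import { Builder, By } from 'selenium-webdriver';

// Load a page in a real browser, let its JavaScript run, then read the resolved links.
async function collectLinks(url: string): Promise<Array<string | null>> {
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    await driver.get(url);                                        // tell the browser to load the page
    const anchors = await driver.findElements(By.css('a[href]')); // find all links on the page
    return Promise.all(anchors.map(a => a.getAttribute('href')));
  } finally {
    await driver.quit();
  }
}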
You are basically pretending to be a browser, except that HttpWebRequest only does the networking stuff for you.
I would recommend using the IE WebBrowser control and interoperating with it from your C# application. That will allow you to run JavaScript, set variables, post, etc.
Here are some basic links I found after a search for "IE web browser control":
http://www.c-sharpcorner.com/UploadFile/mahesh/WebBrowserInCSMDB12022005001524AM/WebBrowserInCSMDB.aspx
http://support.microsoft.com/kb/313068
This is a problem which is not easily solved. You could consider taking one of the existing JavaScript implementations and porting or interfacing with it somehow.
If I were tackling this problem, I'd probably build a small side application in Java on top of Rhino, with some sort of RPC framework layered on top of that so that I could communicate with it from my primary application.
Unfortunately, without having a complete DOM implementation on top of that, you would be limited to only very simple javascript.
You could execute the JavaScript by using the MS JScript engine or something similar.
MSDN Reference
Eric Lippert's blog on using Eval (part 1) (part 2) (part 3)
This isn't guaranteed to work, especially if the JavaScript tries to access the DOM or the like... but for simple scripts, it might be enough.