Parse the contents of a webpage - javascript

I'm working on an HTML page for my department at work. Just HTML and CSS, nothing fancy. Now we are trying to get data from another webpage and display it in the new one we are working on. I assume I would need to use JavaScript and a parser of some sort, but I'm not sure how to do this or what to search for.
The solution I imagine is a function that I can feed the link of the webpage we want to mine, and that returns (for example) the number of times a certain word is repeated on that page.

The best way to go about it is to use Node.js and install the cheerio (parser) and request (HTTP request) modules. There are many detailed tutorials showing how to do this (e.g. the one at DigitalOcean).
But if you don't want to set up Node.js and would rather work with a plain web page, download the cheerio and request JS libraries and include them in your HTML page in a <script> tag, then follow the same approach. I hope it helps.
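As a rough illustration of the Node.js route, here is a minimal sketch that fetches a page and counts how often a word appears. It assumes the request and cheerio packages are installed (npm install request cheerio); the URL and the search word are placeholders.

// Minimal sketch: count occurrences of a word in a remote page.
// Assumes `npm install request cheerio`; URL and word are placeholders.
const request = require('request');
const cheerio = require('cheerio');

function countWord(url, word, callback) {
  request(url, (err, res, body) => {
    if (err) return callback(err);
    const $ = cheerio.load(body);            // parse the fetched HTML
    const text = $('body').text();           // visible text of the page
    // simple case-insensitive count (no regex escaping of the word)
    const matches = text.match(new RegExp(word, 'gi')) || [];
    callback(null, matches.length);
  });
}

countWord('https://example.com', 'department', (err, count) => {
  if (err) throw err;
  console.log('Found ' + count + ' occurrence(s)');
});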

Related

Best Way to Automatically Publish Content of a Web Page Article online Into an Existing Template (with FTP?)

Say I have a news website with articles. I have a blank article page with everything BUT the headline, photos, and the article text itself, which I would ordinarily fill in manually. Instead of filling it in, say I have the entire div class already ripped from a web page. I want to import this content directly onto the page and publish it with minimal steps.
(I hope I'm giving you the picture. Imagine I have cars fully built aside from missing engines and I want the monkeys I've hired to steal engines to not leave the engines piling up outside, but instead to also bring them inside and install them into the cars and drive them to the car dealer.)
I will be web scraping something like a Wikipedia page on golf and putting that into my page. I don't want to have to copy, paste, and click publish over and over. I want the web scraper, which I already know how to build, to go a step further and do a find-and-replace on a certain div class on my blank page INSTEAD of writing the data to a file on my computer's hard drive (though maybe writing to my hard drive with Python, then having JS or something read the HTML file from my hard drive and THEN writing it to my web page would be one way to do it).
Are there programs that will do this? Do you know of modules that will do this through Python? Do you know of anything like this somebody wrote and put up on GitHub?
I'm not planning on ripping off news websites, but just to give a simpler example with one object... If I had the entire div class "content" from here...
http://www.zerohedge.com/news/2017-02-18/merkel-says-there-problem-euro-blames-mario-draghi
saved as an HTML file on my hard drive (which you could obtain by right-clicking anywhere on the text of the main article, choosing Inspect, then Copy > Copy outerHTML, and pasting the result into your text editor; again, something I would have done with a web scraper), how could I get this pasted into a blank 'new article' page and published on my website with the push of a button, automatically? I'm fine with having to click a few buttons, but not with copying and pasting.
I'll be doing this (legally) with parts of web pages over and over again, and I'm sure it can be automated in some way. I've heard financial news websites have been writing articles from data, so something like what I need probably exists. I might be running the text I scrape through a basic neural net or feeding it to GANs. I think some interesting things can be made this way, in case you are curious what I'm up to.
If you're using Python to do this, I feel the quickest way would be to have the web crawler save its findings to either a JSON file or an SQL database that your website front end shares access to (storing the HTML you pulled as a string of text).
If you go the JSON route, just send an AJAX request for the file and place the stored HTML in the element you're dumping the code into using innerHTML (see the sketch after this answer).
If you go the SQL route, have a Python script on the web server that you can send a POST request to; it pulls the page data you want from the database and returns it to the browser as JSON, and then you do the same as above.
The benefit of going straight to JSON is not having to set up a connection to an SQL server or deal with the SQL-query-to-JSON conversion step. However, the benefit of the SQL database is not having to worry about write conflicts in the JSON file if your crawler is working with multiple threads and you don't lock the file correctly.
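For the JSON route, a minimal front-end sketch might look like this; the file name, the data.html field, and the 'content' element id are assumptions, not anything from your scraper.

// Minimal sketch: load scraped HTML stored as a string in a JSON file
// and drop it into the article container. File name, field name, and
// element id are placeholders.
fetch('scraped-article.json')
  .then((res) => res.json())
  .then((data) => {
    // data.html is assumed to hold the scraped outerHTML as a string
    document.getElementById('content').innerHTML = data.html;
  })
  .catch((err) => console.error('Could not load scraped content:', err));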

Can Googlebot crawl javascript generated content?

We have a web app whose content is generated by JavaScript. Can Google index those pages?
When we investigate this issue, we keep finding solutions on old pages about using "#!" in links.
In our app the links are like this:
domain.com/paris
domain.com/london
When we use these kinds of links, JavaScript populates the content.
Is it wise to use HTML snapshot or do you have any other suggestions?
Short answer
Yes, they can crawl JavaScript-generated content, as long as you are using pushState.
Detailed answer
It depends on your setup. Google and Bing CAN crawl JavaScript and AJAX-based content if you are using pushState. If you do, they will handle content coming from AJAX calls, updates to the page title or meta tags made with JavaScript, and in general any such things.
Most frontend frameworks like Angular, Ember or Backbone already work with pushState, so in those cases you don't need to do anything. Check whatever system you are using to see how it does things. If you are not using pushState, you will need to implement it on your own or use the whole escaped_fragment HTML snapshot approach.
So if you use pushState then yes, search engines can crawl your page just fine. If you don't, then no, you will need to implement pushState or do HTML snapshots.
Bonus info - unfortunately Facebook does not handle pushState, so the Facebook crawler needs either non-dynamic og tags or HTML snapshots.
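To make the pushState idea concrete, here is a minimal sketch for links like domain.com/paris: content is loaded via AJAX and the address bar is updated so each city keeps its own crawlable URL. The /api/cities endpoint and the 'content' element id are assumptions.

// Load city content via AJAX without pushing a new history entry
function loadCity(city) {
  fetch('/api/cities/' + city)
    .then((res) => res.text())
    .then((html) => {
      document.getElementById('content').innerHTML = html;
      document.title = city;                 // keep the title in sync
    });
}

// Called from link clicks: update the address bar, then load the content
function showCity(city) {
  history.pushState({ city: city }, '', '/' + city); // e.g. domain.com/paris
  loadCity(city);
}

// Back/forward buttons restore previously pushed states
window.addEventListener('popstate', (event) => {
  if (event.state && event.state.city) loadCity(event.state.city);
});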
"Generated by JavaScript" is ambiguous. That could mean that you are running a JS script on the server or it could mean that you are making an AJAX call with a JS API. The difference appears to matter as far as Googlebot is concerned. But you don't have to take my word for it, as there is empirical proof of what Googlebot will and won't currently cache as far as JavaScript content in the form of live experiments using both the XMLHTTPRequest API and the Fetch API. So, as you can see, server-side rendering is still going to be the best way to go for SEO.

Get the HTML source code including the content populated by JSON

I am working on a web crawler and I found that some websites populate their content from JSON. This makes it hard for me to get the data using Simple HTML DOM. Is there any way to get the final HTML code that I can see in the inspect element?
This is not a trivial task. You'll need to use a "headless browser" and actually execute the JavaScript on the page. There are several headless browser implementations out there to choose from (just search on the term); then, of course, you'll have to drive them from PHP.
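As an illustration of the headless-browser approach (here driven from Node with Puppeteer rather than from PHP; the package choice and URL are assumptions, not part of the original answer), a minimal sketch:

// Minimal sketch: render the page in headless Chrome and grab the final DOM.
// Assumes `npm install puppeteer`; the URL is a placeholder.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle0' }); // wait for AJAX/JSON calls
  const html = await page.content();   // HTML after JavaScript has run
  console.log(html);
  await browser.close();
})();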

Accessing html form fields with an external application

I created a command-line tool to help expedite HTML form filling. It uses a brute-force approach in that it sends TAB keys to a window and writes info from a config file. This is unstable, so I want to refactor it to set form fields using JavaScript.
I've looked into writing a Firefox addon to do this. I was able to hard-code each field id and write to it from a config file. My issue is I need this functionality in IE.
Is there a way an external application (i.e. a command-line tool) can write to HTML fields using JavaScript? I've tried recreating the entire HTML page with the form fields filled in, in Java, and then sending it to the normal destination with an HTTP POST. I ran into authentication issues because the forms require a login.
My other idea is looking into web service tricks. It may be unrelated, I have no idea.
Why not try something like Selenium?
It will stop your reliance on hard-coding everything, as you have pretty much free rein over the DOM.
Correct me if I'm wrong, though.
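A minimal sketch of the Selenium idea, driven from Node with the selenium-webdriver package (the field ids, values, and URL are placeholders; the values could just as well come from your config file, and IE is also supported through its own driver):

// Minimal sketch: fill and submit a form with Selenium WebDriver from Node.
// Assumes `npm install selenium-webdriver`; ids, values and URL are placeholders.
const { Builder, By } = require('selenium-webdriver');

(async () => {
  const driver = await new Builder().forBrowser('firefox').build();
  try {
    await driver.get('https://example.com/form');
    await driver.findElement(By.id('username')).sendKeys('jdoe');    // values could be
    await driver.findElement(By.id('department')).sendKeys('IT');    // read from a config file
    await driver.findElement(By.id('submit')).click();
  } finally {
    await driver.quit();
  }
})();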
You can open a CwebBrowser2 control in your C++/C# application, use it as an HTML browser, and get all the HTML programmatically. You can then parse the HTML with an XML parser and call certain JavaScript hooks.
The HTTP POST idea still seems best; if you have trouble authenticating, you just need to mimic that part as well, or get the session ID (if a given session is enough for you).

How can I get dynamically web content using Perl?

This is kind of tricky. There is a webpage which, I am guessing, uses some kind of AJAX to pull in content based on the search query. When I fetch the page using get in Perl, it fetches the script code behind the PHP/HTML, but not the results which are displayed when the query is run manually. I need to be able to fetch the content of the results page. Is there any way to do this in Perl?
Take a look at Selenium RC and the WWW::Selenium module in Perl. With them you can control a real web browser.
Another option is WWW::HtmlUnit which uses the HtmlUnit Java library to execute the JavaScript without a web browser. WWW::HtmlUnit uses Inline::Java to give Perl access to the library. I have found that when installing, it is best to say No to the question "Do you wish to build the JNI extension?".
If you are writing tests that need to check the rendered page, you can have a look at Schwern's javascript-tap-harness, which works with Selenium and handles all the scaffolding.
I also found Using WWW::Selenium To Test Or Automate An Ajax Website pretty useful.
