Recently, I have been developing an iOS app that needs to get content from a dynamic webpage.
Here is what I want to do:
http://www.ratemycoopjob.com
For the website above, I want to get the ratings for the employers I search for. The problem is that this is a dynamic page: some JavaScript code is called once a search runs, so I cannot treat the page as a static HTML file and parse it.
To parse HTML in iOS, you could try the HTML parser by Ben Reeves (https://github.com/zootreeves/Objective-C-HMTL-Parser).
But first, you have to collect the data from that site, generally by using a REST service (POST/GET with the right parameters) or by using its APIs.
Edit: sorry, I did not read your question carefully. For JavaScript-driven pages, you can create a UIWebView, load the site into that view, and then use
NSString *returnValue = [webView stringByEvaluatingJavaScriptFromString:@"your javascript code string here"];
to get the new content after the JavaScript has run on the site. However, you have to know which JavaScript functions the site uses.
Good luck.
I am using HtmlUnit to read content from a web site.
Everything works perfectly to the point where I am reading the content with:
HtmlDivision div = page.getHtmlElementById("my-id");
Even div.asText() returns the expected String object, but I want to get the original HTML inside <div>...</div> as a String object. How can I do that?
I am not willing to change HtmlUnit to something else, as the web site expects the client to run JavaScript, and HtmlUnit seems to be capable of doing what is required.
If by original HTML you mean the HTML code that HtmlUnit has already formatted, then you can use div.asXml(). If you really are looking for the original HTML the server sent you, then you won't find a way to do so (at least up to v2.14).
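For example, here is a minimal sketch (the URL and the "my-id" element are placeholders standing in for your own page):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlDivision;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class AsXmlExample {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        // Load the page; HtmlUnit runs the page's JavaScript by default
        HtmlPage page = webClient.getPage("http://example.com/some-page");
        HtmlDivision div = page.getHtmlElementById("my-id");

        // Plain text content of the div
        System.out.println(div.asText());
        // The div's markup as HtmlUnit has normalized it
        System.out.println(div.asXml());

        webClient.closeAllWindows();
    }
}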
Now, as a workaround, you could get the whole text of the page that the server sent you with this answer: How to get the pure raw HTML of a page in HTMLUnit while ignoring JavaScript and CSS?
As a side note, you should probably think twice about why you need the HTML code. HtmlUnit will let you get the data from the code, so there shouldn't be any need to store the source code, only the information contained in it. Just my 2 cents.
I've searched around the internet for a couple of hours and could not find anything related to what I'm trying to do. I wrote an HTML document that collects data from a user and stores it in a JavaScript array. This array is then joined together and stored as a string in a hidden field in the document. Originally, I was going to transfer this string to a program I wrote in C#, but now I am using LabVIEW.
In C#, I used two simple lines of code to do what I wanted:
System.Windows.Forms.HtmlElement hidden = webBrowser1.Document.GetElementById("hiddenfield1");
List<latlng> data = formharvest.extract(hidden.GetAttribute("value"));
But now I cannot find a way to access the data that is in this hidden document. I'm using the IWebBrowser2 block to embed my HTML code in my VI. Any help would be greatly appreciated. Thank you for your time!
A solution would be to start a web server in your LabVIEW program and serve your HTML form from it. I suppose it wouldn't be too hard to retrieve the form data then, but I haven't done such a thing myself.
Here's an interesting discussion on this with sample code.
I'm not sure I really understand what you are doing above: in the C#, it looks like you are embedding an HTML-rendering engine in a Windows form (i.e. window).
You can embed .NET in your LabVIEW code, so you should be able to embed the same HTML rendering engine in a LabVIEW VI. However, you might consider changing your approach, as CharlesB suggests, to something more traditional, where a server serves some HTML to a web browser, which then sends data back to the server via HTTP GET or POST.
I am trying to build an application that, when provided a .txt file filled with ISBN numbers, will visit the isbn.nu page for each ISBN by simply appending the ISBN to the URL www.isbn.nu/your isbn number.
After pulling up the page, I want to scan it for information about the book and store that in an Excel file.
I was thinking about creating a file stream from the URL in Java, but I am not really sure how to extract the information from the HTML page. Storing the information will be done using the JExcel Java package.
My best guess would be using JavaScript to extract the information, but I don't know how to call JavaScript from my Java program.
Is my idea plausible? If not, what do you suggest I do?
My goal: retrieve information from an HTML page and store it in an Excel file for each ISBN in a text file. There can be any number of ISBNs in a text file.
This isn't homework btw, I am simply doing this for an organization that donates books to Sudan. Currently they have 5 people cataloging these books manually and I am one of them.
Jsoup is a useful tool for parsing a web page and getting data from it. You can do it in Java and it's pretty easy.
You can parse the text file, build the URL with a string, send it in with Jsoup, then use Jsoup to parse out the information using the HTML tags on the page. Then you can store it however you want. You really don't need to use JavaScript at all if you're more comfortable with Java.
Example for reading a page and parsing it with Jsoup:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
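As a rough sketch of that flow (the isbn.nu URL pattern comes from the question; the selector below is only a guess, so inspect the actual page markup and adjust the select() call):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class IsbnLookup {
    public static void main(String[] args) throws Exception {
        // One ISBN per line in the input file
        List<String> isbns = Files.readAllLines(Paths.get("isbns.txt"));
        for (String isbn : isbns) {
            isbn = isbn.trim();
            if (isbn.isEmpty()) continue;
            // Build the URL by appending the ISBN, as described in the question
            Document doc = Jsoup.connect("http://www.isbn.nu/" + isbn).get();
            // Guessed selector -- inspect the real page and replace it
            String title = doc.select("h2").text();
            System.out.println(isbn + " -> " + title);
        }
    }
}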
Use a div into which you load your link (there is an example of how to do that here: http://api.jquery.com/load/).
Then, when the load is complete, you can check the names of the divs or spans used in the webpage and get their content with val (http://api.jquery.com/val/) or text (http://api.jquery.com/text/).
Here is text from the main page of www.isbn.nu:
Please note that isbn.nu is designed for manual searching by individuals. It is not intended as an information resource for automated retrieval, nor as a research tool for companies. isbn.nu reserves the right to deny access based on excessive requests.
Why not just use the free Google Books API, which returns book details in XML format? There are many classes available in Java to parse XML feeds, which would make your life much easier.
See http://code.google.com/apis/books/ for more info.
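If you go the XML route, the standard Java XML classes are enough. Here is a minimal sketch; the feed URL and the "title" tag name are placeholders, since the exact query URL and element names depend on the Books API version you use:

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.net.URL;

public class BooksFeedExample {
    public static void main(String[] args) throws Exception {
        // Placeholder URL -- substitute the real Books API query for your ISBN
        String feedUrl = "https://example.com/books-feed?q=isbn:0262033844";
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new URL(feedUrl).openStream());
        // Placeholder tag name -- use whatever element the feed stores the title in
        NodeList titles = doc.getElementsByTagName("title");
        for (int i = 0; i < titles.getLength(); i++) {
            System.out.println(titles.item(i).getTextContent());
        }
    }
}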
Here are the steps needed:
Create a cURL request (you can use multiple cURL requests)
Get the body data
Parse the data
Make the Excel file (a sketch of this step follows below)
You can read HTML information using this guide.
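For the last step, since the question mentions the JExcel (jxl) package, here is a minimal sketch of writing results to an .xls file (the row values are just stand-ins for whatever you parsed):

import jxl.Workbook;
import jxl.write.Label;
import jxl.write.WritableSheet;
import jxl.write.WritableWorkbook;
import java.io.File;

public class ExcelWriter {
    public static void main(String[] args) throws Exception {
        WritableWorkbook workbook = Workbook.createWorkbook(new File("books.xls"));
        WritableSheet sheet = workbook.createSheet("Books", 0);
        // Header row
        sheet.addCell(new Label(0, 0, "ISBN"));
        sheet.addCell(new Label(1, 0, "Title"));
        // One data row as a stand-in for the parsed results
        sheet.addCell(new Label(0, 1, "0262033844"));
        sheet.addCell(new Label(1, 1, "Example Title"));
        workbook.write();
        workbook.close();
    }
}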
A simple solution might be to use a Google Docs spreadsheet function like ImportXML(URL,path-expression).
More information and examples here:
http://www.seerinteractive.com/blog/importxml-cookbook/
http://www.distilled.net/blog/distilled/guide-to-google-docs-importxml/
http://blog.ouseful.info/2008/10/14/data-scraping-wikipedia-with-google-spreadsheets/
I want to plug my clients' websites into a system that I have. I need to be able to use some information that is in the website in order to improve the user experience in my system (automatically pre-filled forms, showing their address, etc.).
The problem I face is that my client's website provider will not code that feature (add a link passing the information I need). So my idea is to have a JavaScript file that will be included in all the pages (they are willing to do this, because it's only copy & paste)... and then this JavaScript code will somehow extract the data I need and create the link the way I need.
One thing that will help is that all my clients' websites are provided by the same companies, and they are all template-based. So all the websites from the same provider have the same HTML structure.
Do you know any other way of doing this? If JavaScript is the way to go, what's the best way to scrape the information?
Thanks!
I'm not sure if your 'system' is a web tool or a desktop program, but if it is a web tool, Dynamic Drive has a nice piece of JavaScript that can achieve the results you want without needing to modify the client's site:
Dynamic Ajax Content
Now, I'm guessing you may want to rearrange the content yourself and not display it exactly as it is on your client's site. So here's a quick modification of their loadpage() function so that you can capture the HTML in a variable (loadedContent):
var loadedContent;

function loadpage(page_request, containerid){
    if (page_request.readyState == 4 && (page_request.status == 200 || window.location.href.indexOf("http") == -1))
        loadedContent = page_request.responseText;
}
Now, if you follow the instructions on their page to set up and call the script, after it executes you will have the HTML of the page stored in loadedContent for you to play around with.
If you want to test it before you implement it, go to the link above, open your developer console, paste the modified code in, and hit Enter. This should replace their function on the fly. Now find their demo at the top and click on one of the different pages. Nothing visible should happen. Go back to your console and type in loadedContent. You should see the HTML they were trying to load stored there.
Hope this helps
I'm using JavaScript inside a SharePoint 2010 Content Editor Web Part (CEWP) to insert a Silverlight object. I need to do it this way instead of using a Silverlight Web Part because Silverlight Web Parts are not currently enabled. This is done entirely using JavaScript.
The problem occurs when I later go to edit the JavaScript inside the CEWP: I can see the original JavaScript that requests generation of the Silverlight object and, this is the strange part, the CEWP also has all of the generated HTML of the Silverlight object appended right there after the script.
So now, when I save, I save both the script that generates the Silverlight object AND the HTML that was previously generated, effectively duplicating the Silverlight object. If I edit again, I will then have three Silverlight objects, and so on.
You can see this in action for yourself with the following sample code:
Add a new Content Editor Web Part to a page in SharePoint 2010
Edit the source HTML
Add the following code:
<script type="text/javascript">document.write("Hello<br/>");</script>
Save the web part and you're done. Now, just keep editing the CEWP. Every time you click "Edit Web Part", "Hello" will be appended to your script.
How can I use Javascript to insert DOM elements and not have the generated HTML appear in the CEWP?
It's not working because SharePoint 2010 doesn't want you copying and pasting scripts into the editor. Instead, you should be putting your scripts inside a txt file (yes, that's right, a txt file) stored in SharePoint and then pointing the CEWP to use that file as the source.
First, create a file with all of your code (both JavaScript and HTML - basically everything you would normally have pasted into the content editor). Make sure to wrap your JavaScript in a <script type="text/javascript"> tag and save the file with a .txt extension, like "scripts.txt".
Next, add a CEWP to your page and select "Edit Web Part." In the content editor pane on the right, under "Content Link", add the URL to your txt file and click "Apply" and you're done.
Take a look at the following URL for a full description of this change in SharePoint 2010: http://sptwentyten.wordpress.com/2010/08/31/insert-javascript-into-a-content-editor-web-part-cewp/
Use jQuery - it is probably far safer than a document.write, which can break JavaScript further down the page.
Or use the code in this link to put pure HTML in the CEWP instead of dabbling with JavaScript:
http://karinebosch.wordpress.com/silverlight-meets-sharepoint/walkthrough-2-hosting-silverlight-3-in-a-content-query-web-part/
Another option is to use the HTML Form Web Part (in the Forms category). This can be used to connect to other web parts, but more simply it can be used to edit JavaScript directly in the web part. It seems that the rules for Content Editor Web Parts do not apply to HTML Form Web Parts, so it allows more flexibility.
More information from Microsoft is here:
http://office.microsoft.com/en-us/sharepoint-server-help/use-the-html-form-web-part-to-filter-and-display-data-in-another-web-part-HA101791813.aspx#_Toc274731120