I have tried PHP Simple HTML DOM on this issue without success.
I have now moved to DOMDocument and DOMXPath, which seems promising.
Here is my issue:
I am trying to scrape data from a page which is loaded via a web service request after the page initially shows. It is only milliseconds but because of this, normal scraping shows a template value as opposed to the actual data.
I have found the endpoint URL using the Network tab of Chrome's developer tools. If I enter that URL into the browser address bar, the data displays nicely in JSON format. All good.
My problem arises because any time the site is revisited or the page refreshed, the suffix of the endpoint URL is randomly generated, so I can't hard-code this URL into my PHP file. For example, the end of the URL is "?=253648592" on first visit but on refresh it could be "?=375482910". The base of the URL is static.
Without getting into headless browsers (I tried and MY head hurts!), is there a way to have XPath find this random URL when the page loads?
Sorry for being so long-winded but I wanted to explain as best I could.
It's probably much easier and faster to just use a regex if you only need one value from the HTML. I would like to give an exact example, but for that I would need a longer snippet of the HTML that contains the endpoint you want to fetch.
Could you post a snippet of the HTML that contains the endpoint?
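Until such a snippet is available, here is a minimal sketch of the regex idea, written in JavaScript. The base URL, the sample HTML, and the function names are all made up for illustration; the only thing taken from the question is that the suffix has the form "?=" followed by digits. It also assumes the full endpoint URL appears verbatim somewhere in the page source (for example inside an inline script). If the page assembles the URL from pieces at runtime, a regex over the HTML will not find it. The same pattern should carry over to PHP's preg_match.

```javascript
// Hypothetical static base of the endpoint; replace with the real one.
const ENDPOINT_BASE = "https://example.com/service/data";

// Escape regex metacharacters so the base URL is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return the full endpoint URL (base plus "?=<digits>") found in the page
// source, or null when the page source does not contain it.
function findEndpoint(html, base = ENDPOINT_BASE) {
  const pattern = new RegExp(escapeRegExp(base) + "\\?=\\d+");
  const match = html.match(pattern);
  return match ? match[0] : null;
}

// Made-up page source, for illustration only.
const sampleHtml =
  '<script>loadData("https://example.com/service/data?=253648592");</script>';
console.log(findEndpoint(sampleHtml));
```

The page source has to be downloaded fresh on every run, because the suffix found in one download is only valid for that visit.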
I am making a website with t-shirts. I dynamically generate preview cards for products using a JSON file, but I also need to generate content for an HTML page when a card is clicked. So, when I click on it, a new HTML page opens like product.html?product_id=id. I do not understand how to check for the id in the ?product_id=id part and, based on that id, generate the content for the page. Can anyone please link some guides or good solutions? I don't understand anything :(.
It sounds like you want the user's browser to ask the server to load a particular page based on the value of a variable called product_id.
The way a browser talks to a server is an HTTP Request, about which you can learn all the basics on javascript.info and/or MDN.
The ?product_id=id is called the 'query' part of the URL, about which you can learn more on MDN and Wikipedia.
A request that gets a page with this kind of URL from the server is usually a GET request. GET is the most common request type: it only asks the server for data, which makes it simpler to handle than a POST request, which is typically used to submit data.
You may notice some of the resources talking about AJAX requests (which are used to update part of the current page without reloading the whole thing), but you won't need to worry about this since you're just trying to have the browser navigate to a new page.
Your server needs to have some code to handle any such requests, basically saying:
"If anybody sends an HTTP GET request here, look at the value of the product_id variable and compare it to my available HTML files. If there's a match, send a response with the matching file, and if there's no match, send a page that says 'Error 404'."
That's the quick overview anyway. The resources will tell you much more about the details.
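The rule quoted above can be sketched as a plain JavaScript function that a server (for example one written in Node.js) could call for each GET request. The product ids, file names, and function name are hypothetical; this only illustrates the lookup, not a complete server.

```javascript
// Hypothetical map from product_id values to the HTML files the server has.
const PRODUCT_PAGES = {
  "1": "products/red-shirt.html",
  "2": "products/blue-shirt.html",
};

// Decide what to send back for a GET request to a URL such as
// "/product.html?product_id=1".
function resolveProductRequest(requestUrl, pages = PRODUCT_PAGES) {
  // The second argument is only a placeholder origin so that a relative
  // URL can be parsed.
  const url = new URL(requestUrl, "http://localhost");
  const productId = url.searchParams.get("product_id");
  const known =
    productId !== null &&
    Object.prototype.hasOwnProperty.call(pages, productId);
  if (known) {
    return { status: 200, file: pages[productId] };
  }
  // No product_id, or no matching file: answer with the error page.
  return { status: 404, file: "404.html" };
}

console.log(resolveProductRequest("/product.html?product_id=2"));
console.log(resolveProductRequest("/product.html?product_id=99"));
```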
There are several ways to get the parameters from the URL:
Get ID from URL with jQuery
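jQuery is not required for this. A minimal sketch using the browser's built-in URL API follows; the shape of the product entries ({ id, title }) and the function names are assumptions, since the question does not show the JSON file.

```javascript
// Read the product_id value from a page address such as
// "https://example.com/product.html?product_id=42".
// Returns null when the parameter is absent.
function getProductId(pageUrl) {
  return new URL(pageUrl).searchParams.get("product_id");
}

// Find the matching entry in the product list loaded from the JSON file.
// Ids are compared as strings because query values are always strings.
function findProduct(products, productId) {
  return products.find((p) => String(p.id) === productId) || null;
}

// Made-up data standing in for the contents of the JSON file.
const products = [
  { id: 41, title: "Plain white t-shirt" },
  { id: 42, title: "Striped t-shirt" },
];

const id = getProductId("https://example.com/product.html?product_id=42");
console.log(findProduct(products, id));
```

On product.html itself you would pass window.location.href to getProductId and then fill the page from the entry that findProduct returns.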
It would also make sense to learn what a REST API is and how to build your own, because I think you don't have a backend at the moment.
Here are some references:
https://www.conceptatech.com/blog/difference-front-end-back-end-development
https://www.tutorialspoint.com/nodejs/nodejs_restful_api.htm
You would think my problem would be so commonplace that there would be solutions all over the internet for it. But I can't find anything that really answers my question.
Let me summarise my situation:
I am using Open UI5.
I am coding an app which retrieves documents from various external websites. I want to display these documents inside my app, and not navigate to them, so I display the documents in an iframe. Haven't found any other way.
Some filetypes can be displayed natively, such as PDFs. Others, like Word, cannot - the easiest way I have found of displaying these is by using Google Docs, which implies changing the URL of the iframe's src from this :
http://example.com/my-target-doc.docx
to this:
http://docs.google.com/gview?url=example.com/my-target-doc.docx&embedded=true
Some of the external domains I retrieve the documents from require authentication. Therefore, I cannot set the iframe's src to http://docs.google.com/gview?url=example.com/my-target-doc.docx&embedded=true directly - Google docs would attempt to display the authentication page. I must keep the original URL, and then, once the user's authenticated, replace the document URL with the Google docs version of the same URL.
What I am trying to do, then, is use the iframe's "onload" event to get the currently loaded page's address and, if it is a .doc/.docx/.ppt etc, replace that same URL with the GD version of the URL.
The difficulty is that there is no extension at the end of the URL which points to the document - none of the URLs I need to use end with ".doc", ".ppt" or whatever, so parsing the URL is out.
So this is my question : Is there a way in Javascript to get the type of the content being returned? To be fair, I am pretty doubtful there is. Other ideas or alternatives are welcome. I am still actively looking for some.
Thanks!
Have you already looked at the Content-Type HTTP header? It can be read with JS, but you will probably have to request the file asynchronously to get it.
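A minimal sketch of that idea, assuming the fetch API is available and that the document server allows the request (a cross-origin request only exposes its headers if the server permits it via CORS, and some servers do not answer HEAD requests). The list of MIME types and the function names are illustrative, and encoding the document URL before passing it to the viewer is my choice rather than something taken from the question.

```javascript
// MIME types the browser cannot display natively (illustrative, incomplete).
const OFFICE_TYPES = new Set([
  "application/msword",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  "application/vnd.ms-powerpoint",
  "application/vnd.openxmlformats-officedocument.presentationml.presentation",
]);

// True when a Content-Type header value names one of the types above.
function isOfficeDocument(contentType) {
  if (!contentType) return false;
  // Drop parameters such as "; charset=utf-8" before comparing.
  const mimeType = contentType.split(";")[0].trim().toLowerCase();
  return OFFICE_TYPES.has(mimeType);
}

// Build the viewer URL in the form used in the question.
function toViewerUrl(documentUrl) {
  return (
    "http://docs.google.com/gview?url=" +
    encodeURIComponent(documentUrl) +
    "&embedded=true"
  );
}

// Request only the headers (HEAD) and pick the URL for the iframe's src.
// fetchImpl is injectable so the function can be tried without a network.
async function chooseIframeSrc(documentUrl, fetchImpl = fetch) {
  const response = await fetchImpl(documentUrl, { method: "HEAD" });
  const contentType = response.headers.get("Content-Type");
  return isOfficeDocument(contentType) ? toViewerUrl(documentUrl) : documentUrl;
}
```

Because the decision is based on the header rather than the file extension, this works for URLs that do not end in ".doc" or ".ppt".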
In my textbook the URL http://services.faa.gov/airport/status/SFO?format=application/JSON was provided. That link points to a page that provides the content of the original page in JSON format. I want to get another webpage's content as JSON, so I tried copying the method used (the link my professor provided for an assignment uses the same format), and I get nothing. The URL I tried is http://www.programmableweb.com/apitag/weather?format=application/JSON and clicking the link from here leads to a search of the website via a search engine. Copying and pasting that exact link just takes you to the actual webpage. My question is, why can't I just append ?format=application/JSON to any URL to get the JSON format of the webpage?
If it matters I'm trying to get JSON data to display via a Chrome extension.
My question is, why can't I just append ?format=application/JSON to any URL to get the JSON format of the webpage?
Because a URL is just data, and there is nothing standard about a query string parameter called "format". The server has to be designed to give you JSON before it can or will do that.
That particular website simply provides a feature where you can get the same data in an alternate format such as JSON. Not all websites provide features like that, and not all of them implement it with the same URL parameter. Some sites may have URLs ending with .html be HTML pages and ones ending with .json provide the same info in JSON. Others might provide a separate API. You might check that website to see if it has a "developers" section that gives information on their API, if they have one.
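Since it is the server that decides what comes back, the client can only ask and then check what it actually received. A minimal sketch, assuming the fetch API; the function name is made up, and a server is free to ignore the Accept header just as it ignores an unknown "format" parameter.

```javascript
// Fetch a URL and parse the body as JSON only if the server says it is JSON.
// fetchImpl is injectable so the function can be tried without a network.
async function fetchJsonIfOffered(url, fetchImpl = fetch) {
  const response = await fetchImpl(url, {
    // A hint only: the server may or may not honour it.
    headers: { Accept: "application/json" },
  });
  const contentType = response.headers.get("Content-Type") || "";
  if (!contentType.toLowerCase().includes("json")) {
    throw new Error(
      "Server returned " + (contentType || "an unknown type") + ", not JSON"
    );
  }
  return response.json();
}
```

Called on a page that only serves HTML, this rejects with an error instead of failing later inside the JSON parser.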
I was wondering if I could get some data from another website to get it displayed on mine. The good example can be alexa.com. I need to display Alexa traffic rank and reputation in a div for example on my page, so it will be changed dynamically each time Alexa change its data.
Thank you for your help.
One way is to make an Ajax request to the Alexa.com site and, once you have received all the HTML, use jQuery or something similar to scrape it for the div you want.
It feels kind of dirty, but it's an easy way to get what you want. This assumes their page content isn't loaded dynamically, though.
Edit: See this for more info: Request external website data using jQuery ajax
Use Yahoo YQL (instead of a server-side proxy script in PHP or similar).
I have a sneaking suspicion that you do not own or control the external site, so getting content from it would fall under the cross-domain security restrictions of a modern browser.
So in order to regain 'power to the user', just use http://query.yahooapis.com/.
jQuery would not be strictly needed.
EXAMPLE 1:
Using the SQL-like command:
select * from html
where url="http://stackoverflow.com"
and xpath='//div/h3/a'
The following link will scrape SO for the newest questions (bypassing cross-domain security bull$#!7):
http://query.yahooapis.com/v1/public/yql?q=select%20title%20from%20html%20where%20url%3D%22http%3A%2F%2Fstackoverflow.com%22%20and%0A%20%20%20%20%20%20xpath%3D%27%2F%2Fdiv%2Fh3%2Fa%27%0A%20%20%20%20&format=json&callback=cbfunc
As you can see, this returns a JSON array (one can also choose XML) and calls the callback function cbfunc.
Indeed, as a 'bonus' you also save a kitten every time you do not need to regex data out of 'tag-soup'.
Do you hear your little mad scientist inside yourself starting to giggle?
Then see this answer for more info (and don't forget its comments for more examples).
Good Luck!
I've coded an HTML page using jQuery for loading content. Now if I want to link directly to a submenu, is this possible to do with JavaScript?
So for example if someone goes to www.mydomain.com/submenu1/
then some JavaScript code will execute and load the needed contents?
Thanks a lot :)
Is it possible to realize that with htaccess?
You will more likely want to have a URL structure that only needs a page to load from the server once, then the server is only queried by JavaScript XMLHttpRequests. Loading content based on a "hard" URL would be pointless, since you're doing a server request anyways and might as well return the content in the response.
For keeping addresses unique while still keeping the "hard" URL the same (preventing multiple server requests), you can use the hash/anchor part of the URL. This means your address might look something like this: http://www.example.com/#/submenu1/
The #/submenu1/ part stays on the client, so only / on www.example.com is requested. Then it's up to your JavaScript to load the content relevant to /submenu1/. See a page of mine for an example of this: http://blixt.org/js#project/hash?view=code
Also have a look at this question: Keeping history of hash/anchor changes in JavaScript
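The hash approach can be sketched as follows. The "home" default and the loadContent callback are hypothetical; loadContent stands for whatever jQuery call you already use to fetch and insert the content.

```javascript
// Turn a location.hash value such as "#/submenu1/" into a route name
// ("submenu1"). An empty hash maps to "home", a made-up default.
function routeFromHash(hash) {
  const route = hash.replace(/^#\/?/, "").replace(/\/$/, "");
  return route || "home";
}

// Browser-only wiring: load the matching content whenever the hash changes.
function startRouting(loadContent) {
  const update = () => loadContent(routeFromHash(window.location.hash));
  window.addEventListener("hashchange", update);
  // Also run once at startup, so that a direct link such as
  // http://www.example.com/#/submenu1/ shows the right content.
  update();
}

console.log(routeFromHash("#/submenu1/"));
```

Calling startRouting once after the page has loaded is enough; menu links then only need to point at addresses like #/submenu1/.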