Web crawler encounters a javascript: href

I'm new to web crawling. I am trying to crawl a web page using Java and have run into a problem. I need to get the link from an <a> tag whose href is a JavaScript function call, and I have no idea how to get the link out of that function. Here are the HTML source and the JavaScript source.
HTML
<a href='javascript:ShowPostGridUnique(205316,0);'>link</a>
JS (ShowPostGridUnique)
function ShowPostGridUnique(parentpostid, pageShow) {
    //alert(parentpostid);
    var divid;
    divid = 'divPostContent' + parentpostid;
    if (document.getElementById(divid).className == 'divGridShow') {
        document.getElementById(divid).className = 'divGridHide';
        document.getElementById(divid).innerHTML = '';
    }
    else {
        document.getElementById(divid).className = 'divGridShow';
        // call server side method
        PageMethods.divParentInnerHtml( parentpostid, pageShow, CallSuccessShowPost, CallFailedAlert, parentpostid);
        try {
            divid = 'TDtitle' + parentpostid;
            document.getElementById(divid).className = 'TDtitle';
            divid = 'TDPage' + parentpostid;
            document.getElementById(divid).className = 'TDtitle';
        }
        catch (err) {
            //Handle errors here
        }
    }
}
How can I get the link behind this href? Thanks.

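Use a headless browser such as PhantomJS. It executes the page's JavaScript, so your crawler can trigger the link and read the content that ShowPostGridUnique loads.
http://phantomjs.org/
Use GhostDriver with Selenium to control PhantomJS:
https://github.com/SeleniumHQ/selenium
https://github.com/detro/ghostdriver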
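If driving a full browser is too heavy, a lighter option may be to read the arguments straight out of the href. In this case the href holds no URL at all: ShowPostGridUnique toggles a div and asks the server for its content through PageMethods.divParentInnerHtml, so the post id (205316) and the page index (0) are the only data the link carries. The sketch below shows just the parsing step in JavaScript; parseJavascriptHref is a hypothetical helper name, and the same regular expression idea could be ported to Java. It assumes simple arguments with no commas or parentheses inside them. How the server-side method must then be requested is not shown in the question, so that part is left out.

```javascript
// Parse an href of the form "javascript:FunctionName(arg1, arg2);"
// and return the function name plus its raw argument strings.
// Hypothetical helper: assumes arguments contain no commas or parentheses.
function parseJavascriptHref(href) {
  const pattern = /^javascript:\s*([A-Za-z_$][\w$]*)\s*\(([^)]*)\)\s*;?\s*$/;
  const match = pattern.exec(href.trim());
  if (!match) {
    return null; // not a simple "javascript:fn(args)" link
  }
  const rawArgs = match[2].trim();
  const args = rawArgs === "" ? [] : rawArgs.split(",").map((arg) => arg.trim());
  return { name: match[1], args: args };
}

const parsed = parseJavascriptHref("javascript:ShowPostGridUnique(205316,0);");
console.log(parsed.name, parsed.args.join(","));
```

The crawler can then decide per function name what to do with the extracted values, and fall back to a headless browser for links this simple pattern does not match.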

Related

Is there a way to jump back and forth from python and JS?

I'm trying to create a Chrome extension, which has to be written in JS. However, I built the main part of the code in Python because it is the language I know, and I'm not sure whether what I'm trying to do is even possible in JS. Is there a way to take a variable I get in JS, use it to run a Python function, and then continue with the JS code?
chrome.browserAction.onClicked.addListener(function (activeTab) {
    chrome.tabs.getSelected(null, function (tab) {
        var tabId = tab.id;
        // declared with var so they do not leak as implicit globals
        var tabUrl = tab.url;
        var id = tabUrl.split("=")[1];
        //call python func here
    });
    var newURL = "frame.html";
    chrome.tabs.create({ url: newURL });
});
from youtube_transcript_api import YouTubeTranscriptApi
import webbrowser
def func(id):
    # Fetch the transcript segments for the given video id.
    transcript = YouTubeTranscriptApi.get_transcript(id)
    webbrowser.get('C:/Program Files/Google/Chrome/Application/chrome.exe %s')
    data = '<html>\n<head>\n<link rel="stylesheet" href="styles.css">\n</head>\n<body>\n<p>\n'
    for segment in transcript:
        data += segment.get('text') + ' '
    data += '</p>\n</body>\n</html>'
    # Write the page; the with block closes the file even if writing fails.
    with open('frame.html', 'w') as html_file:
        html_file.write(data)
# func(id from JS)
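Extension code runs in the browser and cannot call a Python function directly. One possible approach, sketched here under assumptions, is to run the Python code as a small local HTTP server and have the extension request it. The endpoint http://localhost:5000/transcript, its id parameter, and all function names below are hypothetical; the extension would likely also need permission to reach that host. The sketch also reads the v query parameter instead of splitting on "=", which returns the wrong value when the URL carries more than one parameter.

```javascript
// Hypothetical local endpoint served by the Python process; adjust to your setup.
const TRANSCRIPT_ENDPOINT = "http://localhost:5000/transcript";

// Read the "v" query parameter from a YouTube watch URL.
function extractVideoId(tabUrl) {
  return new URL(tabUrl).searchParams.get("v");
}

// Build the request URL, percent-encoding the id.
function buildTranscriptRequestUrl(videoId) {
  return TRANSCRIPT_ENDPOINT + "?id=" + encodeURIComponent(videoId);
}

// Ask the Python side for the generated HTML, then continue in JS.
async function fetchTranscriptHtml(tabUrl) {
  const videoId = extractVideoId(tabUrl);
  const response = await fetch(buildTranscriptRequestUrl(videoId));
  if (!response.ok) {
    throw new Error("Transcript request failed: " + response.status);
  }
  return response.text();
}

console.log(buildTranscriptRequestUrl(extractVideoId("https://www.youtube.com/watch?v=abc123&t=10")));
```

With this split, the Python side would return the HTML as the response body instead of writing frame.html to disk, and the JS side would continue once fetchTranscriptHtml resolves.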

.NET Core MVC: GET controller method should return a file download, but it is not working

I have a List<model> that I convert to JSON in JavaScript. When I click a button, I call a controller method and pass the JSON as a parameter like this:
$('#exceldownload').click(function(){
    var json = @Html.Raw(Newtonsoft.Json.JsonConvert.SerializeObject(Model.ReportListModel,Newtonsoft.Json.Formatting.Indented));
    json = JSON.stringify(json);
    window.location = "@Url.Action("ReportExcel","Report")?model="+json+"";
});
And the controller code:
public FileResult ReportExcel(string model)
{
    var b = JsonConvert.DeserializeObject<List<ReportListModel>>(model);
    if (b.Count == 0)
    {
        return File(Encoding.UTF8.GetBytes("empty"), "text/plain", "empty");
    }
    else
    {
        DataTable table = (DataTable)JsonConvert.DeserializeObject(JsonConvert.SerializeObject(b), (typeof(DataTable)));
        using (var excelPack = new ExcelPackage())
        {
            var ws = excelPack.Workbook.Worksheets.Add("WriteTest");
            ws.Cells.LoadFromDataTable(table, true, OfficeOpenXml.Table.TableStyles.Light8);
            var FileBytesArray = excelPack.GetAsByteArray();
            return File(FileBytesArray, "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "test.xlsx");
        }
    }
}
But when I click the button, I get this:
This site can't be reached,
Localhost refused to connect,
ERR_CONNECTION_CLOSED
I want the Excel file to download when I click the button.
It's crashing at this line:
window.location = "@Url.Action("ReportExcel","Report")?model="+json+"";
Change it to
window.location = "@Url.Action("ReportExcel","Report")" + "?model=" + json;
Try debugging the ASP.NET code. An internal server error is probably occurring.
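One more thing worth checking is the query string itself. The JSON contains quotes, braces, spaces and line breaks (Formatting.Indented adds the latter), and none of these belong in a URL unencoded. A minimal sketch of encoding the payload before navigating is shown below. buildDownloadUrl and sampleRows are hypothetical names, and "/Report/ReportExcel" only stands in for whatever Url.Action renders in the page. For large lists the encoded URL may still exceed the server's request length limit, in which case a POST would be the safer route.

```javascript
// Build the download URL with the JSON payload percent-encoded.
// Hypothetical helper; actionUrl stands in for the rendered Url.Action value.
function buildDownloadUrl(actionUrl, reportRows) {
  const json = JSON.stringify(reportRows);
  return actionUrl + "?model=" + encodeURIComponent(json);
}

// Example rows; the real data would come from Model.ReportListModel.
const sampleRows = [{ Name: "A & B", Total: 3 }];
const downloadUrl = buildDownloadUrl("/Report/ReportExcel", sampleRows);
console.log(downloadUrl);
// In the page: window.location = downloadUrl;
```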

Cannot delete file because it is being used by another process, ASP.NET Core MVC

I am using ASP.NET Core with MVC to create an app. I am currently using Visual Studio and IIS Express.
Below is my current project structure:
*project directory
-wwwroot
-areas
-attachments
-controllers
-models
-views
I currently store images inside the attachments folder.
Previously I wrote something like this in my Startup.cs:
app.UseStaticFiles(new StaticFileOptions
{
FileProvider = new PhysicalFileProvider(
Path.Combine(Directory.GetCurrentDirectory(), "Attachments")),
RequestPath = "/Attachments"
});
I have also done something like this:
appendImage(@Url.Content("~/Attachments/")+result.fileName);
I did this to display an image in my view. The image is displayed successfully.
What I am trying to achieve now is to let the user choose, in the UI, to delete files inside that attachments folder.
I tried the following code:
string contentRootPath = _hostingEnvironment.ContentRootPath;
string fullImagePath = Path.Combine(contentRootPath + "\\Attachments", currentItemToDelete.FileName);
if (System.IO.File.Exists(fullImagePath))
{
    try{
        System.IO.File.Delete(fullImagePath);
    }catch(Exception e){
        operationResult = "Attachment Path. Internal Server Error";
    }
}
Execution does enter the if (System.IO.File.Exists(fullImagePath)) block,
but an exception is raised when it reaches System.IO.File.Delete. The exception states that the file at that path is being used by another process, so I cannot delete it. The only process accessing the file is the web app I am creating and debugging at the same time. How do I prevent this exception? Do I have to use some other code to delete the file?
EDIT to include more details:
Inside my view(index.cshtml):
appendImage is a javascript function:
function appendImage(imgSrc) {
    var imgElement = document.createElement("img");
    imgElement.setAttribute('src', imgSrc);
    if (imgSrc.includes(null)) {
        imgElement.setAttribute('alt', '');
    }
    imgElement.setAttribute('id', "img-id");
    var imgdiv = document.getElementById("div-for-image");
    imgdiv.appendChild(imgElement);
}
That function is called below:
$.ajax({
    url:'@Url.Action("GetDataForOneItem", "Item")',
    type: "GET",
    data: { id: rowData.id },
    success: function (result) {
        removeImage();
        appendImage(@Url.Content("~/Attachments/")+result.fileName);
        $("#edit-btn").attr("href", '/Item/EditItem?id=' + result.id);
    },
    error: function (xhr, status, error) {
    }
});
After calling appendImage(), I change the href of an <a> tag. When the user clicks the link, they are taken to another page (edit.cshtml). On that page, the image at that path is also displayed, with code like this:
<img src="@Url.Content("~/Attachments/"+Model.FileName)" alt="item image" />
The new page (edit.cshtml) has a delete button. Clicking it sends execution to this controller function:
[HttpPost]
public string DeleteOneItem(int id)
{
    //query the database to check if there is image for this item.
    var currentItemToDelete = GetItemFromDBDateFormatted(id);
    if (!string.IsNullOrEmpty(currentItemToDelete.FileName))
    {
        //delete the image from disk.
        string contentRootPath = _hostingEnvironment.ContentRootPath;
        string fullImagePath = Path.Combine(contentRootPath + "\\Attachments", currentItemToDelete.FileName);
        if (System.IO.File.Exists(fullImagePath))
        {
            try
            {
                System.IO.File.Delete(fullImagePath);
            }catch(Exception e)
            {
            }
        }
    }
    return "";
}
EDIT to answer question:
Add
System.GC.Collect();
System.GC.WaitForPendingFinalizers();
before System.IO.File.Delete.
You can replace your C# method DeleteOneItem with the code below; it might work.
[HttpPost]
public string DeleteOneItem(int id)
{
    //query the database to check if there is image for this item.
    var currentItemToDelete = GetItemFromDBDateFormatted(id);
    if (!string.IsNullOrEmpty(currentItemToDelete.FileName))
    {
        //delete the image from disk.
        string contentRootPath = _hostingEnvironment.ContentRootPath;
        string fullImagePath = Path.Combine(contentRootPath + "\\Attachments", currentItemToDelete.FileName);
        if (System.IO.File.Exists(fullImagePath))
        {
            try
            {
                System.GC.Collect();
                System.GC.WaitForPendingFinalizers();
                System.IO.File.Delete(fullImagePath);
            }
            catch (Exception e) { }
        }
    }
    return "";
}
try
{
    System.GC.Collect();
    System.GC.WaitForPendingFinalizers();
    System.IO.File.Delete(fullImagePath);
}
catch(Exception e){
}

How to show a notice message before redirecting to the login view when using ASP.NET Identity?

I use the ASP.NET Identity framework in my app. When the session expires, I need to show a prompt message and then jump to the login page, instead of jumping to the login page directly. The prompt uses custom styles.
Because my app's left menu loads views with Ajax, I overrode the AuthorizeAttribute.HandleUnauthorizedRequest method to return JSON. Now when users click the left menu, it works properly. But if users refresh the page by pressing F5, the page still jumps directly to the login page.
Here is my override of AuthorizeAttribute.HandleUnauthorizedRequest:
protected override void HandleUnauthorizedRequest(AuthorizationContext filterContext)
{
    var httpContext = filterContext.HttpContext;
    string sintab = httpContext.Request["inTab"];
    if (!string.IsNullOrEmpty(sintab) && bool.Parse(sintab))
    {
        var result = new JsonResult();
        result.Data = new
        {
            Authorize = false,
            Url = LOGIN_URL
        };
        result.JsonRequestBehavior = JsonRequestBehavior.AllowGet;
        filterContext.Result =result;
        return;
    }
    if (filterContext.Controller.GetType() != typeof(Controllers.HomeController) &&
        !filterContext.ActionDescriptor.ActionName.Equals("Index", StringComparison.OrdinalIgnoreCase))
    {
        string returnUrl = "/" + filterContext.Controller.GetType().Name.Replace("Controller","") + "/Index" ;
        returnUrl = httpContext.Server.UrlEncode(returnUrl);
        httpContext.Response.Redirect("~/Account/Login?ReturnUrl="+returnUrl);
        return;
    }
    base.HandleUnauthorizedRequest(filterContext);
}
The left menu's loadView JS code:
$.get(url, null, function (html) {
    html = html.replace(/#%/g, "\"").replace(/%#/g, "\"");
    var json;
    try {
        json = eval("(" + html + ")");
    } catch (e) {
    }
    if (json && !json.Authorize) {
        // show a message
        layer.alert("Session timeout, please re login.", function (index) {
            window.location.href = json.Url + "?returnurl=" + encodeURI(hash);
        });
    }
    else {
        $("#content").empty().html(html);
        _initModalButton();
        $("#content").show();
    }
}, 'html');
The page looks like this image
I want to know whether there is a better way to do this, because many other buttons need to check the authorization status and show the message before jumping to the login page. Also, how can I show the message when users refresh the page?
Thanks very much!
I think what you're looking for are jQuery's global Ajax events.
Please check them out; I think they will make your job easier.
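As a rough illustration of that idea, the sketch below registers a single jQuery ajaxComplete handler on the document, so every button that makes an Ajax call gets the same check. It assumes the server returns plain JSON of the form {"Authorize": false, "Url": "..."} as built in HandleUnauthorizedRequest; getLoginRedirect is a hypothetical helper name, and using window.location.hash for the return URL is also an assumption. The question's code applies extra #% replacements before parsing, which this sketch does not.

```javascript
// Return the login URL if the response body is the "unauthorized" JSON,
// otherwise null. Hypothetical helper; assumes the body is plain JSON.
function getLoginRedirect(responseText) {
  let payload;
  try {
    payload = JSON.parse(responseText);
  } catch (e) {
    return null; // ordinary HTML, not JSON
  }
  if (payload && typeof payload === "object" && payload.Authorize === false) {
    return payload.Url;
  }
  return null;
}

// Register once; the handler runs after every jQuery Ajax request on the page.
// Guarded so the helper above can also be exercised outside a browser.
if (typeof window !== "undefined" && typeof $ !== "undefined") {
  $(document).ajaxComplete(function (event, xhr) {
    const loginUrl = getLoginRedirect(xhr.responseText);
    if (loginUrl) {
      layer.alert("Session timeout, please log in again.", function () {
        window.location.href = loginUrl + "?returnurl=" + encodeURIComponent(window.location.hash);
      });
    }
  });
}
```

A global handler does not replace the per-request callbacks, so loadView would still need to avoid inserting the JSON into #content. It also only covers Ajax requests; a full page refresh with F5 never reaches this code.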

AJAX parse + Yahoo YQL returning no results?

I'm working on a script that gets all the <table> elements from an external website by going through Yahoo's YQL. This worked fine until recently, but it stopped working as of today. I'm not entirely sure why; all websites used to work with this code:
<script type="text/javascript">
    $(document).ready(function () {
        var container = $('#target');
        function doAjax(url) {
            if (url.match('^http')) {
                $.getJSON("http://query.yahooapis.com/v1/public/yql?"
                    + "q=select%20*%20from%20html%20where%20url%3D%22"
                    + encodeURIComponent(url)
                    + "%22&format=xml'&callback=?",
                    function (data) {
                        if (data.results[0]) {
                            var fullResponse = $(filterData(data.results[0])),
                                justTable = fullResponse.find("body");
                            container.append(justTable);
                        } else {
                            var errormsg = '<p>Error: could not load the page.</p>';
                            container.html(errormsg);
                        }
                    });
            } else {
                $('#target').load(url);
            }
        }
        function filterData(data) {
            data = data.replace(/<?\/body[^>]*>/g, '');
            data = data.replace(/[\r|\n]+/g, '');
            // strip HTML comments (they open with "<!--")
            data = data.replace(/<!--[\S\s]*?-->/g, '');
            data = data.replace(/<noscript[^>]*>[\S\s]*?<\/noscript>/g, '');
            data = data.replace(/<script[^>]*>[\S\s]*?<\/script>/g, '');
            data = data.replace(/<script.*\/>/, '');
            data = data.replace(/<img[^>]*>/g, '');
            return data;
        }
        doAjax('http://www.google.com');
    });
</script>
I changed the URL to Google and changed the code to find the <body> tag instead of <table> tags to better show that it's not working. I looked at the URL it's requesting, and it's not returning any content. I'm not sure what the problem is, though.
Have you checked whether the external website you are crawling has changed its structure?
If it worked before and no longer does, my guess is that the site structure has changed.
It looks like the problem was that YQL was down. I just tested it again and it worked fine. I wish they would tell us in the future when an outage occurs.
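Since an outage can return an empty or differently shaped response, the callback could guard against that before indexing into data.results. As written, data.results[0] throws when results is missing, so the error message in the else branch never appears. A minimal sketch of such a guard follows; firstResult is a hypothetical helper name, and the assumed response shape (a results array of strings) is taken from the question's own callback.

```javascript
// Return the first result string, or null when the response is empty
// or not shaped as expected (for example during an outage).
function firstResult(data) {
  if (data && Array.isArray(data.results) && data.results.length > 0) {
    return data.results[0];
  }
  return null;
}

console.log(firstResult({ results: ["<body>hi</body>"] }));
console.log(firstResult({ results: [] }));
console.log(firstResult(undefined));
```

In the callback, a null return would route to the existing "could not load the page" branch instead of raising an exception.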
