Tesseract.js doesn't recognize Arabic language


I'm using the tesseract.js OCR library to read the text in an image and write it to the console or to a text file. It works fine with English words and characters, but when I try to read Arabic text from an image it doesn't work. This is the image that I'm trying to read
and this is my code:
Head Tag:-
<script src='https://unpkg.com/tesseract.js@v2.1.0/dist/tesseract.min.js'></script>
Body Tag:-
<script>
Tesseract.recognize(
'image.png',
'ara',
{ logger: m => console.log(m) }
).then(({ data: { text } }) => {
})
</script>

You need to handle the result inside the .then() block.
Here's a working example :
<!DOCTYPE html>
<meta charset="utf-8">
<title> OCR TEST </title>
<script src="https://unpkg.com/tesseract.js@v2.1.0/dist/tesseract.min.js"></script>
<output id="result">Processing...</output>
<script>
const output = document.querySelector('output#result');
Tesseract.recognize(
'https://i.imgur.com/mdzmK4w.png', 'ara'
).then(result => {
output.value = result.data.text;
}).catch(err => {
output.value = 'Processing Failed';
console.log(err);
});
</script>
It can take some time to process, especially on the first load because the WASM module and OCR training data will be fetched in the background first. You can see this happening in the Network tab of your Dev Tools.
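Because the first run has to download the WASM module and the Arabic training data, it may help to show progress on the page while waiting. The sketch below reuses the logger option from the question's code and assumes each logger message carries a status string and a progress fraction between 0 and 1. formatProgress is a hypothetical helper for illustration, not part of tesseract.js.

```javascript
// Hypothetical helper: turn a tesseract.js logger message into a short status line.
// Assumes the message looks like { status: string, progress: number in [0, 1] }.
function formatProgress(m) {
  const percent = Math.round((m.progress || 0) * 100);
  return `${m.status}: ${percent}%`;
}

// Browser-only usage; skipped when Tesseract or the DOM is not available.
if (typeof Tesseract !== 'undefined' && typeof document !== 'undefined') {
  const output = document.querySelector('output#result');
  Tesseract.recognize('https://i.imgur.com/mdzmK4w.png', 'ara', {
    // Show the current stage instead of a static "Processing..." message.
    logger: (m) => { output.value = formatProgress(m); },
  }).then((result) => {
    output.value = result.data.text;
  }).catch((err) => {
    output.value = 'Processing Failed';
    console.log(err);
  });
}
```

The output element then shows the current stage until the recognized text replaces it.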

Related

Extract text with cheerio

I'm trying to write a script to extract email id and name from this website. I tried the following snippet but it doesn't work.
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title>foo</title>
<meta name="description" content="">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="">
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
</head>
<body>
<div>
<strong style="color: darkgreen;">Can read this tag</strong>
<object id="external_page" type="text/html" data="https://aleenarais.com/buddy/" width="800px" height="600px"
style="overflow:auto;border:5px ridge blue">
<!-- I want to read tag values from this object -->
</object>
</div>
<script>
window.addEventListener('load', function () {
const item = [];
$('strong[style="color: darkgreen;"]').each(function () {
item.push($(this).text())
})
console.log(item)
})
</script>
</body>
</html>
Is there any better way to do this? Or is it possible to convert the whole page into a string and extract the email using RegEx?

The email and name on the web page are rendered in an iframe whose source is an external site. To extract the information, you need to use a headless browser.
I would suggest using Node.js and Puppeteer (https://www.npmjs.com/package/puppeteer)
const puppeteer = require("puppeteer");
(async() => {
const url = "https://aleenarais.com/buddy/";
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url, {
waitUntil: "networkidle0"
});
var frames = await page.frames();
var myframe = frames.find(
(f) => f.url().indexOf("https://feedium.app/fetchh.php") > -1
);
const textFeed = await myframe.$$eval("strong", (sElements) =>
sElements.map((el) => el.textContent)
);
console.log(textFeed.splice(1)); //Array contains both name and email
await browser.close();
})();
Puppeteer loads the page much as a user's browser would. It waits until all network calls are done (see networkidle0) and then looks for the iframe whose URL contains fetchh.php. If you inspect that frame, the name and email are in strong tags, and those are the only strong tags available. Hence we extract the strong tags, drop the count, and are left with just the name and email.
Output:
[ 'JJ', 'j*j@gmail.com' ] //I have just masked the values but the program gives the actual ones
Steps to run the script:
Install Node.js (https://nodejs.org/en/download/)
Install Puppeteer (npm i puppeteer)
Copy the script into a file (demo.js)
In the terminal, navigate to the directory that contains demo.js and run node demo.js
You should see the output.
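On the regex part of the question: once the frame's text is available as a string (for example, the strong-tag texts joined together), a pattern match can pull out the addresses. This is a minimal sketch that assumes a deliberately simple pattern, which will not cover every valid email address. extractEmails is a hypothetical helper name.

```javascript
// Hypothetical helper: return every substring that looks like an email address.
// The pattern is intentionally simple (local part, "@", domain, dot, 2+ letter TLD).
function extractEmails(text) {
  const pattern = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
  return text.match(pattern) || [];
}

// Example with made-up input text.
console.log(extractEmails('Name: JJ, contact jj@example.com or x.y@test.org'));
```

With Puppeteer, the input could be the frame's text content rather than a hard-coded string.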

Try this:
window.addEventListener('load', function () {
// Collect the text of every matching strong tag
const items = [];
$('strong[style*="color: darkgreen;"]').each(function () {
items.push($(this).text())
})
console.log(items)
})

Issues in Word file generated using html-docx-js-typescript

I used the library named in the title to generate a Word file from HTML content, but the generated Word file contains unknown symbols and characters.
Html file -
Generated word file -
My sample code -
const opt = {
margin: {
top: 100
},
orientation: 'portrait' as const
};
const doc = document.getElementById('my-doc');
const contract = new XMLSerializer().serializeToString(doc);
asBlob(contract, opt).then(data => {
saveAs(data, 'my_word.docx');
});

Hopefully I'm not too late to answer this question. You need to add
<!DOCTYPE html> <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head> to your HTML output.
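One way to apply this is to wrap the serialized markup before passing it to the converter. The following is a rough sketch rather than the library's prescribed approach: wrapWithCharset is a hypothetical helper, and it assumes the serialized element can simply be placed inside a body tag.

```javascript
// Hypothetical helper: wrap an HTML fragment in a full document that declares UTF-8.
function wrapWithCharset(fragment) {
  return '<!DOCTYPE html>' +
    '<html><head>' +
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">' +
    '</head><body>' + fragment + '</body></html>';
}

// Usage with the question's code (browser only; asBlob, saveAs, doc and opt come from the question):
// const contract = new XMLSerializer().serializeToString(doc);
// asBlob(wrapWithCharset(contract), opt).then(data => saveAs(data, 'my_word.docx'));
```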

Chrome Extension to pull text from script

I'm attempting to pull "webId%22:%22" var string from a script tag using JS for a Chrome extension. The example I'm currently working with allows me to pull the page title.
// payload.js
chrome.runtime.sendMessage(document.title);
// popup.js
window.addEventListener('load', function (evt) {
chrome.extension.getBackgroundPage().chrome.tabs.executeScript(null, {
file: 'payload.js'
});
});
// Listen to messages from the payload.js script and write to popout.html
chrome.runtime.onMessage.addListener(function (message) {
document.getElementById('pagetitle').innerHTML = message;
});
//HTML popup
<!doctype html>
<html>
<head>
<title>WebID</title>
<script src="popup.js"></script>
<script src="testButton.js"></script>
</head>
<body>
<button onclick="myFunction()">Try it</button>
<p id="body"></p>
<h3>It's working</h3>
<p id='pagetitle'>This is where the webID would be if I could get this stupid thing to work.</p>
</body>
</html>
What I am trying to pull is the webID from below:
<script if="inlineJs">;(function() { window.hydra = {}; hydra.state =
JSON.parse(decodeURI("%7B%22cache%22:%7B%22Co.context.configCtx%22:%7B%22webId%22:%22gmps-salvadore%22,%22locale%22:%22en_US%22,%22version%22:%22LIVE%22,%22page%22:%22HomePage%22,%22secureSiteId%22:%22b382ca78958d10048eda00145edef68b%22%7D,%22features%22:%7B%22directivePerfSwitch%22:true,%22enable.directive.localisation%22:true,%22enable.directive.thumbnailGallery%22:true,%22enable.new.newstaticmap%22:false,%22disable.forms.webId%22:false,%22use.hydra.popup.title.override.via.url%22:true,%22enable.directive.geoloc.enableHighAccuracy%22:true,%22use.hydra.theme.service%22:true,%22disable.ajax.options.contentType%22:false,%22dealerLocator.map.use.markerClustering%22:true,%22hydra.open.login.popup.on.cs.click%22:false,%22hydra.consumerlogin.use.secure.cookie%22:true,%22use.hydra.directive.vertical.thumbnailGallery.onpopup%22:true,%22hydra.encrypt.data.to.login.service%22:true,%22disable.dealerlocator.fix.loading%22:false,%22use.hydra.date.formatting%22:true,%22use.hydra.optimized.style.directive.updates%22:false,%22hydra.click.pmp.button.on.myaccount.page%22:true,%22use.hydra.fix.invalid.combination.of.filters%22:true,%22disable.vsr.view.from.preference%22:false%7D%7D,%22store%22:%7B%22properties%22:%7B%22routePrefix%22:%22/hydra-graph%22%7D%7D%7D")); }());</script>

You have inline code in onclick attribute which won't work, see this answer.
You can see an error message in devtools console for the popup: right-click it, then "Inspect".
The simplest solution is to attach a click listener in your popup.js file.
The popup is an extension page and thus it can access chrome.tabs.executeScript directly without chrome.extension.getBackgroundPage(), just don't forget to add the necessary permissions in manifest.json, preferably "activeTab".
For such a simple data extraction you don't even need a separate content script file or messaging, just provide code string in executeScript.
chrome.tabs.executeScript({
code: '(' + (() => {
for (const el of document.querySelectorAll('script[if="inlineJs"]')) {
const m = el.textContent.match(/"([^"]*%22webId%22[^"]*)"/);
if (m) return m[1];
}
}) + ')()',
}, results => {
if (!chrome.runtime.lastError && results) {
document.querySelector('p').textContent = decodeURI(results[0]);
}
});
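The callback above shows the whole decoded state string. If only the webId is wanted, the decoded string can be parsed as JSON. This sketch assumes the structure seen in the question's script tag (cache, then "Co.context.configCtx", then webId); extractWebId is a hypothetical helper name.

```javascript
// Hypothetical helper: decode the URI-encoded JSON string and read webId from it.
// Assumes the structure from the question: cache -> "Co.context.configCtx" -> webId.
function extractWebId(encodedState) {
  try {
    const state = JSON.parse(decodeURI(encodedState));
    const ctx = state.cache && state.cache['Co.context.configCtx'];
    return ctx && ctx.webId ? ctx.webId : null;
  } catch (err) {
    return null; // malformed encoding or JSON
  }
}

// Shortened sample in the same encoding as the question's script tag.
const sample = '%7B%22cache%22:%7B%22Co.context.configCtx%22:%7B%22webId%22:%22gmps-salvadore%22%7D%7D%7D';
console.log(extractWebId(sample));
```

In the popup, results[0] from the executeScript callback could be passed to this function instead of calling decodeURI on its own.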

HTML with chinese unicode to png?

I'm trying to render this html document ./tagslegend.html with npm package wkhtmltox:
<!doctype html>
<html>
<head>
<style>
.cmn {
font-family: 'WenQuanYi Micro Hei';
}
</style>
</head>
<body>
<dl>
<dt class="cmn">中文</dt><dd>In mandarin language.</dd>
</dl>
</body>
</html>
Here's the javascript:
const express = require('express');
const fs = require('fs');
const wkhtmltox = require('wkhtmltox');
const app = express();
const converter = new wkhtmltox();
app.get('/tagslegend.png', (request, response) => {
response.status(200).type('png');
converter.image(fs.createReadStream('tagslegend.html'), { format: "png" }).pipe(response);
});
var listener = app.listen(process.env.PORT, function () {
console.log('App listening on port ' + listener.address().port);
});
I expect it to render like my browser would render that same html:
But am instead getting a png like this:
How can I fix this and make it render like the first image?
I have that font installed on the server:
$ fc-list | grep 'Wen'
/app/.fonts/WenQuanYi Micro Hei.ttf: WenQuanYi Micro Hei,文泉驛微米黑,文泉驿微米黑:style=Regular

This looks like a character encoding problem. It seems as if fs.createReadStream() is reading your HTML as ISO-8859-1, when it really should be reading it as UTF-8, which is odd, since UTF-8 is the default encoding.
I'd make sure tagslegend.html is properly saved as a UTF-8 file. It couldn't hurt to explicitly declare:
<meta charset="utf-8">
...in the <head> section of your HTML as well.
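To check whether the file on disk really is UTF-8, its bytes can be run through a strict decoder. This is a sketch under the assumption that the runtime provides TextDecoder with the fatal option (recent Node.js versions and browsers do). isValidUtf8 is a hypothetical helper, and a passing result only shows the bytes are well-formed UTF-8, not that the text is what you intended.

```javascript
// Returns true when the bytes decode as UTF-8 without errors.
// TextDecoder with { fatal: true } throws on malformed input instead of
// silently substituting replacement characters.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

// Usage in Node.js (file name taken from the question):
// const fs = require('fs');
// console.log(isValidUtf8(fs.readFileSync('tagslegend.html')));
```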

Silence net::ERR_CONNECTION_REFUSED

Connecting to a non-existent web socket server results in loud errors being logged to the console, usually to the tune of ... net::ERR_CONNECTION_REFUSED.
Anyone have an idea for a hackaround to silence this output? XMLHttpRequest won't work since it yields the same verbose error output if the server is not reachable.
The goal here is to test if the server is available, if it is then connect to it, otherwise use a fallback, and to do this without spamming the console with error output.

Chrome itself is emitting these messages, and there is no way to block them. This is a function of how chrome was built; whenever a ResourceFetcher object attempts to fetch a resource, its response is passed back to its context, and if there's an error, the browser prints it to the console - see here.
Similar question can be found here.
If you'd like, you can use a chrome console filter as this question discusses to block these errors in your console, but there is no way to programmatically block the messages.

I don't know why you want to suppress this error output. I guess you just want to get rid of it while debugging, so here is a workaround that may be useful for debugging.
Live demo: http://blackmiaool.com/soa/43012334/boot.html
How to use it?
Open the demo page, click the "boot" button, it will open a new tab. Click the "test" button in the new tab and check the result below. If you want to get a positive result, change the url to wss://echo.websocket.org.
Why?
By using postMessage, browser tabs can communicate with each other, so the error output can be moved to a tab we don't care about.
P.S. You can refresh the target page freely without losing the connection between it and the boot page.
P.P.S. You can also use the storage event to achieve this.
boot.html:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<title>boot page</title>
</head>
<body>
<button onclick="boot()">boot</button>
<p>BTW, you can boot the page without the button if you are willing to allow the "pop-up"</p>
<script>
var targetWindow;
function init() {
targetWindow
}
function boot() {
targetWindow = window.open("target.html");
}
boot();
window.addEventListener('message', function(e) {
var msg = e.data;
var {
action,
url,
origin,
} = msg;
if (action === "testUrl") {
let ws = new WebSocket(url);
ws.addEventListener("error", function() {
targetWindow.postMessage({
action: "urlResult",
url,
data: false,
}, origin);
ws.close();
});
ws.addEventListener("open", function() {
targetWindow.postMessage({
action: "urlResult",
url,
data: true,
}, origin);
ws.close();
});
}
});
</script>
</body>
</html>
target.html:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<title>target page</title>
</head>
<body>
<h4>input the url you want to test:</h4>
<textarea type="text" id="input" style="width:300px;height:100px;">
</textarea>
<br>
<div>try <span style="color:red">wss://echo.websocket.org</span> for success result(may be slow)</div>
<button onclick="test()">test</button>
<div id="output"></div>
<script>
var origin = location.origin;
var testUrl = origin.replace(/^https?/, "ws") + "/abcdef"; //not available of course
document.querySelector("#input").value = testUrl;
function output(val) {
document.querySelector("#output").textContent = val;
}
function test() {
if (window.opener) {
window.opener.postMessage({
action: "testUrl",
url: document.querySelector("#input").value,
origin,
}, origin);
} else {
alert("opener is not available");
}
}
window.addEventListener('message', function(e) {
var msg = e.data;
if (msg.action === "urlResult") {
output(`test ${msg.url} result: ${msg.data}`);
}
});
</script>
</body>
</html>
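Both message listeners above accept messages from any sender. If the demo is adapted for real use, a small guard on the event's origin may be worth adding. The following is a sketch only: isTrustedMessage is a hypothetical helper, and it assumes both pages are served from the same origin, as in the demo.

```javascript
// Hypothetical guard: accept a message only if it comes from the expected origin
// and carries one of the actions used in the demo ("testUrl" or "urlResult").
function isTrustedMessage(event, expectedOrigin) {
  const data = event && event.data;
  if (!data || typeof data !== 'object') return false;
  if (event.origin !== expectedOrigin) return false;
  return data.action === 'testUrl' || data.action === 'urlResult';
}

// Usage inside either page's listener (browser only):
// window.addEventListener('message', function (e) {
//   if (!isTrustedMessage(e, location.origin)) return;
//   // ...handle e.data as in the demo...
// });
```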
