Handbook
- Features
- Performance
- Installation
- Usage
- Debug
- Score Rule
- Extract Selectors
- Image Fallback
- Threshold
- Customize Settings
- Output
- Notes
## Features
- Fast and shoots straight.
- High performance, low memory usage
- Automatically reads title and content
- Follows redirects
- Automatically decodes content encodings (avoids garbled text, especially Chinese)
- Gzip/Deflate support
- Proxy support
- Auto-generated User-Agent
- Free and extensible
## Performance
In my case, the spider processes about 700 thousand documents per day (22 million per month). The maximum crawling speed is 450 documents per minute and the average is 80 per minute. Memory usage is about 200 megabytes per spider kernel, and the accuracy is about 90%; the remaining 10% can be fixed by customizing Score Rules or Selectors. It's better than any other readability module.

Server info:
- 20M fibre-optic bandwidth
- 8 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz CPUs
- 32G memory

## Installation
npm install read-art --production

## Usage
read(<html|uri|cheerio> [, options], [callback])

It supports the following arguments:
- `html|uri|cheerio` An HTML string, a URI string, or a Cheerio instance.
- `options` An optional options object, including:
  - `output` The data type of the article content; see Output for more information.
  - `killBreaks` Whether to collapse line breaks, blanks, and tab symbols (`\r\t\n`) into one `<br />`, `true` by default.
  - `betterTitle` Defines how to extract the title (described below).
  - `minTextLength` If the content is shorter than `minTextLength` characters, it is not counted at all, `25` by default.
  - `minParagraphs` A number that determines whether the top candidate is taken as the article candidate, `3` by default. That is, if the `topCandidate` DOM has more than `3` `<p>` children, `topCandidate` is considered the article DOM; otherwise the parent of `topCandidate` is used (but not `<body>`).
  - `tidyAttrs` Whether to remove all attributes on elements, `false` by default.
  - `keepAllLinks` Whether to keep all links, including otherwise useless anchors such as pagination, print, email and so on, `false` by default.
  - `forceDecode` Whether to decode the full text/html with [entities](https://github.com/fb55/entities), `false` by default.
  - `minRelatedDensity` A number (`0~1`) that determines whether the specified `node` (related articles) is removed automatically, `0.8` by default. It only takes effect when `forceRemoveRelated: true`.
  - `forceRemoveRelated` Whether related articles should be removed, `false` by default.
  - `dom` Returns the whole (processed) cheerio DOM when set to `true`, `false` by default. Use `art.dom` to get the DOM object in the callback function (use `$_` to get the original).
  - `damping` The damping factor used to calculate the score of the parent node, `1/2` by default. E.g. if the score of the current document node is `20`, the score of its parent will be `20 * damping`.
  - `scoreRule` Customizes the score rules of each node; one argument is passed into the callback function (described below).
  - `selectors` Customizes the data extraction selectors.
  - `imgFallback` Customizes how the image source is obtained (described below).
  - `thresholdScore` A number or function that determines whether to drop the article content (described below).
  - `thresholdLinkDensity` A `0~1` decimal that determines whether to drop the article content (described below).
  - options from cheerio
  - options from req-fast
- `callback` Fires after the article has been crawled: `callback(error, article, options, response)`. The arguments are:
  - `error` An `Error` object when an exception has been caught.
  - `article` The article object, including `article.title`, `article.content` and `article.html`.
  - `options` The request options.
  - `response` The response of your request, including `response.headers`, `response.redirects`, `response.cookies` and `response.statusCode`.
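
As a small worked example of the `damping` option, the sketch below shows how a node's score would be scaled when it is passed up to its parent. `parentScore` is a hypothetical helper written only for illustration; it is not a read-art function and merely mirrors the `20 * damping` rule stated in the option list.

```javascript
// Hypothetical helper (not part of read-art): applies the documented rule
// "score of parent = score of current node * damping".
function parentScore(nodeScore, damping) {
  // 1/2 is the documented default for the `damping` option.
  var factor = damping === undefined ? 1 / 2 : damping;
  return nodeScore * factor;
}

console.log(parentScore(20));       // uses the default damping
console.log(parentScore(20, 0.25)); // uses a custom damping
```

A smaller `damping` means ancestors inherit less of a child's score, so deeply nested content is less likely to promote its outer containers.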
Head over to the test or examples directory for a complete example.

### Access `Reader` and `Article`
var read = require('read-art');
// read.Reader.extractProp(...args)
// var art = new read.Article([OBJ]);

var read = require('read-art');
// read from google:
read('http://google.com', function(err, art, options, resp){
if(err){
throw err;
}
var title = art.title, // title of article
content = art.content, // content of article
html = art.html; // whole original innerHTML
console.log('[STATUS CODE]', resp && resp.statusCode);
});
// or:
read({
uri: 'http://google.com',
charset: 'utf8'
}, function(err, art, options, resp){
});
// what about html?
read('<title>node-art</title><body><div><p>hello, read-art!</p></div></body>', function(err, art, options, resp){
});
// of course, it could also be a cheerio instance
var cheerio = require('cheerio');
var $ = cheerio.load('<title>node-art</title><body><div><p>hello, read-art!</p></div></body>');
read({
cheerio: $
}, function(err, art, options, resp){
});
read({
uri: '<title>node-art</title><body><div><p>hello, read-art!</p></div></body>'
}, function(err, art, options, resp){
});
/*
Promises
*/
read('<body><p>Hello, read-art</p></body>')
.then((article) => {
console.log(article) // -> object with all the info
})
.catch((err) => {
// Manage error
});

CAUTION: The title must be wrapped in a <title> tag and the content must be wrapped in a <body> tag.

For high availability, see spider2.
## Debug
See the debug module.
- read-art.main
- read-art.article
- read-art.doctype
- read-art.reader

$ DEBUG=read-art.* node app.js

The `betterTitle` option can be one of the following types:
- `Number` The maximum length of the title.
- `Function` Customizes the title via `betterTitle: function(title){}`; the whole title is passed in.
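
A `Function` value for `betterTitle` might look like the sketch below. The name `stripSiteName` and the ` | ` separator are assumptions chosen for illustration, and it is assumed here that read-art uses the function's return value as the final title.

```javascript
// Hypothetical title cleaner: keeps only the part before the first " | ".
function stripSiteName(title) {
  return String(title).split(' | ')[0].trim();
}

// Options object that would be passed as read(uri, options, callback).
var options = { betterTitle: stripSiteName };

console.log(options.betterTitle('Who Am I | Example Site'));
```

Passing a `Number` instead (e.g. `betterTitle: 50`) would only cap the title length, so the function form is the one to reach for when a site appends its own name to every page title.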
## Score Rule
In some situations, we need to customize the score rules to extract the correct article content, for example on BBS and Q&A forums. There are two effective ways to do this:
- `minTextLength` Useful for getting rid of useless elements (`P`/`DIV`), e.g. `minTextLength: 100` discards all blocks whose `node.text().length` is less than `100`.
- `scoreRule` Customizes the score rules manually, e.g. `scoreRule: function(node){ if (node.hasClass('w740')) { return 100; } }`. Elements that have the `w740` class name get `100` bonus points, which makes that `node` the top candidate, so the text of `DIV/P.w740` becomes the content of the current article. The `node` argument is the cheerio object.
read('http://club.autohome.com.cn/bbs/thread-c-66-37239726-1.html', {
minTextLength: 0,
scoreRule: function(node){
if (node.hasClass('w740')) {
return 100;
}
}
}, function(err, art){
});

## Extract Selectors
Sometimes we want to extract the article in a specific way, e.g. pick the text of .article>h3 as the title and .article>.author as the author data:
read({
html: '<title>read-art</title><body><div class="article"><h3 title="--read-art--">Who Am I</h3><p class="section1">hi, dude, i am <b>readability</b></p><p class="section2">aka read-art...</p><small class="author" data-author="Tjatse X">Tjatse</small></div></body>',
selectors: {
title: {
selector: '.article>h3',
extract: ['text', 'title']
},
content: '.article p.section1',
author: {
selector: '.article>small.author',
skipTags: 'a,span',
extract: {
shot_name: 'text',
full_name: 'data-author',
custom_name: function(node, options) {
return options.source + ':' + node.text()
}
}
},
source: {
selector: '.article>.pub_source',
skipTags: false,
extract: function(node, options) {
return options.domain + ':' + node.text()
}
}
},
}, function (err, art) {
// art.title === {text: 'Who Am I', title: '--read-art--'}
// art.content === 'hi, dude, i am <b>readability</b>'
// art.author === {shot_name: 'Tjatse', full_name: 'Tjatse X'}
});

Properties:
- `selector` The query selector, e.g. `#article>.title`, `.articles:nth-child(3)`.
- `extract` The data that you want to extract. It can be one of the following:
  - `Array` Each item should be a prop name, e.g. `text` => `node.text()`, `href` => `node.attr('href')`, `data-src` => `node.attr('data-src')`.
  - `Object` Key-value pairs; the key is a field name, and the value can be:
    - `propName` A property name of `node`.
    - `Function` A callback function with two arguments passed in: `node` and `options`.
- `skipTags` Whether to skip (remove) tags:
  - `Boolean` Skips nothing if this property is set to `false`.
  - `String` Skips the specified tags (supports CSS3 selectors).

Notes: The bound data will be an object or an array (one object per item) if the `extract` option is an array or object. `title` and `content` override the default extraction methods, and the output of `content` depends on the `output` option.
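
The mapping from an `extract` spec to the resulting data can be sketched as follows. This is not read-art's implementation: `stubNode`, `resolveProp` and `extract` are hypothetical names, the stub stands in for a cheerio node, and only the `Array` and `Object` rules listed under Properties are modelled.

```javascript
// Stub standing in for a cheerio node (assumption: only text() and attr() are needed).
function stubNode(text, attrs) {
  return {
    text: function () { return text; },
    attr: function (name) { return attrs[name]; }
  };
}

// 'text' maps to node.text(); any other prop name is read as an attribute.
function resolveProp(node, prop) {
  return prop === 'text' ? node.text() : node.attr(prop);
}

// Hypothetical resolver for the two documented spec shapes.
function extract(node, spec, options) {
  var out = {};
  if (Array.isArray(spec)) {
    // Array: each item is a prop name and becomes a key of the result.
    spec.forEach(function (prop) { out[prop] = resolveProp(node, prop); });
    return out;
  }
  // Object: each key is a field name; the value is a prop name or a callback.
  Object.keys(spec).forEach(function (field) {
    var value = spec[field];
    out[field] = typeof value === 'function' ? value(node, options) : resolveProp(node, value);
  });
  return out;
}

var h3 = stubNode('Who Am I', { title: '--read-art--' });
var titleData = extract(h3, ['text', 'title'], {});

var small = stubNode('Tjatse', { 'data-author': 'Tjatse X' });
var authorData = extract(small, {
  shot_name: 'text',
  full_name: 'data-author',
  custom_name: function (node, options) { return options.source + ':' + node.text(); }
}, { source: 'demo' });

console.log(titleData, authorData);
```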
## Image Fallback
The `imgFallback` option should be one of the following types:
- `Boolean` Falls back to `img.src = (node.data('src') || node.attr('data-src'))` when set to `true`.
- `String` A custom attribute name; `node.attr([imgFallback])` is used as the `src` of `img`.
- `Function` Gives users maximum control over the source attribute of `img`; you can even override the `src`, e.g. `imgFallback: function(node, src){ if (src){ return src; } return node.attr('base') + '/' + node.attr('rel-path'); }`
read({
imgFallback: true
}, function(err, art){});
read({
imgFallback: 'the-src-attr'
}, function(err, art){});
read({
imgFallback: function(node, src){
if (src){
return src;
}
return node.attr('base') + '/' + node.attr('rel-path');
}
}, function(err, art){});

## Threshold
Customize the thresholds for anchors and node scores.
### Score
`thresholdScore` is a threshold number used to decide whether to discard children of the top candidate directly (skipping the deeper tag/text/link density checks). It should be one of the following types:
- `Number` A finite number.
- `Function` Calculates the threshold score yourself. Two arguments are passed in:
  - `node` The top candidate (most likely the article DOM).
  - `scoreKey` The data key used to store the score; you can get the score with `node.data(scoreKey)`.

After read-art finds the top candidate, it starts to analyze the children of the top candidate. If the score of the current child is greater than `thresholdScore`, the child is appended to the article body directly.
The default is `Math.max(10, topCandidate.data(scoreKey) * 0.2)`.
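
To make the default concrete, the sketch below works the formula through with made-up numbers. `defaultThreshold` and `directlyAppended` are hypothetical helpers, and the scores are invented for illustration; only the `Math.max(10, score * 0.2)` formula and the "greater than" comparison come from the text above.

```javascript
// Documented default: Math.max(10, topCandidate.data(scoreKey) * 0.2).
function defaultThreshold(topCandidateScore) {
  return Math.max(10, topCandidateScore * 0.2);
}

// A child is appended directly when its score is greater than the threshold.
function directlyAppended(childScores, threshold) {
  return childScores.filter(function (score) { return score > threshold; });
}

var threshold = defaultThreshold(100);                    // about 20
var kept = directlyAppended([5, 15, 30, 80], threshold);  // scores above the threshold
var floor = defaultThreshold(20);                         // 20 * 0.2 is below 10, so the floor applies

console.log(threshold, kept, floor);
```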
read({
thresholdScore: 20
}, function(err, art){});
read({
thresholdScore: function(node, scoreKey){
return Math.max(10, node.data(scoreKey) * 0.2);
}
}, function(err, art){});

`thresholdLinkDensity` is used to identify whether the current child of the top candidate is a navigation block, an ad, or a related-articles list; it is `0.25` by default. If the text length of the anchors in the current child divided by the text length of the top candidate is greater than `thresholdLinkDensity`, the child is discarded.
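
The link-density check described above can be sketched with plain numbers. `isLinkHeavy` is a hypothetical helper, not a read-art function, and the text lengths are invented; the guard against a zero-length top candidate is also an assumption.

```javascript
// Returns true when a child would be discarded under the documented rule:
// anchor text length / top candidate text length > thresholdLinkDensity.
function isLinkHeavy(anchorTextLength, topCandidateTextLength, thresholdLinkDensity) {
  var threshold = thresholdLinkDensity === undefined ? 0.25 : thresholdLinkDensity;
  if (topCandidateTextLength === 0) {
    return false; // assumption: nothing to compare against
  }
  return anchorTextLength / topCandidateTextLength > threshold;
}

console.log(isLinkHeavy(300, 1000));      // density 0.3 with the default threshold
console.log(isLinkHeavy(100, 1000));      // density 0.1 with the default threshold
console.log(isLinkHeavy(100, 1000, 0.05)); // density 0.1 with a stricter threshold
```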
read({
thresholdLinkDensity: 0.25
}, function(err, art){});

## Customize Settings
We use different regexps to iterate over elements (cheerio objects) and remove undesirable nodes.
read.use(function(){
//[usage]
});

The `[usage]` can be one of the following:
- `this.reset()` Resets the settings to their defaults.
- `this.relatedTags([tags], [override])` Removes related articles by tag name, e.g. `this.relatedTags('dl,table')`. If `[override]` is set to `true`, `relatedTags` will be `"dl,table"`; otherwise the tags are appended to the original, i.e. `ul,ol,dl,table`. If the density of `<a />` in a `node` (queried from the document by `relatedTag`) is greater than or equal to `minRelatedDensity`, this `node` is removed. The density equals `node.find('a').text().length / node.text().length`.
- `this.skipTags([tags], [override])` Removes useless elements by tag name, e.g. `this.skipTags('b,span')`. If `[override]` is set to `true`, `skipTags` will be `"b,span"`; otherwise the tags are appended to the original, i.e. `aside,footer,label,nav,noscript,script,link,meta,style,select,textarea,iframe,b,span`.
- `this.medias([medias], [override])` Appends media DOM elements to the article by tag name, e.g. `this.medias('video,img,embed')`. If `[override]` is set to `true`, `medias` will be `"video,img,embed"`; otherwise the tags are appended to the original. The original tags are `img,video,object,embed`.
- `this.regexps.positive([re], [override])` If the `positive` regexp matches the `id` + `className` of a node, the node is taken as a candidate. `[re]` is a regexp, e.g. `/dv101|dv102/` matches elements like `<div class="dv101">...` or `<div id="dv102">...`. If `[override]` is set to `true`, `positive` will be `/dv101|dv102/i`; otherwise it is appended to the original, i.e. `/article|blog|body|content|entry|main|news|pag(?:e|ination)|post|story|text|dv101|dv102/i`.
- `this.regexps.negative([re], [override])` If the `negative` regexp matches the `id` + `className` of a node, the node is not taken as a candidate. `[re]` is a regexp, e.g. `/dv101|dv102/` matches elements like `<div class="dv101">...` or `<div id="dv102">...`. If `[override]` is set to `true`, `negative` will be `/dv101|dv102/i`; otherwise it is appended to the original, i.e. `/com(?:bx|ment|-)|contact|comment|captcha|foot(?:er|note)?|link|masthead|media|meta|outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|util|shopping|tags|tool|widget|tip|dialog|copyright|bottom|dv101|dv102/i`.
- `this.regexps.uselessAnchors([re], [override])` If the `uselessAnchors` regexp matches the text content of adjacent sibling anchors (`a+a+a`) and `keepAllLinks` is set to `false`, the parent node of the anchors is removed. Normally these are pagination or print/email nodes. `[re]` is a regexp, e.g. `/next\spage|previous\spage/i` matches elements like `<a href="2.htm">Next Page</a>` or `<a href="1.htm">Previous Page</a>`. If `[override]` is set to `true`, `uselessAnchors` will be `/next\spage|previous\spage/i`; otherwise it is appended to the original, i.e. `/(\d+|next|prev|first|last|print|comment|mail|font|about|contact|(下|下|前|后)一|(首|尾)页)|打印|评论|邮件|信箱|转发|关于|联系|^(大|中|小)$|next\spage|previous\spage/i`.
- `this.regexps.unlikely([re], [override])` If the `unlikely` regexp matches the `id` + `className` of a node, the node will probably not be taken as a candidate. `[re]` is a regexp, e.g. `/dv101|dv102/` matches elements like `<div class="dv101">...` or `<div id="dv102">...`. If `[override]` is set to `true`, `unlikely` will be `/dv101|dv102/i`; otherwise it is appended to the original, i.e. `/agegate|auth?or|bookmark|cat|com(?:bx|ment|munity)|date|disqus|extra|foot|header|ignore|link|menu|nav|pag(?:er|ination)|popup|related|remark|rss|share|shoutbox|sidebar|similar|social|sponsor|teaserlist|time|tweet|twitter|\bad[\s_-]?\b|dv101|dv102/i`.
- `this.regexps.maybe([re], [override])` If the `maybe` regexp matches the `id` + `className` of a node, the node will probably be taken as a candidate. `[re]` is a regexp, e.g. `/dv101|dv102/` matches elements like `<div class="dv101">...` or `<div id="dv102">...`. If `[override]` is set to `true`, `maybe` will be `/dv101|dv102/i`; otherwise it is appended to the original, i.e. `/and|article|body|column|main|column|dv101|dv102/i`.
- `this.regexps.div2p([re], [override])` If the `div2p` regexp matches the `id` + `className` of a node, all divs that don't have block-level child elements are turned into p's. `[re]` is a regexp, e.g. `/<(span|label)/` matches elements like `<span>...` or `<label>...`. If `[override]` is set to `true`, `div2p` will be `/<(span|label)/i`; otherwise it is appended to the original, i.e. `/<(a|blockquote|dl|div|img|ol|p|pre|table|ul|span|label)/i`.
- `this.regexps.images([re], [override])` If the `images` regexp matches the `src` attribute of a node, the node is picked as a normal `img`; otherwise it is dropped. `[re]` is a regexp, e.g. `/\.(gif|jpe?g|png)$/i` matches an image whose `src` looks like `/path/to/foo.jpg`. If `[override]` is set to `true`, `readart.regexps.images` is replaced by `[re]`; otherwise it is appended to the original.
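
The append-versus-override behaviour described for the `regexps` setters can be sketched in plain JavaScript. `mergeRegexp` is a hypothetical function, not read-art's code; it assumes that appending means joining the two sources with `|` and keeping the `i` flag, which matches the resulting patterns shown in the list.

```javascript
// Hypothetical merge: replace the original when override is true,
// otherwise join both sources as alternatives. The result is case-insensitive.
function mergeRegexp(original, addition, override) {
  var source = override ? addition.source : original.source + '|' + addition.source;
  return new RegExp(source, 'i');
}

// Original `maybe` pattern, as implied by the appended result in the list.
var maybe = /and|article|body|column|main|column/i;

var appended = mergeRegexp(maybe, /dv101|dv102/, false);
var overridden = mergeRegexp(maybe, /dv101|dv102/, true);

console.log(appended.source);
console.log(overridden.source);
```

With override, class names such as `article` no longer match, so overriding is only appropriate when the built-in alternatives are known to be wrong for the target site.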
read.use(function(){
this.reset();
this.skipTags('b,span');
this.regexps.div2p(/<(span|b)/, true);
});

## Output
You can wrap the article content in different types. It supports text, html, json and cheerio. The `output` option can be:
- `String` One of the types, `html` by default.
- `Object` Key-value pairs including:
  - `type` One of the types.
  - `stripSpaces` Whether to strip the tab symbols (`\r\n\t`), `false` by default.
  - `break` Whether to split the content into paragraphs by `<br />` (only affects JSON output).
The `text` type returns the inner text, e.g.:
read('http://example.com', {
output: 'text'
}, function(err, art){
// art.content will be formatted as TEXT
});
// or
read('http://example.com', {
output: {
type: 'text',
stripSpaces: true
}
}, function(err, art){
// art.content will be formatted as TEXT
});

The `html` type returns the inner HTML, e.g.:
read('http://example.com', {
output: 'html'
}, function(err, art){
// art.content will be formatted as HTML
});
// or
read('http://example.com', {
output: {
type: 'html',
stripSpaces: true
}
}, function(err, art){
// art.content will be formatted as HTML
});

Notes: Videos can now be scraped. The currently supported domains are: youtube|vimeo|youku|tudou|56|letv|iqiyi|sohu|sina|163.

The `json` type returns a RESTful result, e.g.:
read('http://example.com', {
output: 'json'
}, function(err, art){
// art.content will be formatted as JSON
});
// or
read('http://example.com', {
output: {
type: 'json',
stripSpaces: true,
break: true
}
}, function(err, art){
// art.content will be formatted as Array
});

The art.content will be an Array such as:
[
{ "type": "img", "value": "http://example.com/jpg/site1/20140519/00188b1996f214e3a25417.jpg" },
{ "type": "text", "value": "TEXT goes here..." }
]

Until now there are only two types, img and text. The src of an img element is absolute even if the original is a relative one.
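
Consuming the JSON output could look like the sketch below. It operates on a literal array shaped like the sample above rather than on a live `art.content`, and `splitContent` is a hypothetical helper introduced only for illustration.

```javascript
// Sample data shaped like the documented JSON output.
var content = [
  { type: 'img', value: 'http://example.com/jpg/site1/20140519/00188b1996f214e3a25417.jpg' },
  { type: 'text', value: 'TEXT goes here...' }
];

// Separates image sources from text paragraphs, ignoring unknown types.
function splitContent(items) {
  var result = { images: [], text: [] };
  items.forEach(function (item) {
    if (item.type === 'img') {
      result.images.push(item.value);
    } else if (item.type === 'text') {
      result.text.push(item.value);
    }
  });
  return result;
}

var parts = splitContent(content);
console.log(parts.images.length, parts.text.join('\n'));
```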
The `cheerio` type returns the cheerio node, e.g.:
read('http://example.com', {
output: 'cheerio'
}, function(err, art){
// art.content will be a cheerio node
art.content.find('div.what>ul.you>li.need');
});
// or
read('http://example.com', {
output: {
type: 'cheerio',
stripSpaces: true
}
}, function(err, art){
// art.content will be a cheerio node
art.content.find('div.what>ul.you>li.need');
});

Notes: The video sources of these sites are quite different, and it is hard to handle them all in a common way. I haven't found a good way to solve that, so PRs are welcome.

## Notes / Gotchas
Pass the charset manually to avoid garbled text:
read('http://game.163.com/14/0506/10/9RI8M9AO00314SDA.html', {
charset: 'gbk'
}, function(err, art){
// ...
});

Generate a User-Agent to simulate browsers:
read('http://example.com', {
agent: true // true as default
}, function(err, art){
// ...
});

Use a proxy to avoid being blocked:
read('http://example.com', {
proxy: {
host: 'http://myproxy.com/',
port: 8081,
proxyAuth: 'user:password'
}
}, function(err, art){
// ...
});

Run the tests:
npm test
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.