cheerio-soupselect
Adds CSS selector support to htmlparser for scraping activities - port of soupselect (python)
Last updated 13 years ago by mattmueller .
MIT · Repository · Original npm · Tarball · package.json
$ npm install cheerio-soupselect 

node-soupselect

A port of Simon Willison's soupselect for use with node.js and node-htmlparser.

$ npm install soupselect

Minimal example...

var select = require('soupselect').select;
// dom provided by htmlparser...
select(dom, "#main a.article").forEach(function(element) {
    //...
});

I wanted a friendly way to scrape HTML using node.js. I tried jsdom, prompted by this article, but unfortunately jsdom takes a strict view of lax HTML, which makes it unusable for scraping the kind of soup found in real-world web pages. Luckily, htmlparser is more forgiving. More details on this can be found here.

A complete example, including fetching the HTML:

var select = require('soupselect').select,
    htmlparser = require("htmlparser"),
    http = require('http');

// fetch some HTML...
// (http.get is used here because http.createClient and the sys module
// are no longer available in current versions of node.js)
var options = { host: 'www.reddit.com', port: 80, path: '/' };

var request = http.get(options, function (response) {
    response.setEncoding('utf8');

    var body = "";
    response.on('data', function (chunk) {
        body = body + chunk;
    });

    response.on('end', function() {

        // now we have the whole body, parse it and select the nodes we want...
        var handler = new htmlparser.DefaultHandler(function(err, dom) {
            if (err) {
                console.error("Error: " + err);
            } else {

                // soupselect happening here...
                var titles = select(dom, 'a.title');

                console.log("Top stories from reddit");
                titles.forEach(function(title) {
                    console.log("- " + title.children[0].raw + " [" + title.attribs.href + "]\n");
                });
            }
        });

        var parser = new htmlparser.Parser(handler);
        parser.parseComplete(body);
    });
});

// report network failures instead of crashing on an unhandled 'error' event
request.on('error', function (err) {
    console.error("Error: " + err);
});
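
The complete example reads title.children[0].raw and title.attribs.href from each selected node, which only makes sense once you know what an htmlparser node looks like. The sketch below is a rough illustration with no dependencies: the dom array is hand-written sample data that assumes the node fields htmlparser emits (type, name, raw, attribs, children), and findByTagAndClass is a hypothetical helper that mimics what select(dom, 'a.title') would return. It is not soupselect's actual implementation.

```javascript
// Hand-written stand-in for the `dom` array htmlparser passes to the handler.
// The field names are assumed to match htmlparser's output; the content is made up.
var dom = [{
    type: 'tag', name: 'div', raw: 'div id="main"', attribs: { id: 'main' },
    children: [
        {
            type: 'tag', name: 'a', raw: 'a class="title" href="/first"',
            attribs: { 'class': 'title', href: '/first' },
            children: [{ type: 'text', raw: 'First story', data: 'First story' }]
        },
        {
            type: 'tag', name: 'a', raw: 'a href="/about"',
            attribs: { href: '/about' },
            children: [{ type: 'text', raw: 'About', data: 'About' }]
        }
    ]
}];

// Hypothetical helper: depth-first search for tags named tagName that carry className.
// Roughly equivalent to select(dom, 'a.title'), but NOT soupselect's own code.
function findByTagAndClass(nodes, tagName, className) {
    var matches = [];
    (nodes || []).forEach(function (node) {
        if (node.type !== 'tag') { return; }
        var classes = ((node.attribs && node.attribs['class']) || '').split(/\s+/);
        if (node.name === tagName && classes.indexOf(className) !== -1) {
            matches.push(node);
        }
        // keep descending so nested matches are found too
        matches = matches.concat(findByTagAndClass(node.children, tagName, className));
    });
    return matches;
}

var titles = findByTagAndClass(dom, 'a', 'title');
titles.forEach(function (title) {
    console.log("- " + title.children[0].raw + " [" + title.attribs.href + "]");
});
```

Only the first link carries the title class, so only that node is returned. Its first child is the text node holding the link text, and attribs holds the parsed attributes.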

Notes:

  • Requires node-htmlparser > 1.6.2 & node.js 2+
  • Calls to select are synchronous - not worth trying to make it asynchronous IMO given the use case

Current Tags

  • 0.1.1                                ...           latest (13 years ago)

5 Versions

  • 0.1.1                                ...           13 years ago
  • 0.1.0                                ...           13 years ago
  • 0.0.3                                ...           13 years ago
  • 0.0.2                                ...           13 years ago
  • 0.0.1                                ...           13 years ago
