Trying out Diggin_Scraper

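I've been curious about Diggin_Scraper ever since I heard a talk about it at a PHP Kansai study meeting, so I finally gave it a try. What follows is my raw work log.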

First, the install.

$ sudo pear install http://diggin.musicrider.com/Diggin.tgz
downloading Diggin.tgz ...
Starting to download Diggin.tgz (220,588 bytes)
................done: 220,588 bytes
downloading Diggin_Scraper_Adapter_Htmlscraping.tgz ...
Starting to download Diggin_Scraper_Adapter_Htmlscraping.tgz (4,546 bytes)
...done: 4,546 bytes
__uri/Diggin_Scraper_Adapter_Htmlscraping requires PHP extension "tidy"
__uri/Diggin requires package "http://diggin.musicrider.com/Diggin_Scraper_Adapter_Htmlscraping"
No valid packages found
install failed

Looks like Diggin_Scraper_Adapter_Htmlscraping is missing?

$ sudo pear install openpear/Diggin_Scraper_Adapter_Htmlscraping-beta
downloading Diggin_Scraper_Adapter_Htmlscraping-0.3.3.tgz ...
Starting to download Diggin_Scraper_Adapter_Htmlscraping-0.3.3.tgz (6,750 bytes)
.....done: 6,750 bytes
install ok: channel://openpear.org/Diggin_Scraper_Adapter_Htmlscraping-0.3.3
$ sudo pear install http://diggin.musicrider.com/Diggin.tgz
downloading Diggin.tgz ...
Starting to download Diggin.tgz (220,588 bytes)
................done: 220,588 bytes
downloading Diggin_Scraper_Adapter_Htmlscraping.tgz ...
Starting to download Diggin_Scraper_Adapter_Htmlscraping.tgz (4,546 bytes)
...done: 4,546 bytes
__uri/Diggin_Scraper_Adapter_Htmlscraping requires PHP extension "tidy"
__uri/Diggin requires package "http://diggin.musicrider.com/Diggin_Scraper_Adapter_Htmlscraping"
No valid packages found
install failed

No change. Maybe I just need to install tidy?

$ sudo pecl install -a tidy
downloading tidy-1.2.tgz ...
Starting to download tidy-1.2.tgz (9,602 bytes)
.....done: 9,602 bytes
3 source files, building
running: phpize
Configuring for:
PHP Api Version:         20041225
Zend Module Api No:      20060613
Zend Extension Api No:   220060519
 1. Tidy library installation dir? : autodetect

1-1, 'all', 'abort', or Enter to continue:
building in /var/tmp/pear-build-root/tidy-1.2
running: /tmp/pear/download/tidy-1.2/configure --with-tidy
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for a sed that does not truncate output... /bin/sed
checking for gcc... gcc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking whether gcc and cc understand -c and -o together... yes
checking for system library directory... lib
checking if compiler supports -R... no
checking if compiler supports -Wl,-rpath,... yes
checking build system type... i686-pc-linux-gnu
checking host system type... i686-pc-linux-gnu
checking target system type... i686-pc-linux-gnu
checking for PHP prefix... /usr
checking for PHP includes... -I/usr/include/php5 -I/usr/include/php5/main -I/usr/include/php5/TSRM -I/usr/include/php5/Zend -I/usr/include/php5/ext -I/usr/include/php5/ext/date/lib -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64
checking for PHP extension directory... /usr/lib/php5/20060613+lfs
checking for PHP installed headers prefix... /usr/include/php5
checking for re2c... re2c
checking for re2c version... 0.13.3 (ok)
checking for gawk... no
checking for nawk... nawk
checking if nawk is broken... no
checking for TIDY support... yes, shared
configure: error: Cannot find libtidy
ERROR: `/tmp/pear/download/tidy-1.2/configure --with-tidy' failed

The install from pecl failed: configure could not find libtidy. I'll look for a distribution package instead.

$ aptitude search tidy | grep php
p   php5-tidy                       - tidy module for php5

$ sudo aptitude install php5-tidy
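
Before retrying pear, it can be worth confirming that the CLI actually sees the extension. Here is a minimal sketch, assuming `php -m` prints one module name per line; `ext_loaded` is a helper name made up for this post, not part of PHP or PEAR.

```shell
# Read a module list (one name per line, as printed by `php -m`) on stdin
# and succeed if the named extension appears in it, ignoring case.
ext_loaded() {
    grep -qixF -- "$1"
}

# Usage (assumes the php CLI is on PATH):
#   php -m | ext_loaded tidy && echo "tidy is loaded"
```

Keeping the check as a filter on stdin means it can be tried against a saved module list as well as against the live `php -m` output.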

Try the install once more.

$ sudo pear install http://diggin.musicrider.com/Diggin.tgz
downloading Diggin.tgz ...
Starting to download Diggin.tgz (220,588 bytes)
..............................................done: 220,588 bytes
downloading Diggin_Scraper_Adapter_Htmlscraping.tgz ...
Starting to download Diggin_Scraper_Adapter_Htmlscraping.tgz (4,546 bytes)
...done: 4,546 bytes
install ok: channel://__uri/Diggin_Scraper_Adapter_Htmlscraping-0.2.1
ERROR: __uri/Diggin: conflicting files found:
           Diggin/Http/Exception.php (openpear.org/diggin_scraper_adapter_htmlscraping)
Diggin/Scraper/Adapter/Exception.php (openpear.org/diggin_scraper_adapter_htmlscraping)
        Diggin/Scraper/Exception.php (openpear.org/diggin_scraper_adapter_htmlscraping)
                Diggin/Exception.php (openpear.org/diggin_scraper_adapter_htmlscraping)

Yikes, an error I've never seen before! Was it a mistake to install Diggin_Scraper_Adapter_Htmlscraping separately?

$ sudo pear uninstall openpear/diggin_scraper_adapter_htmlscraping
uninstall ok: channel://openpear.org/Diggin_Scraper_Adapter_Htmlscraping-0.3.3
$ sudo pear install http://diggin.musicrider.com/Diggin.tgz
downloading Diggin.tgz ...
Starting to download Diggin.tgz (220,588 bytes)
.........................................done: 220,588 bytes
install ok: channel://__uri/Diggin-0.5.1


A sample.

<?php
require_once 'Diggin/Scraper.php';

try {
    $scraper = new Diggin_Scraper();
    // For the element matching a.title, collect its text as "title" and its href attribute as "url".
    $scraper->process('a.title', "title => 'TEXT'", "url => '@href'")
            ->scrape('http://d.hatena.ne.jp/keyword/%BA%B0%CC%EE%A4%A2%A4%B5%C8%FE');
    print_r($scraper->results);
} catch (Exception $e) {
    die($e);
}

Run it.

$ php sample.php
Warning: Diggin_Scraper::require_once(Zend/Http/Client.php): failed to open stream: No such file or directory in /usr/share/php/Diggin/Scraper.php on line 12

I hadn't installed Zend Framework.
There is an Ubuntu package for it, so I'll install from there, although I'm a little worried that the version looks old.

$ aptitude search zend
p   libzend-framework-php                                                     - a simple, straightforward, open-source software framework for PHP 5
p   zend-framework                                                            - a simple, straightforward, open-source software framework for PHP 5

$ sudo aptitude install zend-framework

It was installed under /usr/share/php/libzend-framework-php, so I added that directory to include_path in /etc/php5/cli/php.ini.
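
For a one-off test like this, the include path can also be overridden for a single run with the CLI's `-d` option, which leaves php.ini untouched. A rough sketch follows; `with_include_dir` is a hypothetical helper, and the base path `.:/usr/share/php` in the usage line is an assumed default, not something taken from this machine.

```shell
# Print an include_path with one extra directory appended, unless that
# directory is already in the list. Only strings are handled; no file
# is read or modified.
with_include_dir() {
    current=$1
    extra=$2
    case ":$current:" in
        *":$extra:"*) printf '%s\n' "$current" ;;
        *)            printf '%s\n' "$current:$extra" ;;
    esac
}

# Usage (assumes the php CLI is on PATH):
#   php -d include_path="$(with_include_dir '.:/usr/share/php' /usr/share/php/libzend-framework-php)" sample.php
```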

Run it.

$ php sample.php
Warning: require_once(Zend/Dom/Query/Css2Xpath.php): failed to open stream: No such file or directory in /usr/share/php/Diggin/Scraper/Strategy/Flexible.php on line 18

So the version was too old after all.

$ sudo aptitude remove zend-framework

I also put php.ini back the way it was.

Download the latest version (1.7.1) from the download page and install it.

$ wget http://framework.zend.com/releases/ZendFramework-1.7.1/ZendFramework-1.7.1-minimal.tar.gz
$ tar xvfz ZendFramework-1.7.1-minimal.tar.gz
$ sudo mv ZendFramework-1.7.1-minimal/library/Zend /usr/share/php/

Run it.

$ php sample.php
Warning: Diggin_Scraper_Strategy_Flexible::require_once(Diggin/Scraper/Adapter/Htmlscraping.php): failed to open stream: No such file or directory in /usr/share/php/Diggin/Scraper/Strategy/Flexible.php on line 57

So Diggin_Scraper_Adapter_Htmlscraping is needed after all!

$ sudo pear install openpear/Diggin_Scraper_Adapter_Htmlscraping-beta

Run it.

$ php sample.php
Warning: Diggin_Uri_Http::require_once(Net/URL2.php): failed to open stream: No such file or directory in /usr/share/php/Diggin/Uri/Http.php on line 54

...

$ sudo pear install -a Net_URL2-beta

Run it.

$ php sample.php
Array
(
    [title] => 紺野あさ美
    [url] => http://d.hatena.ne.jp/keyword/%ba%b0%cc%ee%a4%a2%a4%b5%c8%fe
)

It works!
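
The four "failed to open stream" warnings above each named one missing file, so they could probably have been caught in a single pass before running the sample. A minimal sketch, assuming the file list is exactly the set of paths that appeared in those warnings; `on_include_path` and `missing_deps` are names invented here.

```shell
# Succeed if the relative file $2 exists under any directory in the
# colon-separated path list $1.
on_include_path() {
    old_ifs=$IFS
    IFS=:
    for dir in $1; do
        if [ -f "$dir/$2" ]; then
            IFS=$old_ifs
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Print one "missing: ..." line for each required file that cannot be
# found. The list is copied from the warnings in this log.
missing_deps() {
    for f in Zend/Http/Client.php Zend/Dom/Query/Css2Xpath.php \
             Diggin/Scraper/Adapter/Htmlscraping.php Net/URL2.php; do
        on_include_path "$1" "$f" || printf 'missing: %s\n' "$f"
    done
}

# Usage (assumes the php CLI is on PATH):
#   missing_deps "$(php -r 'echo get_include_path();')"
```

This only checks that the files exist. It says nothing about whether the installed versions are new enough, which was the actual problem with the packaged Zend Framework.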


I ran into trouble installing Diggin_Scraper_Adapter_Htmlscraping, but looking back over the work log, I think the correct procedure was to install php5-tidy first and then install Diggin.
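
Collected from the log, the order I believe is right looks roughly like the sketch below. It is a guess at a clean procedure, not something verified on a fresh machine. The `run` wrapper and the `DRY_RUN` variable are my own additions: by default the commands are only printed. Zend Framework 1.7.1 still has to be placed under /usr/share/php by hand, as described earlier.

```shell
# Print the command when DRY_RUN is 1 (the default); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Install order taken from this log: the tidy extension first, then
# Diggin (which pulls in its Htmlscraping adapter), then Net_URL2.
install_diggin() {
    run sudo aptitude install php5-tidy
    run sudo pear install http://diggin.musicrider.com/Diggin.tgz
    run sudo pear install -a Net_URL2-beta
}

# Usage:
#   install_diggin              # only prints the three commands
#   DRY_RUN=0; install_diggin   # actually runs them
```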

I'll uninstall everything once and then install it all again.

$ sudo pear uninstall __uri/Diggin_Scraper_Adapter_Htmlscraping
Notice: Undefined index:  channel in PEAR/Dependency2.php on line 910
uninstall ok: channel://__uri/Diggin_Scraper_Adapter_Htmlscraping-0.2.1
$ sudo pear uninstall openpear/Diggin_Scraper_Adapter_Htmlscraping
uninstall ok: channel://openpear.org/Diggin_Scraper_Adapter_Htmlscraping-0.3.3
$ sudo pear uninstall __uri/Diggin
uninstall ok: channel://__uri/Diggin-0.5.1
$ sudo pear install http://diggin.musicrider.com/Diggin.tgz
downloading Diggin.tgz ...
Starting to download Diggin.tgz (220,588 bytes)
.........................................done: 220,588 bytes
downloading Diggin_Scraper_Adapter_Htmlscraping.tgz ...
Starting to download Diggin_Scraper_Adapter_Htmlscraping.tgz (4,546 bytes)
...done: 4,546 bytes
install ok: channel://__uri/Diggin_Scraper_Adapter_Htmlscraping-0.2.1
install ok: channel://__uri/Diggin-0.5.1


Run it.

$ php sample.php
Array
(
    [title] => 紺野あさ美
    [url] => http://d.hatena.ne.jp/keyword/%ba%b0%cc%ee%a4%a2%a4%b5%c8%fe
)

OK, all good.