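Trying out Diggin_Scraper
I had been curious about Diggin_Scraper ever since I heard a talk about it at the PHP Kansai study group, so I finally tried it. What follows is my raw work log.
Installation.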
$ sudo pear install http://diggin.musicrider.com/Diggin.tgz
downloading Diggin.tgz ...
Starting to download Diggin.tgz (220,588 bytes)
................done: 220,588 bytes
downloading Diggin_Scraper_Adapter_Htmlscraping.tgz ...
Starting to download Diggin_Scraper_Adapter_Htmlscraping.tgz (4,546 bytes)
...done: 4,546 bytes
__uri/Diggin_Scraper_Adapter_Htmlscraping requires PHP extension "tidy"
__uri/Diggin requires package "http://diggin.musicrider.com/Diggin_Scraper_Adapter_Htmlscraping"
No valid packages found
install failed
Looks like Diggin_Scraper_Adapter_Htmlscraping is missing?
$ sudo pear install openpear/Diggin_Scraper_Adapter_Htmlscraping-beta
downloading Diggin_Scraper_Adapter_Htmlscraping-0.3.3.tgz ...
Starting to download Diggin_Scraper_Adapter_Htmlscraping-0.3.3.tgz (6,750 bytes)
.....done: 6,750 bytes
install ok: channel://openpear.org/Diggin_Scraper_Adapter_Htmlscraping-0.3.3
$ sudo pear install http://diggin.musicrider.com/Diggin.tgz
downloading Diggin.tgz ...
Starting to download Diggin.tgz (220,588 bytes)
................done: 220,588 bytes
downloading Diggin_Scraper_Adapter_Htmlscraping.tgz ...
Starting to download Diggin_Scraper_Adapter_Htmlscraping.tgz (4,546 bytes)
...done: 4,546 bytes
__uri/Diggin_Scraper_Adapter_Htmlscraping requires PHP extension "tidy"
__uri/Diggin requires package "http://diggin.musicrider.com/Diggin_Scraper_Adapter_Htmlscraping"
No valid packages found
install failed
No change. Maybe I need to install tidy?
$ sudo pecl install -a tidy
downloading tidy-1.2.tgz ...
Starting to download tidy-1.2.tgz (9,602 bytes)
.....done: 9,602 bytes
3 source files, building
running: phpize
Configuring for:
PHP Api Version: 20041225
Zend Module Api No: 20060613
Zend Extension Api No: 220060519
1. Tidy library installation dir? : autodetect
1-1, 'all', 'abort', or Enter to continue:
building in /var/tmp/pear-build-root/tidy-1.2
running: /tmp/pear/download/tidy-1.2/configure --with-tidy
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for a sed that does not truncate output... /bin/sed
checking for gcc... gcc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking whether gcc and cc understand -c and -o together... yes
checking for system library directory... lib
checking if compiler supports -R... no
checking if compiler supports -Wl,-rpath,... yes
checking build system type... i686-pc-linux-gnu
checking host system type... i686-pc-linux-gnu
checking target system type... i686-pc-linux-gnu
checking for PHP prefix... /usr
checking for PHP includes... -I/usr/include/php5 -I/usr/include/php5/main -I/usr/include/php5/TSRM -I/usr/include/php5/Zend -I/usr/include/php5/ext -I/usr/include/php5/ext/date/lib -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64
checking for PHP extension directory... /usr/lib/php5/20060613+lfs
checking for PHP installed headers prefix... /usr/include/php5
checking for re2c... re2c
checking for re2c version... 0.13.3 (ok)
checking for gawk... no
checking for nawk... nawk
checking if nawk is broken... no
checking for TIDY support... yes, shared
configure: error: Cannot find libtidy
ERROR: `/tmp/pear/download/tidy-1.2/configure --with-tidy' failed
The pecl install failed because configure could not find libtidy. I'll look for a distribution package instead.
$ aptitude search tidy | grep php
p php5-tidy - tidy module for php5
$ sudo aptitude install php5-tidy
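Before retrying the pear install, it may be worth confirming that the tidy extension is actually visible to the PHP CLI. The following is a minimal sketch, not something from the original log: `has_php_module` is a hypothetical helper name, and it only assumes that `php -m` prints one module name per line.

```shell
#!/bin/sh
# Hypothetical helper (not part of PHP or PEAR): succeeds if the module
# named in $1 appears as a whole line in a `php -m` style list on stdin.
has_php_module() {
    # -q: quiet, -i: ignore case, -x: match the whole line, -F: fixed string
    grep -qixF -- "$1"
}

# Usage, assuming the php CLI is on PATH:
#   php -m | has_php_module tidy && echo "tidy is loaded"
```

If the check fails right after installing php5-tidy, the package's ini file may not be picked up by the CLI configuration, which would be the next thing to look at.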
Install once more.
$ sudo pear install http://diggin.musicrider.com/Diggin.tgz
downloading Diggin.tgz ...
Starting to download Diggin.tgz (220,588 bytes)
..............................................done: 220,588 bytes
downloading Diggin_Scraper_Adapter_Htmlscraping.tgz ...
Starting to download Diggin_Scraper_Adapter_Htmlscraping.tgz (4,546 bytes)
...done: 4,546 bytes
install ok: channel://__uri/Diggin_Scraper_Adapter_Htmlscraping-0.2.1
ERROR: __uri/Diggin: conflicting files found:
Diggin/Http/Exception.php (openpear.org/diggin_scraper_adapter_htmlscraping)
Diggin/Scraper/Adapter/Exception.php (openpear.org/diggin_scraper_adapter_htmlscraping)
Diggin/Scraper/Exception.php (openpear.org/diggin_scraper_adapter_htmlscraping)
Diggin/Exception.php (openpear.org/diggin_scraper_adapter_htmlscraping)
Argh, an error I've never seen before! Was it a mistake to install Diggin_Scraper_Adapter_Htmlscraping separately?
$ sudo pear uninstall openpear/diggin_scraper_adapter_htmlscraping
uninstall ok: channel://openpear.org/Diggin_Scraper_Adapter_Htmlscraping-0.3.3
$ sudo pear install http://diggin.musicrider.com/Diggin.tgz
downloading Diggin.tgz ...
Starting to download Diggin.tgz (220,588 bytes)
.........................................done: 220,588 bytes
install ok: channel://__uri/Diggin-0.5.1
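Sample script (sample.php).
<?php
require_once 'Diggin/Scraper.php';

try {
    $scraper = new Diggin_Scraper();
    // For elements matching a.title, take the text as "title" and the href attribute as "url".
    $scraper->process('a.title', "title => 'TEXT'", "url => '@href'")
            ->scrape('http://d.hatena.ne.jp/keyword/%BA%B0%CC%EE%A4%A2%A4%B5%C8%FE');
    print_r($scraper->results);
} catch (Exception $e) {
    // Print the exception and stop.
    die($e);
}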
Run it.
$ php sample.php
Warning: Diggin_Scraper::require_once(Zend/Http/Client.php): failed to open stream: No such file or directory in /usr/share/php/Diggin/Scraper.php on line 12
I hadn't installed Zend Framework.
There is an Ubuntu package for it, so I'll install from that, although the packaged version looks old, which worries me.
$ aptitude search zend
p libzend-framework-php - a simple, straightforward, open-source software framework for PHP 5
p zend-framework - a simple, straightforward, open-source software framework for PHP 5
$ sudo aptitude install zend-framework
It was installed under /usr/share/php/libzend-framework-php, so I added that directory to include_path in /etc/php5/cli/php.ini.
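To see whether an include_path change actually makes a given file reachable, something like the following sketch could help. `find_on_include_path` is a hypothetical helper written for this log, not a PHP or PEAR tool; it only assumes the Unix convention that include_path entries are separated by colons.

```shell
#!/bin/sh
# Hypothetical helper: print the first directory in a colon-separated
# include_path ($1) that contains the relative file ($2).
# Returns non-zero if no directory contains it.
find_on_include_path() {
    include_path=$1
    relative_file=$2
    old_ifs=$IFS
    IFS=:                       # split the path list on colons
    for dir in $include_path; do
        if [ -f "$dir/$relative_file" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Usage, assuming the php CLI is on PATH:
#   find_on_include_path "$(php -r 'echo get_include_path();')" Zend/Http/Client.php
```

The same call with a different relative file name shows which directory, if any, would satisfy a given require_once.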
Run it.
$ php sample.php
Warning: require_once(Zend/Dom/Query/Css2Xpath.php): failed to open stream: No such file or directory in /usr/share/php/Diggin/Scraper/Strategy/Flexible.php on line 18
As I suspected, the packaged version was too old.
$ sudo aptitude remove zend-framework
I also reverted php.ini.
Download the latest version (1.7.1) from the download page and install it.
$ wget http://framework.zend.com/releases/ZendFramework-1.7.1/ZendFramework-1.7.1-minimal.tar.gz
$ tar xvfz ZendFramework-1.7.1-minimal.tar.gz
$ sudo mv ZendFramework-1.7.1-minimal/library/Zend /usr/share/php/
Run it.
$ php sample.php
Warning: Diggin_Scraper_Strategy_Flexible::require_once(Diggin/Scraper/Adapter/Htmlscraping.php): failed to open stream: No such file or directory in /usr/share/php/Diggin/Scraper/Strategy/Flexible.php on line 57
So Diggin_Scraper_Adapter_Htmlscraping is needed after all!
$ sudo pear install openpear/Diggin_Scraper_Adapter_Htmlscraping-beta
Run it.
$ php sample.php
Warning: Diggin_Uri_Http::require_once(Net/URL2.php): failed to open stream: No such file or directory in /usr/share/php/Diggin/Uri/Http.php on line 54
...
$ sudo pear install -a Net_URL2-beta
Run it.
$ php sample.php
Array
(
    [title] => 紺野あさ美
    [url] => http://d.hatena.ne.jp/keyword/%ba%b0%cc%ee%a4%a2%a4%b5%c8%fe
)
It works!
I had trouble installing Diggin_Scraper_Adapter_Htmlscraping, but rereading this log, I think the correct order would have been to install php5-tidy first and then install Diggin.
Let me uninstall everything and install it again.
$ sudo pear uninstall __uri/Diggin_Scraper_Adapter_Htmlscraping
Notice: Undefined index: channel in PEAR/Dependency2.php on line 910
uninstall ok: channel://__uri/Diggin_Scraper_Adapter_Htmlscraping-0.2.1
$ sudo pear uninstall openpear/Diggin_Scraper_Adapter_Htmlscraping
uninstall ok: channel://openpear.org/Diggin_Scraper_Adapter_Htmlscraping-0.3.3
$ sudo pear uninstall __uri/Diggin
uninstall ok: channel://__uri/Diggin-0.5.1
$ sudo pear install http://diggin.musicrider.com/Diggin.tgz
downloading Diggin.tgz ...
Starting to download Diggin.tgz (220,588 bytes)
.........................................done: 220,588 bytes
downloading Diggin_Scraper_Adapter_Htmlscraping.tgz ...
Starting to download Diggin_Scraper_Adapter_Htmlscraping.tgz (4,546 bytes)
...done: 4,546 bytes
install ok: channel://__uri/Diggin_Scraper_Adapter_Htmlscraping-0.2.1
install ok: channel://__uri/Diggin-0.5.1
Run it.
$ php sample.php
Array
(
    [title] => 紺野あさ美
    [url] => http://d.hatena.ne.jp/keyword/%ba%b0%cc%ee%a4%a2%a4%b5%c8%fe
)
OK, all good.
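For reference, the commands that ended up mattering can be collected in a dependency-first order. This is only a sketch of how I would arrange them next time: every command is copied from the log above, but the ordering of Zend Framework and Net_URL2 relative to Diggin is my own assumption (the log only suggests that php5-tidy should come before Diggin), and `print_install_steps` is a hypothetical helper that prints the steps without running anything.

```shell
#!/bin/sh
# Hypothetical helper: print the install commands from this log in a
# dependency-first order. Nothing is executed; review and run by hand.
print_install_steps() {
    cat <<'EOF'
sudo aptitude install php5-tidy
wget http://framework.zend.com/releases/ZendFramework-1.7.1/ZendFramework-1.7.1-minimal.tar.gz
tar xvfz ZendFramework-1.7.1-minimal.tar.gz
sudo mv ZendFramework-1.7.1-minimal/library/Zend /usr/share/php/
sudo pear install -a Net_URL2-beta
sudo pear install http://diggin.musicrider.com/Diggin.tgz
EOF
}

# Usage:
#   print_install_steps
```

Printing rather than executing keeps the sketch safe to try on a machine where some of these packages are already installed.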