Trouble getting Diggin_Scraper to process SoftBank's site (solved, more or less)

I wanted to scrape SoftBank's handset list with Diggin_Scraper, but the usual approach did not work. After a fair amount of trial and error, I ended up with the following.

<?php
require_once 'Diggin/Scraper.php';
require_once 'Diggin/Scraper/Adapter/Htmlscraping.php';
require_once 'Zend/Http/Response.php';

class Mobile_Profile_Adapter_Softbank_Attrstrip extends Diggin_Scraper_Adapter_Htmlscraping
{
    public function readData($response)
    {
        // Strip the dir attribute from the <html> tag
        $body = $response->getBody();
        $body = str_replace('dir="ltr"', '', $body);

        // getBody() has already decoded the body, so treat it as not transfer-encoded
        $headers = $response->getHeaders();
        if (isset($headers['Transfer-encoding'])) {
            unset($headers['Transfer-encoding']);
        }

        $_response = new Zend_Http_Response(
            $response->getStatus(),
            $headers,
            $body
        );

        return parent::readData($_response);
    }
}

try {
    $url = 'http://creation.mb.softbank.jp/terminal/?lup=y&cat=http';

    $scraper = new Diggin_Scraper();
    $scraper->changeStrategy('Diggin_Scraper_Strategy_Flexible', new Mobile_Profile_Adapter_Softbank_Attrstrip());
    $scraper->process('//tr[@bgcolor="#FFFFFF"]/td[1]', 'model[] => TEXT')
            ->scrape($url);
    print_r($scraper->results);
} catch (Exception $e) {
    echo $e.PHP_EOL;
}
/*
Array
(
    [model] => Array
        (
            [0] => 831T
            [1] => 830T
            [2] => 931SH
            [3] => 930SC
            [4] => 930SH
            [5] => 830CA
            [6] => 830P
            [7] => 830SH s
            ......
            [183] => 304T
        )

)
*/

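The custom adapter tweaks the response before it is parsed. The rest of this post is a record of the trial and error that led there.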

First, I tried the straightforward way.

<?php
require_once 'Diggin/Scraper.php';

try {
    $url = 'http://creation.mb.softbank.jp/terminal/?lup=y&cat=http';

    $scraper = new Diggin_Scraper();
    $scraper->process('//tr[@bgcolor="#FFFFFF"]/td[1]', 'model[] => TEXT')
            ->scrape($url);
    print_r($scraper->results);
} catch (Exception $e) {
    echo $e.PHP_EOL;
}

Run it.

$ php sample.php
exception 'Diggin_Scraper_Strategy_Exception' with message 'Couldn't find By Xpath, Process : './/tr[@bgcolor="#FFFFFF"]/td[1]', 'model => "TEXT"'' in /usr/share/php/Diggin/Scraper/Strategy/Flexible.php:96
Stack trace:
#0 /usr/share/php/Diggin/Scraper/Strategy/Flexible.php(76): Diggin_Scraper_Strategy_Flexible->extract(Object(SimpleXMLElement), Object(Diggin_Scraper_Process))
#1 /usr/share/php/Diggin/Scraper/Strategy/Abstract.php(44): Diggin_Scraper_Strategy_Flexible->scrape(Object(Zend_Http_Response), Object(Diggin_Scraper_Process))
#2 /usr/share/php/Diggin/Scraper/Context.php(32): Diggin_Scraper_Strategy_Abstract->scrapedData(Object(Diggin_Scraper_Process))
#3 /usr/share/php/Diggin/Scraper/Strategy/Abstract.php(74): Diggin_Scraper_Context->scrape(Object(Diggin_Scraper_Process))
#4 /usr/share/php/Diggin/Scraper.php(344): Diggin_Scraper_Strategy_Abstract->getValues(Object(Diggin_Scraper_Context), Object(Diggin_Scraper_Process))
#5 /home/okonomi/sample.php(36): Diggin_Scraper->scrape('http://creation...')
#6 {main}

It fails. Changing the XPath expression to something that must exist, such as "//body", gives the same result. I had no idea why, so I traced the processing.

According to the error, something goes wrong at line 96 of Diggin/Scraper/Strategy/Flexible.php (Diggin_Scraper_Strategy_Flexible::extract()).

<?php
    public function extract($values, $process)
    {
        // ↓ This handling is for when the XPath expression itself is malformed (maybe unnecessary?)
        //set_error_handler(
        //    create_function('$errno, $errstr',
        //    'if($errno) require_once "Diggin/Scraper/Strategy/Exception.php"; 
        //       throw new Diggin_Scraper_Strategy_Exception($errstr, $errno);'
        //    )
        //);

        $results = (array) $values->xpath(self::_xpathOrCss2Xpath($process->expression));
        //restore_error_handler();

        if (count($results) === 0 or ($results[0] === false)) {
            require_once 'Diggin/Scraper/Strategy/Exception.php';
            
            $process->expression = self::_xpathOrCss2Xpath($process->expression);
            throw new Diggin_Scraper_Strategy_Exception("Couldn't find By Xpath, Process : $process");
        }
        
        return $results;
    }

It takes a SimpleXMLElement and a Diggin_Scraper_Process, and throws an exception when $values->xpath() returns nothing usable (an empty result, or false).
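To see why the check is written as "count is zero, or the first element is false", here is a minimal sketch of how the (array) cast shapes the possible return values of xpath(). The sample document and variable names are made up for illustration; this is not Diggin code.

```php
<?php
// Hypothetical tiny document, standing in for the parsed page.
$xml = new SimpleXMLElement('<html><body>aaa</body></html>');

// A valid expression that matches: one SimpleXMLElement in the array.
$hit = (array) $xml->xpath('//body');

// A valid expression that matches nothing: an empty array.
$miss = (array) $xml->xpath('//table');

// xpath() can also return false on failure. Casting false to an array
// yields a one-element array, which is what $results[0] === false catches.
$failed = (array) false;

var_dump(count($hit));          // number of matches for //body
var_dump(count($miss));         // number of matches for //table
var_dump($failed[0] === false); // whether the cast preserved the false
```

So an empty array is all it takes to trigger the "Couldn't find By Xpath" exception; the expression does not have to be invalid.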

The SimpleXMLElement passed to extract() is obtained in Diggin_Scraper_Strategy_Flexible::scrape().

<?php
    public function scrape($respose, $process)
    {
        $simplexml = $this->getAdapter()->readData($respose);
        
        return self::extract($simplexml, $process);
    }

The return value of $this->getAdapter()->readData() is the SimpleXMLElement.
$this->getAdapter() returns an instance of a class that implements Diggin_Scraper_Adapter_Interface (Diggin_Scraper_Adapter_Htmlscraping by default).

Next up is Diggin_Scraper_Adapter_Htmlscraping::readData().

<?php
    public function readData($response)
    {
        return $this->getXmlObject($response);
    }

And then Diggin_Scraper_Adapter_Htmlscraping::getXmlObject() (it is long, so this is heavily abridged).

<?php
    final public function getXmlObject($response)
    {
...
        $xhtml = $this->getXhtml($response);
...
        $responseBody = preg_replace('/\sxmlns="[^"]+"/', '', $xhtml);
...
        try {
            //@see http://php.net/libxml.constants
            if (isset($this->config['libxmloptions'])) {
                $xml_object = @new SimpleXMLElement($responseBody, $this->config['libxmloptions']);
            } else {
                $xml_object = @new SimpleXMLElement($responseBody);
            }
        } catch (Exception $e) {
            require_once 'Diggin/Scraper/Adapter/Exception.php';
            throw new Diggin_Scraper_Adapter_Exception($e);
        }
...
        return $xml_object;
    }

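Here $xhtml is extracted from $response, and the SimpleXMLElement is built from it. I'm not sure exactly what $response is (judging by the stack trace, a Zend_Http_Response). Diggin_Scraper_Adapter_Htmlscraping::getXhtml() appears to clean up the HTML.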

Nothing looks wrong here... so is the HTML itself the cause?

Let's try it without Diggin_Scraper.

<?php
$url = 'http://creation.mb.softbank.jp/terminal/?lup=y&cat=http';
$content = file_get_contents($url);
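// loadHTML() is an instance method; calling it statically fails on PHP 8
$dom = new DOMDocument();
@$dom->loadHTML($content); // @ suppresses warnings about malformed HTML
$xml = simplexml_import_dom($dom);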
print_r($xml->xpath('//tr[@bgcolor="#FFFFFF"]/td[1]'));
/*
Array
(
    [0] => SimpleXMLElement Object
        (
            [0] => 831T
        )
...
*/

That works without any trouble. Hmm...

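I looked at the page source. The html tag carries an xmlns attribute. Wait, wasn't there some issue with SimpleXML and xmlns? I googled it. Found it!
Then again, Diggin_Scraper_Adapter_Htmlscraping::getXmlObject() was removing xmlns. So there shouldn't be a problem after all?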
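For reference, the SimpleXML and xmlns issue can be reproduced in a few lines. This is a minimal sketch with a made-up document (not the real SoftBank page); the prefix "x" is an arbitrary choice, and the third case uses the same regex that getXmlObject() applies.

```php
<?php
// Hypothetical minimal document with a default namespace.
$source = '<html xmlns="http://www.w3.org/1999/xhtml"><body>aaa</body></html>';

// 1. Default namespace present: unprefixed element names match nothing.
$xml = new SimpleXMLElement($source);
$plain = $xml->xpath('//body');

// 2. Same document, with the namespace bound to a prefix for XPath.
$xml->registerXPathNamespace('x', 'http://www.w3.org/1999/xhtml');
$prefixed = $xml->xpath('//x:body');

// 3. xmlns stripped before parsing, the way getXmlObject() does it.
$stripped = new SimpleXMLElement(preg_replace('/\sxmlns="[^"]+"/', '', $source));
$unqualified = $stripped->xpath('//body');

// Match counts for the three cases, in order.
printf("%d %d %d\n", count($plain), count($prefixed), count($unqualified));
```

Only the first case comes back empty, which is exactly the situation that makes extract() throw. So if the xmlns attribute somehow survives the regex, every unprefixed XPath expression will fail, "//body" included.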

I was getting lost, so I peeked at $responseBody, the argument passed to SimpleXMLElement.

<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html lang="ja" xml:lang="ja" dir="ltr" xmlns=
"http://www.w3.org/1999/xhtml">
<head>
...

There is a line break right after "xmlns=". Is that why the regex doesn't match?
I rewrote the xmlns-removing regex in Diggin_Scraper_Adapter_Htmlscraping::getXmlObject() so that it also allows for a line break.

<?php
$responseBody = preg_replace('/\sxmlns=\n?"[^"]+"/', '', $xhtml);

Result:

<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html lang="ja" xml:lang="ja" dir="ltr">
<head>
...

The xmlns attribute is gone, and the scrape results came back correctly too.
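The difference between the two patterns can be checked in isolation. The sketch below uses the head of the $responseBody dump from earlier as input; the third pattern, which allows any whitespace around the equals sign, is my own variation and not something Diggin ships.

```php
<?php
// The <html> tag as tidy emitted it, with the line break after "xmlns=".
$xhtml = '<html lang="ja" xml:lang="ja" dir="ltr" xmlns=' . "\n"
       . '"http://www.w3.org/1999/xhtml">';

// Pattern used by getXmlObject(): requires the quote right after "=".
$original = preg_replace('/\sxmlns="[^"]+"/', '', $xhtml);

// Patched pattern from this post: allows one optional newline.
$patched = preg_replace('/\sxmlns=\n?"[^"]+"/', '', $xhtml);

// Hypothetical, more tolerant variant: any whitespace around "=".
$tolerant = preg_replace('/\sxmlns\s*=\s*"[^"]+"/', '', $xhtml);

var_dump($original === $xhtml); // true means the original pattern changed nothing
echo $patched, PHP_EOL;
echo $tolerant, PHP_EOL;
```

The more tolerant variant would also cope with "\r\n" or indentation before the value, should tidy ever wrap the line that way.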

Next question: why is there a line break in the first place?
Tracing $responseBody backwards, the line break appears around the point where tidy cleans up the HTML inside Diggin_Scraper_Adapter_Htmlscraping::getXhtml().

<?php
$responseBody = str_replace('&', '&amp;', $responseBody);
$tidy = new tidy;
$tidy->parseString($responseBody, array('output-xhtml' => true), 'UTF8');
$tidy->cleanRepair();
$responseBody = $tidy->html();

More googling: apparently tidy wraps lines at 68 columns by default. So I added an option to turn wrapping off.

<?php
$tidy->parseString($responseBody, array('output-xhtml' => true, 'wrap' => 0), 'UTF8');

That worked.
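The effect of the wrap option can be tried outside Diggin as well. This is only a sketch: it assumes the tidy extension is installed, the helper repairHtml() and the sample markup are made up for illustration, and where exactly tidy breaks a line may differ between tidy versions.

```php
<?php
// Hypothetical helper: run tidy with a given wrap column (0 disables wrapping).
function repairHtml($html, $wrap)
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy extension not available
    }
    return tidy_repair_string(
        $html,
        array('output-xhtml' => true, 'wrap' => $wrap),
        'utf8'
    );
}

// Made-up page whose <html> tag is longer than 68 characters.
$html = '<html lang="ja" xml:lang="ja" dir="ltr" xmlns="http://www.w3.org/1999/xhtml">'
      . '<head><title>t</title></head><body>aaa</body></html>';

foreach (array(68, 0) as $wrap) {
    $out = repairHtml($html, $wrap);
    if (!is_string($out) || !preg_match('/<html[^>]*>/', $out, $m)) {
        echo "tidy not available or no <html> tag found\n";
        break;
    }
    // Report whether the <html> start tag got split across lines.
    printf(
        "wrap=%d: <html> tag %s a line break\n",
        $wrap,
        strpos($m[0], "\n") === false ? 'has no' : 'contains'
    );
}
```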
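But I don't like patching Diggin itself, so I looked for another way. What if I alter the response so that the line is short enough not to get wrapped?
And that is how I arrived at the code at the top of this post (phew, that took a while).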

Here is some code to reproduce the problem as well.

<?php
require_once 'Diggin/Scraper.php';
require_once 'Zend/Http/Client.php';
require_once 'Zend/Http/Client/Adapter/Test.php';

$adapter = new Zend_Http_Client_Adapter_Test();
$client = new Zend_Http_Client();
$client->setAdapter($adapter);
Diggin_Scraper::setHttpClient($client);

try {
    // This one throws an exception
    $adapter->setResponse(
        'HTTP/1.1 200 OK'        ."\r\n".
        'Content-type: text/html'."\r\n".
                                  "\r\n".
        '<html lang="ja" xml:lang="ja" dir="ltr" xmlns="http://www.w3.org/1999/xhtml">aaa</html>'
    );
    $scraper = new Diggin_Scraper();
    $scraper->process('/body', 'hoge => TEXT')
            ->scrape('http://localhost/');
    print_r($scraper->results);
} catch (Exception $e) {
    echo $e.PHP_EOL;
}

try {
    // This one succeeds
    $adapter->setResponse(
        'HTTP/1.1 200 OK'        ."\r\n".
        'Content-type: text/html'."\r\n".
                                  "\r\n".
        '<html><body>aaa</body></html>'
    );
    $scraper = new Diggin_Scraper();
    $scraper->process('/body', 'hoge => TEXT')
            ->scrape('http://localhost/');
    print_r($scraper->results);
} catch (Exception $e) {
    echo $e.PHP_EOL;
}