Last updated: 2012-04-05 (Thu) 05:38:02
DOMDocument
PHP
$xml = new DOMDocument('1.0', 'UTF-8');
Sample: scraping in combination with DOMXPath
Garbled characters (mojibake) with loadHTML
Pay attention when loading HTML that has a different charset than ISO-8859-1. Since this method does not actively try to figure out what encoding the HTML you are loading uses (as most browsers do), you have to specify it in the HTML head. If, for instance, your HTML is in UTF-8, make sure you have a meta tag in the HTML's head section:

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>

If you do not specify the charset like this, all high-ASCII bytes will be HTML-encoded. It is not enough to set the DOM document you are loading the HTML into to UTF-8.
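When the fetched markup cannot be edited at the source, one common way to satisfy the meta tag requirement is to prepend the tag to the string before calling loadHTML. The following is a minimal sketch under that assumption, not a definitive fix; `loadUtf8Html` is a hypothetical helper name, and the input is assumed to already be UTF-8.

```php
<?php
// Hypothetical helper (not part of the original page): declares UTF-8 by
// prepending the meta tag, so loadHTML does not assume ISO-8859-1.
function loadUtf8Html(string $html): DOMDocument
{
    $dom = new DOMDocument('1.0', 'UTF-8');

    // Collect parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    return $dom;
}

// Usage: the input string is assumed to be UTF-8 encoded.
$dom  = loadUtf8Html('<p>日本語</p>');
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

Compared with converting the whole string to HTML entities, this leaves the markup itself untouched, but it only helps when the input really is UTF-8.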
<?php
$html = file_get_contents("http://example.com");
// Convert to HTML entities to avoid garbled characters
// (note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2)
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'auto');
$dom = new DOMDocument('1.0', 'UTF-8');
libxml_use_internal_errors(TRUE);  // hide parse-error warnings
$dom->loadHTML($html);
libxml_use_internal_errors(FALSE); // restore
$xpath = new DOMXPath($dom);
$results = $xpath->evaluate("/html/body//h3/a"); // select nodes with XPath
$i = 1;
foreach ($results as $item) {
    $href = $item->getAttribute('href');
    $text = $item->textContent;
    // Any of these three seems to return the text
    //var_dump($item->nodeValue);
    //var_dump($item->textContent);
    //var_dump($item->firstChild->data);
    echo "$i <a href='{$href}'>{$text}</a><br>";
    $i++;
}
?>
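The same XPath extraction can be tried offline against an inline string. The sketch below is an illustration only: the sample markup and the `$links` variable are assumptions, not part of the original page. It also escapes the extracted values with htmlspecialchars before echoing them, since attribute values and text come back from the DOM already decoded.

```php
<?php
// Inline sample markup (an assumption for illustration; the original fetches a URL).
$html = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
      . '<body><h3><a href="/a?x=1&amp;y=2">First &amp; foremost</a></h3>'
      . '<h3><a href="/b">Second</a></h3></body></html>';

$dom = new DOMDocument();
$previous = libxml_use_internal_errors(true); // collect parse warnings silently
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);        // restore the previous setting

$xpath = new DOMXPath($dom);
$links = [];
foreach ($xpath->query('/html/body//h3/a') as $a) {
    // getAttribute() and textContent return decoded strings, so re-escape for output.
    $links[] = sprintf(
        '<a href="%s">%s</a>',
        htmlspecialchars($a->getAttribute('href'), ENT_QUOTES, 'UTF-8'),
        htmlspecialchars($a->textContent, ENT_QUOTES, 'UTF-8')
    );
}
echo implode("<br>\n", $links), "\n";
```

Escaping matters when the scraped page is untrusted: without it, a link text containing markup would be echoed into the output page as-is.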