1. http://www.covert.io/post/18414889381/accumulo-nutch-and-gora
2. http://blog.packetloop.com/2012/03/packetpig-open-source-big-data-security.html
3. http://blog.csdn.net/lengyue365/article/details/7874003
4. http://nlp.solutions.asia/?p=232
5. http://wiki.apache.org/nutch/NewScoringIndexingExample
Environment:
accumulo 1.5
nutch 2.0 nutchgora git-branch
hadoop 0.20.2
zookeeper 3.4.3
gora
solr 3.6.1
Layout of the webpage table (Gora mapping):
<gora-orm>
<table name="webpage">
<family name="p" /> <!-- This can also have params like compression, bloom filters -->
<family name="f" />
<family name="s" />
<family name="il" />
<family name="ol" />
<family name="h" />
<family name="mtdt" />
<family name="mk" />
<config key="table.file.compress.blocksize" value="32K"/>
</table>
<class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
<!-- fetch fields -->
<field name="baseUrl" family="f" qualifier="bas"/>
<field name="status" family="f" qualifier="st"/>
<field name="prevFetchTime" family="f" qualifier="pts"/>
<field name="fetchTime" family="f" qualifier="ts"/>
<field name="fetchInterval" family="f" qualifier="fi"/>
<field name="retriesSinceFetch" family="f" qualifier="rsf"/>
<field name="reprUrl" family="f" qualifier="rpr"/>
<field name="content" family="f" qualifier="cnt"/>
<field name="contentType" family="f" qualifier="typ"/>
<field name="protocolStatus" family="f" qualifier="prot"/>
<field name="modifiedTime" family="f" qualifier="mod"/>
<!-- parse fields -->
<field name="title" family="p" qualifier="t"/>
<field name="text" family="p" qualifier="c"/>
<field name="parseStatus" family="p" qualifier="st"/>
<field name="signature" family="p" qualifier="sig"/>
<field name="prevSignature" family="p" qualifier="psig"/>
<!-- score fields -->
<field name="score" family="s" qualifier="s"/>
<field name="headers" family="h"/>
<field name="inlinks" family="il"/>
<field name="outlinks" family="ol"/>
<field name="metadata" family="mtdt"/>
<field name="markers" family="mk"/>
</class>
<table name="host">
<family name="mtdt" />
<family name="il" />
<family name="ol" />
</table>
<class table="host" keyClass="java.lang.String" name="org.apache.nutch.storage.Host">
<field name="metadata" family="mtdt"/>
<field name="inlinks" family="il"/>
<field name="outlinks" family="ol"/>
</class>
</gora-orm>
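The mapping above is what makes the scan output later in this post readable: each Accumulo column is printed as family:qualifier, for example f:fi for fetchInterval. As a rough aid, the sketch below parses a shortened copy of the mapping with Python's standard xml.etree.ElementTree and builds a lookup from column to field name. MAPPING_EXCERPT and column_lookup are illustrative names made up for this example; they are not part of Nutch or Gora.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the mapping above; only a few fields are kept for brevity.
MAPPING_EXCERPT = """
<gora-orm>
  <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
    <field name="status" family="f" qualifier="st"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="parseStatus" family="p" qualifier="st"/>
    <field name="score" family="s" qualifier="s"/>
    <field name="outlinks" family="ol"/>
  </class>
</gora-orm>
"""

def column_lookup(mapping_xml, table="webpage"):
    """Map 'family:qualifier' (or just 'family' when no qualifier is given) to the field name."""
    root = ET.fromstring(mapping_xml)
    lookup = {}
    for cls in root.findall("class"):
        if cls.get("table") != table:
            continue
        for field in cls.findall("field"):
            family = field.get("family")
            qualifier = field.get("qualifier")
            key = family if qualifier is None else f"{family}:{qualifier}"
            lookup[key] = field.get("name")
    return lookup

columns = column_lookup(MAPPING_EXCERPT)
print(columns["f:fi"])  # fetchInterval
print(columns["p:st"])  # parseStatus
print(columns["ol"])    # outlinks
```

Fields without a qualifier (headers, inlinks, outlinks, metadata, markers) use the whole family, which is why the scans show entries such as h:Server or mk:_injmrk_ where the qualifier is the map key.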
Log in to the Accumulo shell and switch to the webpage table:
./accumulo shell -u xxx -p xxx
root@inst> table webpage
1. In your home directory, create a file named urls containing the single line http://www.360buy.com/
Then run ./nutch inject ~/urls
root@inst webpage> scan -r com.360buy.www:http/
com.360buy.www:http/ f:fi [] \x00'\x8D\x00
com.360buy.www:http/ f:ts [] \x00\x00\x01:':\xA6\xE2
com.360buy.www:http/ mk:_injmrk_ [] y
com.360buy.www:http/ mtdt:_csh_ [] ?\x80\x00\x00
com.360buy.www:http/ s:s [] ?\x80\x00\x00
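Note that the row key is not the URL itself: http://www.360buy.com/ is stored as com.360buy.www:http/, with the host name reversed so that pages from the same domain sort next to each other. The sketch below reproduces that transformation in Python. The function name reverse_url is my own, and the handling of ports and query strings is an assumption modelled on what I remember of Nutch's TableUtil; the scans in this post only confirm the simple case.

```python
from urllib.parse import urlsplit

def reverse_url(url):
    """Build a row key of the form reversed.host:scheme[:port]/path[?query]."""
    parts = urlsplit(url)
    # www.360buy.com -> com.360buy.www
    host = ".".join(reversed(parts.hostname.split(".")))
    key = f"{host}:{parts.scheme}"
    if parts.port is not None:
        # Assumption: an explicit port is appended after the scheme.
        key += f":{parts.port}"
    path = parts.path
    if parts.query:
        path += "?" + parts.query
    return key + path

print(reverse_url("http://www.360buy.com/"))  # com.360buy.www:http/
```

This is the value to pass to scan -r when looking up a single page.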
2. ./nutch generate
root@inst webpage> scan -r com.360buy.www:http/
com.360buy.www:http/ f:fi [] \x00'\x8D\x00
com.360buy.www:http/ f:ts [] \x00\x00\x01:':\xA6\xE2
com.360buy.www:http/ mk:_gnmrk_ [] 1349277947-925721513
com.360buy.www:http/ mk:_injmrk_ [] y
com.360buy.www:http/ mtdt:_csh_ [] ?\x80\x00\x00
com.360buy.www:http/ s:s [] ?\x80\x00\x00
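The values of f:fi, f:ts and s:s look cryptic because the shell prints raw bytes, showing printable characters as-is and everything else as \xNN. Assuming the store writes numbers as big-endian binary (an assumption on my part, though it gives sensible numbers for the values above), they can be decoded as follows. The helper name unescape is made up for this example, and it would misread a value that contains a literal backslash followed by xNN.

```python
import re
import struct
from datetime import datetime, timezone

def unescape(shell_text):
    """Convert shell-style text (printable chars plus \\xNN escapes) back to raw bytes."""
    raw = shell_text.encode("latin-1")
    return re.sub(rb"\\x([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]),
                  raw)

# f:fi -> fetchInterval, assumed to be a 4-byte big-endian int (seconds)
fetch_interval = struct.unpack(">i", unescape(r"\x00'\x8D\x00"))[0]

# f:ts -> fetchTime, assumed to be an 8-byte big-endian long (milliseconds since the epoch)
fetch_time_ms = struct.unpack(">q", unescape(r"\x00\x00\x01:':\xA6\xE2"))[0]

# s:s -> score, assumed to be a 4-byte big-endian float
score = struct.unpack(">f", unescape(r"?\x80\x00\x00"))[0]

print(fetch_interval)            # 2592000, i.e. 30 days
print(fetch_time_ms)             # 1349277886178
print(datetime.fromtimestamp(fetch_time_ms // 1000, tz=timezone.utc).isoformat())
print(score)                     # 1.0
```

The decoded fetch time falls on 3 October 2012, shortly before the timestamp embedded in the batch id written by generate, which supports the big-endian reading.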
3. ./nutch fetch 1349277947-925721513
root@inst webpage> scan -r com.360buy.www:http/ -f 50
com.360buy.www:http/ f:bas [] http://www.360buy.com/
com.360buy.www:http/ f:cnt [] <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Trans
com.360buy.www:http/ f:fi [] \x00'\x8D\x00
com.360buy.www:http/ f:prot [] \x02\x00\x00
com.360buy.www:http/ f:pts [] \x00\x00\x01:':\xA6\xE2
com.360buy.www:http/ f:st [] \x00\x00\x00\x02
com.360buy.www:http/ f:ts [] \x00\x00\x01:'<\x8C}
com.360buy.www:http/ f:typ [] application/xhtml+xml
com.360buy.www:http/ h:Cache-Control [] max-age=120
com.360buy.www:http/ h:Connection [] close
com.360buy.www:http/ h:Content-Encoding [] gzip
com.360buy.www:http/ h:Content-Location [] http://www.360buy.com/index.htm
com.360buy.www:http/ h:Content-Type [] text/html; charset=gb2312
com.360buy.www:http/ h:Date [] Wed, 03 Oct 2012 15:27:02 GMT
com.360buy.www:http/ h:Last-Modified [] Wed, 03 Oct 2012 15:25:57 GMT
com.360buy.www:http/ h:Server [] JDWS
com.360buy.www:http/ h:Vary [] Accept-Encoding
com.360buy.www:http/ h:X-Cache [] MISS from TJ-HY-CNC-CDN-55.360buy.com
com.360buy.www:http/ h:_ip [] 125.39.96.182
com.360buy.www:http/ mk:_ftcmrk_ [] 1349277947-925721513
com.360buy.www:http/ mk:_gnmrk_ [] 1349277947-925721513
com.360buy.www:http/ mk:_injmrk_ [] y
com.360buy.www:http/ mtdt:_csh_ [] ?\x80\x00\x00
com.360buy.www:http/ s:s [] ?\x80\x00\x00
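Comparing the scans so far, each job leaves a marker in the mk family: _injmrk_ after inject, _gnmrk_ after generate, _ftcmrk_ after fetch, and (in the next step) __prsmrk__ after parse. The generate, fetch and parse markers carry the batch id. The sketch below reads the markers of a row to tell how far it has progressed. This interpretation is inferred from the scan output in this post, not taken from the Nutch source, and the names STAGE_MARKERS, completed_stages and batch_epoch_seconds are my own. Treating the first half of the batch id as a Unix timestamp in seconds is also an assumption.

```python
# Marker qualifiers as they appear in the mk family of the scans in this post.
STAGE_MARKERS = [
    ("inject", "_injmrk_"),
    ("generate", "_gnmrk_"),
    ("fetch", "_ftcmrk_"),
    ("parse", "__prsmrk__"),
]

def completed_stages(markers):
    """Return the stages whose marker is present, in pipeline order."""
    return [stage for stage, name in STAGE_MARKERS if name in markers]

def batch_epoch_seconds(batch_id):
    """Assumption: the part before '-' is a Unix timestamp in seconds."""
    return int(batch_id.split("-", 1)[0])

# Markers of the row after the fetch step, copied from the scan above.
row_markers = {
    "_ftcmrk_": "1349277947-925721513",
    "_gnmrk_": "1349277947-925721513",
    "_injmrk_": "y",
}

print(completed_stages(row_markers))                # ['inject', 'generate', 'fetch']
print(batch_epoch_seconds(row_markers["_gnmrk_"]))  # 1349277947
```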
4. ./nutch parse 1349277947-925721513
root@inst webpage> scan -r com.360buy.www:http/ -f 50
com.360buy.www:http/ f:bas [] http://www.360buy.com/
com.360buy.www:http/ f:cnt [] <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Trans
com.360buy.www:http/ f:fi [] \x00'\x8D\x00
com.360buy.www:http/ f:prot [] \x02\x00\x00
com.360buy.www:http/ f:pts [] \x00\x00\x01:':\xA6\xE2
com.360buy.www:http/ f:st [] \x00\x00\x00\x02
com.360buy.www:http/ f:ts [] \x00\x00\x01:'<\x8C}
com.360buy.www:http/ f:typ [] application/xhtml+xml
com.360buy.www:http/ h:Cache-Control [] max-age=120
com.360buy.www:http/ h:Connection [] close
com.360buy.www:http/ h:Content-Encoding [] gzip
com.360buy.www:http/ h:Content-Location [] http://www.360buy.com/index.htm
com.360buy.www:http/ h:Content-Type [] text/html; charset=gb2312
com.360buy.www:http/ h:Date [] Wed, 03 Oct 2012 15:27:02 GMT
com.360buy.www:http/ h:Last-Modified [] Wed, 03 Oct 2012 15:25:57 GMT
com.360buy.www:http/ h:Server [] JDWS
com.360buy.www:http/ h:Vary [] Accept-Encoding
com.360buy.www:http/ h:X-Cache [] MISS from TJ-HY-CNC-CDN-55.360buy.com
com.360buy.www:http/ h:_ip [] 125.39.96.182
com.360buy.www:http/ mk:__prsmrk__ [] 1349277947-925721513
com.360buy.www:http/ mk:_ftcmrk_ [] 1349277947-925721513
com.360buy.www:http/ mk:_gnmrk_ [] 1349277947-925721513
com.360buy.www:http/ mk:_injmrk_ [] y
com.360buy.www:http/ mtdt:_csh_ [] ?\x80\x00\x00
com.360buy.www:http/ ol:http://app.360buy.com/ [] \xE6\x89\x8B\xE6\x9C\xBA\xE4\xBA\xAC\xE4\xB8\x9C
com.360buy.www:http/ ol:http://book.360buy.com/ [] \xE5\x9B\xBE\xE4\xB9\xA6
com.360buy.www:http/ ol:http://caipiao.360buy.com/ [] \xE5\xBD\xA9\xE7\xA5\xA8
com.360buy.www:http/ ol:http://chat.360buy.com/jdchat/custom.action [] \xE5\x9C\xA8\xE7\xBA\xBF\xE5\xAE\xA2\xE6\x9C\x8D
com.360buy.www:http/ ol:http://chongzhi.360buy.com/ [] \xE5\x85\x85\xE5\x80\xBC
com.360buy.www:http/ ol:http://diy.360buy.com/ [] \xE8\xA3\x85\xE6\x9C\xBA\xE5\xA4\xA7\xE5\xB8\x88
com.360buy.www:http/ ol:http://e.360buy.com/index.html [] \xE7\x94\xB5\xE5\xAD\x90\xE4\xB9\xA6\xE5\x88\x8A
com.360buy.www:http/ ol:http://game.360buy.com/ [] \xE6\xB8\xB8\xE6\x88\x8F
com.360buy.www:http/ ol:http://help.360buy.com/ [] \xE5\xAE\xA2\xE6\x88\xB7\xE6\x9C\x8D\xE5\x8A\xA1
com.360buy.www:http/ ol:http://help.360buy.com/help/question-61.html [] \xE5\xB8\xB8\xE8\xA7\x81\xE9\x97\xAE\xE9\xA2\x98
com.360buy.www:http/ ol:http://home.360buy.com/ [] \xE6\x88\x91\xE7\x9A\x84\xE4\xBA\xAC\xE4\xB8\x9C
com.360buy.www:http/ ol:http://jd2008.360buy.com/JdHome/OrderList.aspx [] \xE6\x88\x91\xE7\x9A\x84\xE8\xAE\xA2\xE5\x8D\x95
com.360buy.www:http/ ol:http://jd2008.360buy.com/purchase/ShoppingCart.asp [] \xE5\x8E\xBB\xE8\xB4\xAD\xE7\x89\xA9\xE8\xBD\xA6\xE7\xBB\x93\xE7\xAE\x97
com.360buy.www:http/ ol:http://market.360buy.com/giftcard/ [] \xE7\xA4\xBC\xE5\x93\x81\xE5\x8D\xA1
com.360buy.www:http/ ol:http://market.360buy.com/giftcard/company/default. [] \xE4\xBC\x81\xE4\xB8\x9A\xE5\xAE\xA2\xE6\x88\xB7
com.360buy.www:http/ ol:http://mvd.360buy.com/ [] \xE9\x9F\xB3\xE5\x83\x8F
com.360buy.www:http/ ol:http://myjd.360buy.com/opinion/list.action [] \xE6\x8A\x95\xE8\xAF\x89\xE4\xB8\xAD\xE5\xBF\x83
com.360buy.www:http/ ol:http://myjd.360buy.com/repair/orderlist.action [] \xE5\x94\xAE\xE5\x90\x8E\xE6\x9C\x8D\xE5\x8A\xA1
com.360buy.www:http/ ol:http://read.360buy.com/ [] \xE5\x9C\xA8\xE7\xBA\xBF\xE8\xAF\xBB\xE4\xB9\xA6
com.360buy.www:http/ ol:http://sale.360buy.com/p10997.html [] \xE5\x8A\x9E\xE5\x85\xAC\xE7\x9B\xB4\xE9\x80\x9A\xE8\xBD\xA6
com.360buy.www:http/ ol:http://trip.360buy.com/ [] \xE6\x97\x85\xE8\xA1\x8C
com.360buy.www:http/ ol:http://www.360buy.com/ [] \xE9\xA6\x96\xE9\xA1\xB5
com.360buy.www:http/ ol:http://www.360buy.com/auto.html [] \xE6\xB1\xBD\xE8\xBD\xA6\xE7\x94\xA8\xE5\x93\x81
com.360buy.www:http/ ol:http://www.360buy.com/baby.html [] \xE6\xAF\x8D\xE5\xA9\xB4
com.360buy.www:http/ ol:http://www.360buy.com/bag.html [] \xE7\xA4\xBC\xE5\x93\x81\xE7\xAE\xB1\xE5\x8C\x85
com.360buy.www:http/ ol:http://www.360buy.com/beauty.html [] \xE4\xB8\xAA\xE6\x8A\xA4\xE5\x8C\x96\xE5\xA6\x86
com.360buy.www:http/ ol:http://www.360buy.com/clothing.html [] \xE6\x9C\x8D\xE9\xA5\xB0\xE9\x9E\x8B\xE5\xB8\xBD
com.360buy.www:http/ ol:http://www.360buy.com/computer.html [] \xE7\x94\xB5\xE8\x84\x91\xE3\x80\x81\xE5\x8A\x9E\xE5\x85\xAC
com.360buy.www:http/ ol:http://www.360buy.com/contact/service.html [] \xE5\xAE\xA2\xE6\x9C\x8D\xE9\x82\xAE\xE7\xAE\xB1
com.360buy.www:http/ ol:http://www.360buy.com/digital.html [] \xE6\x89\x8B\xE6\x9C\xBA\xE6\x95\xB0\xE7\xA0\x81
com.360buy.www:http/ ol:http://www.360buy.com/electronic.html [] \xE5\xAE\xB6\xE7\x94\xA8\xE7\x94\xB5\xE5\x99\xA8
com.360buy.www:http/ ol:http://www.360buy.com/food.html [] \xE9\xA3\x9F\xE5\x93\x81\xE9\xA5\xAE\xE6\x96\x99\xE3\x80\x81\xE4\xBF\x9D\xE5\x81\xA5\xE9\xA3\x9F\xE5\x93\x81
com.360buy.www:http/ ol:http://www.360buy.com/home.html [] \xE5\xAE\xB6\xE5\xB1\x85\xE5\xAE\xB6\xE8\xA3\x85
com.360buy.www:http/ ol:http://www.360buy.com/jewellery.html [] \xE7\x8F\xA0\xE5\xAE\x9D
com.360buy.www:http/ ol:http://www.360buy.com/kitchenware.html [] \xE5\x8E\xA8\xE5\x85\xB7
com.360buy.www:http/ ol:http://www.360buy.com/sports.html [] \xE8\xBF\x90\xE5\x8A\xA8\xE5\x81\xA5\xE5\xBA\xB7
com.360buy.www:http/ ol:http://www.360buy.com/toys.html [] \xE7\x8E\xA9\xE5\x85\xB7\xE4\xB9\x90\xE5\x99\xA8
com.360buy.www:http/ ol:http://www.360buy.com/watch.html [] \xE9\x92\x9F\xE8\xA1\xA8
com.360buy.www:http/ ol:http://www.360top.com/ [] 360TOP \xE5\xA5\xA2\xE4\xBE\x88\xE5\x93\x81
com.360buy.www:http/ ol:http://www.ehaoyao.com/ [] \xE4\xBA\xAC\xE4\xB8\x9C \xE5\xA5\xBD\xE8\x8D\xAF\xE5\xB8\x88
com.360buy.www:http/ ol:http://www.minitiao.com/ [] \xE8\xBF\xB7\xE4\xBD\xA0\xE6\x8C\x91
com.360buy.www:http/ ol:http://xiaoyuan.360buy.com/ [] \xE6\xA0\xA1\xE5\x9B\xAD\xE9\xA2\x91\xE9\x81\x93
com.360buy.www:http/ p:c [] \xE4\xBA\xAC\xE4\xB8\x9C\xE7\xBD\x91\xE4\xB8\x8A\xE5\x95\x86\xE5\x9F\x8E-\xE7\xBB\xBC\xE5\x90\x88\xE7\xBD\x91\xE8\xB4\xAD\xE9\xA6\x96\xE9\x80\x89\xEF\xBC\x8C\xE6\xAD\xA3\xE5\x93\x81\xE8\xA1\x8C\xE8
com.360buy.www:http/ p:sig [] HNC\xF3\x87\xEF\x8E\xD1mB\xE4\xE3\xA2\xA3\x1D\xEA
com.360buy.www:http/ p:st [] \x02\x00\x00
com.360buy.www:http/ p:t [] \xE4\xBA\xAC\xE4\xB8\x9C\xE7\xBD\x91\xE4\xB8\x8A\xE5\x95\x86\xE5\x9F\x8E-\xE7\xBB\xBC\xE5\x90\x88\xE7\xBD\x91\xE8\xB4\xAD\xE9\xA6\x96\xE9\x80\x89\xEF\xBC\x8C\xE6\xAD\xA3\xE5\x93\x81\xE8\xA1\x8C\xE8
com.360buy.www:http/ s:s [] ?\x80\x00\x00
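After parse, every outlink becomes a column in the ol family: the qualifier is the target URL and the value is the anchor text. The anchor text is Chinese, so the shell prints its UTF-8 bytes as \xNN escapes. A minimal sketch for turning them back into readable text, assuming the values are UTF-8 (unescape and decode_anchor are helper names made up for this example):

```python
import re

def unescape(shell_text):
    """Convert shell-style text (printable chars plus \\xNN escapes) back to raw bytes."""
    raw = shell_text.encode("latin-1")
    return re.sub(rb"\\x([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]),
                  raw)

def decode_anchor(shell_text):
    """Decode an escaped value as UTF-8; truncated trailing bytes become U+FFFD."""
    return unescape(shell_text).decode("utf-8", errors="replace")

# ol:http://app.360buy.com/
print(decode_anchor(r"\xE6\x89\x8B\xE6\x9C\xBA\xE4\xBA\xAC\xE4\xB8\x9C"))  # 手机京东
# ol:http://book.360buy.com/
print(decode_anchor(r"\xE5\x9B\xBE\xE4\xB9\xA6"))                          # 图书
# ol:http://www.360top.com/
print(decode_anchor(r"360TOP \xE5\xA5\xA2\xE4\xBE\x88\xE5\x93\x81"))       # 360TOP 奢侈品
```

The p:c and p:t values are cut off in the output above and end in the middle of a multi-byte character, which is why the sketch decodes with errors="replace" instead of failing.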
5. ./nutch updatedb
6. ./nutch solrindex http://localhost:8983/solr/ 1349277947-925721513
./nutch solrindex http://localhost:8983/solr/ -reindex