WebMagic's architecture is modeled on Scrapy.
Create a new Maven project in IntelliJ IDEA.
1. Dependency configuration
WebMagicSpider/pom.xml
<dependencies>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.7.3</version>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.7.3</version>
        <exclusions>
            <!-- Exclude the transitive slf4j-log4j12 binding here so it is not pulled in twice -->
            <exclusion>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-log4j12</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
</dependencies>
2. Logging configuration
WebMagicSpider/src/main/resources/log4j.properties
log4j.rootLogger=WARN, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
Building the project
1. Writing the crawler
WebMagicSpider/src/main/java/BaiduPageProcessor.java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;
public class BaiduPageProcessor implements PageProcessor {

    // Site-level configuration: retry once on failure, wait 1s between requests
    private Site site = Site.me()
            .setRetryTimes(1)
            .setSleepTime(1000)
            .setCharset("utf-8");

    @Override
    public void process(Page page) {
        // Extract the text of the page's <title> element
        page.putField("title", page.getHtml().css("title", "text").toString());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new BaiduPageProcessor())
                .addUrl("https://www.baidu.com/")
                .addPipeline(new ConsolePipeline())
                .addPipeline(new JsonFilePipeline("/Users/qmp/myproject/WebMagicSpider"))
                .thread(1)
                .run();
    }
}
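The `css("title", "text")` call in `process` extracts the text inside the page's `<title>` element. As a plain-JDK illustration of that extraction (WebMagic itself uses a real HTML parser; this regex-based sketch is a simplification and would not handle every HTML edge case):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {
    // Roughly what css("title", "text") yields: the text between <title> tags
    static String extractTitle(String html) {
        Matcher m = Pattern.compile("<title>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>百度一下,你就知道</title></head></html>";
        System.out.println(extractTitle(html));
    }
}
```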
2. Running the program
Console output
get page: https://www.baidu.com/
title: 百度一下,你就知道
File output
{"title":"百度一下,你就知道"}
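The file output above comes from `JsonFilePipeline`, which serializes each page's result fields to a JSON file under the directory passed to its constructor. A rough, plain-JDK sketch of that behavior (the real pipeline derives the file name from the request URL and uses a proper JSON serializer; this simplification handles only flat string fields):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

public class JsonFileSketch {
    // Minimal JSON serialization for flat String fields, escaping quotes and backslashes
    static String toJson(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) sb.append(",");
            first = false;
            sb.append("\"").append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append("\"");
        }
        return sb.append("}").toString();
    }

    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "百度一下,你就知道");
        Path out = Files.createTempFile("result", ".json");
        Files.writeString(out, toJson(fields));
        System.out.println(Files.readString(out));
    }
}
```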