1. Introduction
I've recently been obsessed with history and have been reading a serialized historical novel on a WeChat official account. Unfortunately, unlike a proper e-book reader, the account doesn't remember your reading position, so every day I had to think back to where I left off. For a lazy person, especially one with a memory as bad as mine, that's a disaster.

It would be much nicer to read it on my phone like an e-book. Since the novel is still being serialized, there's nothing to download online, so the only option was to do it myself with a scraper. The whole thing took about a day.
2. Getting Started
2.1 Technology Choice
For scraping, JavaScript is arguably the better fit, but I'm not very familiar with the relevant Node.js libraries, so for speed I went with Java.

Tech stack: Java + Jsoup, with Maven as the build tool and Alibaba's fastjson for JSON handling.

Maven dependency coordinates:
Jsoup
```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>
```
fastjson
```xml
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.62</version>
</dependency>
```
2.2 Approach
A scraper isn't as complicated as it sounds: you use an HTTP library to send requests, then parse the DOM to pull out whatever you need.

Done by hand, the workflow would be: open an article, save it to a txt file, move on to the next one, and repeat until the last article. All we have to teach the scraper is how, after finishing one article, to find the next one; the rest is just cleaning up the response text and appending it to the txt file.
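The "append each chapter to a txt file" part of that workflow can be sketched with the standard library alone. This is a minimal, self-contained sketch; the chapter titles and bodies below are made up stand-ins for the scraped articles:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class SaveChapters {
    public static void main(String[] args) throws IOException {
        // Hypothetical chapters standing in for scraped articles
        List<String[]> chapters = List.of(
                new String[]{"Chapter 1", "First body"},
                new String[]{"Chapter 2", "Second body"});

        Path out = Files.createTempFile("book", ".txt");
        for (String[] chapter : chapters) {
            String block = chapter[0] + "\n" + chapter[1] + "\n\n";
            // Append each chapter as UTF-8 so nothing depends on the platform charset
            Files.write(out, block.getBytes(StandardCharsets.UTF_8), StandardOpenOption.APPEND);
        }
        System.out.println(Files.readString(out).contains("Chapter 2")); // prints true
    }
}
```

Writing explicitly in UTF-8 matters here because the book is Chinese; relying on the platform default charset on Windows would mangle it.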
So how do we find the request URL?
2.3 Analyzing the Request URL
Open WeChat -> the official account -> open an article -> tap the browser icon in the top-left corner to open it in the default browser. From there we can use the browser's developer tools (F12) to inspect the browser's behavior, i.e. its requests.

Since I want every article, I clicked the topic link under the title to get the full album page.

Open the Network tab (网络, for those running a localized browser), refresh the page, and all the requests show up. There are over five hundred articles, so they can't possibly load all at once; my guess was that the page fetches them with paginated queries returning JSON, then updates the DOM.

To narrow down the candidates, add `method:GET` as a condition in the Filter box.

That leaves a handful of requests. Checking their responses, they really do contain the chapter list, so we have our URL. Each entry in the response also carries a `url` field, which is exactly what we need in the next step to fetch the article body:
```json
{
    "base_resp": {
        "ret": 0
    },
    "getalbum_resp": {
        "article_list": [{
            "cover_img_1_1": "https://mmbiz.qlogo.cn/sz_mmbiz_jpg/WnJxyMoT64YsTMMylbRsFTjic83kSbKMibjcGUNU6Zdg0QyzJLFwTvvko2WPeprfK2XZ95yZTxgYoVZKavwkXxPg/300",
            "create_time": "1617326243",
            "is_pay_subscribe": "0",
            "is_read": "0",
            "itemidx": "8",
            "key": "3247220375_2654026722_8",
            "msgid": "2654026722",
            "title": "秦 朝 (11)泰山封禅",
            "url": "http://mp.weixin.qq.com/s?__biz=MzI0NzIyMDM3NQ==&mid=2654026722&idx=8&sn=44c6f95d0f298aa5d2756d88d79eaa24&chksm=f276ca1ac501430c6669ff4314626e7f3865ef7f58f97c2957f168581d1a9dafb4662a3f27ae#rd",
            "user_read_status": "0"
        }, {
            "cover_img_1_1": "https://mmbiz.qlogo.cn/sz_mmbiz_jpg/WnJxyMoT64YsTMMylbRsFTjic83kSbKMibPVJQ5z7zB0CKa0V8tFYLVteIKXrzvnoVu3nT7nUZmQWw6pZuW7sZAw/300",
            "create_time": "1617382342",
            "is_pay_subscribe": "0",
            "is_read": "0",
            "itemidx": "1",
            "key": "3247220375_2654026862_1",
            "msgid": "2654026862",
            "title": "秦 朝 (12)鬼迷心窍",
            "url": "http://mp.weixin.qq.com/s?__biz=MzI0NzIyMDM3NQ==&mid=2654026862&idx=1&sn=9092e2f4911cb2d24cf83d5d4b7948be&chksm=f276cd96c50144800d1aa653421f2e4399042d0867b67e143390d7f2f10e566681f86d8201b8#rd",
            "user_read_status": "0"
        }],
        "base_info": {
            "is_first_screen": "0"
        },
        "continue_flag": "1",
        "reverse_continue_flag": "1"
    }
}
```
2.4 Analyzing the Pagination Parameters
Open the request and inspect its payload.

Quite a few parameters are sent without values, which suggests they don't matter, or at least don't affect the result. Programmer's intuition said `begin_itemidx` and `count` were the pagination parameters, and a round of experiments in Postman confirmed the guess.

`count` is easy: it's the page size, i.e. how many records each request returns. `begin_itemidx` was stranger. Comparing responses against request parameters turned up no obvious pattern, but reading the name as beginItemIndex, i.e. which element to start from, suggested a cursor: since pages are fetched one after another, this value should identify the last item of the previous page. Testing confirmed exactly that.
With the URL and parameters pinned down, all that's left is the code: fetch the first page, take the `itemidx` of its last record, use it to request the next page, and repeat until an empty array comes back.
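That cursor-style pagination can be sketched independently of WeChat. Below is a minimal, self-contained illustration; the in-memory `fetchPage` is a hypothetical stand-in for the real endpoint, assumed to return items strictly after the cursor:

```java
import java.util.ArrayList;
import java.util.List;

public class CursorPaging {
    // Stand-in for the remote endpoint: return up to `count` items
    // that come strictly after the cursor `beginItemIdx`.
    static List<Integer> fetchPage(List<Integer> all, int beginItemIdx, int count) {
        List<Integer> page = new ArrayList<>();
        for (int item : all) {
            if (item > beginItemIdx && page.size() < count) {
                page.add(item);
            }
        }
        return page;
    }

    public static void main(String[] args) {
        List<Integer> server = new ArrayList<>();
        for (int i = 1; i <= 25; i++) server.add(i);

        List<Integer> collected = new ArrayList<>();
        int cursor = 0; // before the first item
        List<Integer> page;
        do {
            page = fetchPage(server, cursor, 10);
            if (!page.isEmpty()) {
                collected.addAll(page);
                cursor = page.get(page.size() - 1); // last item becomes the next cursor
            }
        } while (!page.isEmpty()); // an empty page means we are done

        System.out.println(collected.size()); // prints 25
    }
}
```

Unlike page-number pagination, this scheme stays consistent even if new articles are published mid-crawl, since each request is anchored to a concrete record rather than an offset.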
2.6 Fetching All Chapters
The approach was analyzed above: a do/while loop whose exit condition is an empty array. One catch: perhaps to make scraping less convenient, WeChat doesn't always return an array. Only when a page holds more than one record does `article_list` come back as an array; a single record comes back as a plain object.

I took a shortcut: try to read it as an array first, and if that throws, it must be a single object, so read it as one.

Scraping too frequently also triggers errors, which I handled as well: on a timeout, control falls into the catch block, which calls Thread.sleep(3000) to wait three seconds before retrying, so no chapter gets skipped.
```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;

public class Collect {

    public static final String ALBUM_ID = "1940285848978063363";
    public static final String ACTION = "getalbum";
    public static final String BIZ = "MzI0NzIyMDM3NQ%3D%3D";
    public static final String F = "json";
    // Divider written between chapters ("章节结束" = "chapter end")
    public static final String DIVIDED = "\n\n################# 章节结束 #################\n\n";

    public static void main(String[] args) throws Exception {
        List<News> newsList = getNewsList();
        System.out.println("Chapters fetched: " + newsList.size());
    }

    public static List<News> getNewsList() throws Exception {
        List<News> collections = new ArrayList<>();
        boolean flag = true;
        int beginItemIdx = 1;
        String beginMsgId = "2654026583";
        List<News> news;
        do {
            try {
                news = getListInfo(Integer.toString(beginItemIdx), beginMsgId, 10);
                if (!news.isEmpty()) {
                    collections.addAll(news);
                    System.out.println("Fetched " + news.size() + " records");
                    // The last record of this page becomes the cursor for the next page
                    News lastOne = news.get(news.size() - 1);
                    beginItemIdx = lastOne.getItemidx();
                    beginMsgId = lastOne.getMsgid();
                    System.out.println("Last record title: " + lastOne.getTitle());
                } else {
                    flag = false;
                }
            } catch (Exception e) {
                // Most likely a timeout from requesting too often: back off and retry
                System.out.println("Request failed, retrying in three seconds");
                Thread.sleep(3000);
            }
        } while (flag);
        return collections;
    }

    public static List<News> getListInfo(String beginItemIdx, String beginMsgId, Integer pageSize) throws Exception {
        if (pageSize >= 20) {
            pageSize = 20;
        }

        Connection connection = Jsoup.connect("https://mp.weixin.qq.com/mp/appmsgalbum");
        connection.ignoreContentType(true);
        connection.header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0");
        connection.header("Accept", "application/json, text/javascript, */*; q=0.01");
        connection.header("Connection", "keep-alive");

        connection.data("action", ACTION);
        connection.data("album_id", ALBUM_ID);
        connection.data("begin_itemidx", beginItemIdx);
        connection.data("begin_msgid", beginMsgId);
        connection.data("count", pageSize.toString());
        connection.data("f", F);
        connection.data("__biz", BIZ);
        Document document = connection.get();

        JSONObject jsonObject = JSON.parseObject(document.body().text()).getJSONObject("getalbum_resp");
        if (jsonObject.containsKey("article_list")) {
            try {
                // Normal case: more than one record, article_list is an array
                JSONArray list = jsonObject.getJSONArray("article_list");
                List<News> news = new ArrayList<>();
                for (int i = 0; i < list.size(); i++) {
                    news.add(toNews(list.getJSONObject(i)));
                }
                return news;
            } catch (Exception e) {
                // Single record: article_list is a plain object, not an array
                JSONObject single = jsonObject.getJSONObject("article_list");
                List<News> news = new ArrayList<>();
                news.add(toNews(single));
                return news;
            }
        }
        return new ArrayList<>();
    }

    public static News toNews(JSONObject jsonObject) {
        News single = new News();
        single.setKey(jsonObject.getString("key"));
        single.setCreateTime(jsonObject.getString("create_time"));
        single.setTitle(jsonObject.getString("title"));
        single.setUrl(jsonObject.getString("url"));
        single.setMsgid(jsonObject.getString("msgid"));
        single.setItemidx(jsonObject.getInteger("itemidx"));
        return single;
    }
}
```
2.7 Fetching Article Bodies and Writing the File
Section 2.6 gave us the full chapter list; now we iterate over it, fetch each article's body, and write it to a txt file through a file output stream.

Two methods are added: one fetches the article body, the other formats the returned text (strips the HTML tags).
```java
import java.io.File;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Collect {

    public static final String ALBUM_ID = "1940285848978063363";
    public static final String ACTION = "getalbum";
    public static final String BIZ = "MzI0NzIyMDM3NQ%3D%3D";
    public static final String F = "json";
    public static final String DIVIDED = "\n\n################# 章节结束 #################\n\n";

    // getNewsList(), getListInfo() and toNews() from section 2.6 are unchanged and omitted here

    public static void main(String[] args) throws Exception {
        File file = new File("C:\\Users\\15017\\Desktop\\从秦朝说起,到清朝结束.txt");
        FileOutputStream fileOutputStream = new FileOutputStream(file);

        // The very first chapter is not in the album, so it is written by hand
        fileOutputStream.write("秦 朝 (1)奇货可居".getBytes(StandardCharsets.UTF_8));
        fileOutputStream.write(getContent("https://mp.weixin.qq.com/s/fULbB0Ws3DjodEPUTwyWhQ").getBytes(StandardCharsets.UTF_8));
        fileOutputStream.write(DIVIDED.getBytes(StandardCharsets.UTF_8));

        List<News> newsList = getNewsList();
        System.out.println("Chapters fetched: " + newsList.size());

        for (int i = 0; i < newsList.size(); i++) {
            News chapter = newsList.get(i);
            try {
                String title = chapter.getTitle() + "\n";
                fileOutputStream.write(title.getBytes(StandardCharsets.UTF_8));
                String content = getContent(chapter.getUrl());
                if (content.length() < 100) {
                    // Suspiciously short body: the article is probably blocked
                    content = "当前文章内容不可查看";
                }
                fileOutputStream.write(content.getBytes(StandardCharsets.UTF_8));
                fileOutputStream.write(DIVIDED.getBytes(StandardCharsets.UTF_8));
                System.out.println("Wrote chapter \"" + chapter.getTitle() + "\"");
            } catch (Exception e) {
                e.printStackTrace();
                i--; // retry the same chapter on failure
            }
        }

        fileOutputStream.flush();
        fileOutputStream.close();
    }

    public static String getContent(String url) {
        try {
            Connection connection = Jsoup.connect(url);
            connection.ignoreContentType(true);
            connection.header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0");
            connection.header("Accept", "application/json, text/javascript, */*; q=0.01");
            connection.header("Connection", "keep-alive");

            Document document = connection.get();
            // The article body lives in the .rich_media_content element
            Elements textNode = document.select(".rich_media_content");
            if (!textNode.isEmpty()) {
                Elements elements = textNode.get(0).getAllElements();
                StringBuilder builder = new StringBuilder();
                elements.forEach(item -> builder.append(item.toString()));
                return format(builder.toString());
            } else {
                System.out.println("Article not viewable");
                return "";
            }
        } catch (Exception e) {
            e.printStackTrace();
            return "";
        }
    }

    // Strip HTML tags, leaving plain text
    public static String format(String htmlStr) {
        return htmlStr.replaceAll("<[^>]+>", "");
    }
}
```
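The regex in format() is a blunt instrument, so it's worth seeing exactly what it does and doesn't handle. A quick stdlib-only demonstration using the same regex:

```java
public class StripTags {
    // Same approach as format() above: remove anything that looks like a tag
    static String format(String htmlStr) {
        return htmlStr.replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        System.out.println(format("<p>Hello <b>world</b></p>"));
        // prints: Hello world

        // Caveat: HTML entities are left as-is, and any text inside
        // <script>/<style> elements would survive the stripping.
        System.out.println(format("a &amp; b"));
        // prints: a &amp; b
    }
}
```

If entities or embedded scripts ever become a problem, Jsoup already provides a more robust alternative: `Jsoup.parse(html).text()` returns the decoded, tag-free text of a document.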
3. Appendix
3.1 The News Entity

Its fields map directly to the fields of the response data.
```java
package com.rambler.test.entity;

import java.io.Serializable;

public class News implements Serializable {

    private static final long serialVersionUID = 969103543763776567L;

    private String key;
    private String createTime;
    private Integer itemidx;
    private String msgid;
    private String title;
    private String content;
    private String url;

    public String getKey() { return key; }
    public void setKey(String key) { this.key = key; }

    public String getCreateTime() { return createTime; }
    public void setCreateTime(String createTime) { this.createTime = createTime; }

    public Integer getItemidx() { return itemidx; }
    public void setItemidx(Integer itemidx) { this.itemidx = itemidx; }

    public String getMsgid() { return msgid; }
    public void setMsgid(String msgid) { this.msgid = msgid; }

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public String getContent() { return content; }
    public void setContent(String content) { this.content = content; }

    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }
}
```
4. Summary
Nothing particularly difficult overall, and I picked up a new pagination idea along the way: instead of page numbers, the last record of each page serves as the cursor for the next request.