Some documents cannot be downloaded #8

Open
lion-no-back opened this issue Jul 24, 2022 · 9 comments

Comments

@lion-no-back

Download link: https://max.book118.com/html/2020/1217/8032056040003027.shtm

It looks like any document that supports full-text preview can be downloaded in full. Can documents like this one, where only half is viewable, be fetched as well?

@kermsite
Contributor

No. The tool cannot get past the paywall.

@lion-no-back
Author

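I see. From what I can tell, this approach uses a bundled browser engine to crawl the pages and print them to PDF. In that case, could a previewable PPT be downloaded as a PPT, that is, in its original format? That would be great if it can be done.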

@kermsite
Contributor

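That is technically hard to do. PPT is a dynamic format with animations, whereas the tool currently crawls by taking screenshots. You are of course free to build on it; contributions are very welcome.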

@lion-no-back
Author

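Someone online has already done this, though they used a Chrome driver for the crawling. I'm a long way from extending the tool myself; I'm still learning. I think what you have now works well too. At the very least it is relatively easy to get started with, and it's good to have another option.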

@kermsite
Contributor

That person is impressive. Did they open-source it? I'd like to learn from it.

@lion-no-back
Author

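It isn't open source. It's a tool I found online: you download a browser engine matching the version of your installed Chrome, and it then calls Python to do the crawling. I tried it, and the results are about the same as yours, but it is a bit more cumbersome to use. There also seems to be a Tampermonkey userscript that can do this, Wenku Doc Download, if I remember correctly. It crawls rather slowly, but it works. You are all so impressive; how do you manage to build programs that are actually useful? By the way, have you graduated from university yet?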

@kerm-me
Owner

kerm-me commented Aug 13, 2022

Not yet. I'll take a look at that later.

@lion-no-back
Author

I'm going to study hard and stop slacking off.

@dedicateSky

dedicateSky commented Sep 29, 2022

There is a bug on line 16 of tools.py: files fetched from Docin (豆丁) are missing the first page.
To fix it, change the class filter:
divs = page.query_selector_all("//div[contains(@class,'model panel scrollLoading')]")
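
For anyone wondering why a `contains(@class, ...)` filter can pick up a page that a stricter filter drops, here is a minimal sketch using only the Python standard library. It is not the code from tools.py, and the original filter on line 16 is not shown in this thread. The sample markup, the extra `first` class, and the names `DivCollector` and `matching_div_ids` are all assumptions made up for illustration. The only point it demonstrates is that XPath `contains()` does a plain substring test on the raw attribute value.

```python
from html.parser import HTMLParser

# Hypothetical markup: the class names on the real Docin page may differ.
# The assumption here is that the first page carries an extra class.
SAMPLE_HTML = """
<div class="model panel scrollLoading first" id="page_1"></div>
<div class="model panel scrollLoading" id="page_2"></div>
<div class="model panel scrollLoading" id="page_3"></div>
<div class="sidebar panel" id="ads"></div>
"""


class DivCollector(HTMLParser):
    """Collect the ids of <div> elements selected by their class attribute."""

    def __init__(self, needle, exact=False):
        super().__init__()
        self.needle = needle
        self.exact = exact
        self.matched_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attr_map = dict(attrs)
        class_value = attr_map.get("class") or ""
        if self.exact:
            # Strict filter: the whole attribute value must equal the needle.
            matched = class_value == self.needle
        else:
            # Same test as XPath contains(@class, needle): substring match.
            matched = self.needle in class_value
        if matched:
            self.matched_ids.append(attr_map.get("id"))


def matching_div_ids(html, needle, exact=False):
    """Return the ids of matching divs, in document order."""
    parser = DivCollector(needle, exact=exact)
    parser.feed(html)
    parser.close()
    return parser.matched_ids


if __name__ == "__main__":
    print(matching_div_ids(SAMPLE_HTML, "model panel scrollLoading", exact=True))
    print(matching_div_ids(SAMPLE_HTML, "model panel scrollLoading"))
```

With this made-up markup, the exact comparison skips `page_1` while the substring test keeps all three pages. Whether that is what actually happens on Docin would need to be checked against the live page.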


4 participants