Python Crawler Study Notes: A Single-Threaded Crawler


Introduction

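This article shows how to scrape course information from Maiziedu (麦子学院). The crawler is still a single-threaded one. Before getting started, here is a preview of the results: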

[Figure: sample of the crawl results]

Eager to try it yourself? First open the Maiziedu website, then find its full course list, which looks like this:

[Figure: the full course list on the Maiziedu site]

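Now page through the list and watch how the URL changes. The first page is http://www.maiziedu.com/course/list/, the second page is http://www.maiziedu.com/course/list/all-all/0-2/, and the third page is http://www.maiziedu.com/course/list/all-all/0-3/. Each time you move forward one page, the number after "0-" increases by 1. That raises the question: what about the first page? If we put http://www.maiziedu.com/course/list/all-all/0-1/ into the browser's address bar, it opens the first page. That settles it: with re.sub() we can rewrite the page number in the URL and reach any page we like.

Once we have the URLs, the next step is to get the page source. Right-click the page and choose Inspect (or Inspect Element), and you will see the following view: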

[Figure: the page source in the browser's element inspector]
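
The URL pattern described above can be sketched in a few lines. This is a minimal illustration, not the author's code: the function name build_page_urls and the page count of 3 are assumptions chosen for the example, and no network request is made.

```python
import re


def build_page_urls(url, total_pages):
    """Return listing URLs from the page named in `url` up to `total_pages`.

    `build_page_urls` is a hypothetical helper used only for illustration.
    """
    pattern = r'/0-(\d+)/'
    match = re.search(pattern, url)
    if match is None:
        raise ValueError("URL has no /0-<page>/ segment: %r" % url)
    first_page = int(match.group(1))
    # Swap the page number in the URL for each page in the range.
    return [re.sub(pattern, '/0-%d/' % page, url)
            for page in range(first_page, total_pages + 1)]


urls = build_page_urls("http://www.maiziedu.com/course/list/all-all/0-1/", 3)
for u in urls:
    print(u)
```

Starting from the "0-1" form of the first page keeps every page on the same URL pattern, so a single substitution covers all of them.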

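Once you have located the courses in the page source, regular expressions make it easy to extract the content we need. How exactly to extract it is up to you: working out the patterns yourself is how you learn the most. If you really cannot figure it out, read on and look at my source code.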
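
As a warm-up for the extraction step, here is a minimal sketch that runs against a made-up HTML fragment. Only the class name zy_course_list comes from the article's code; the markup inside each list item, the function name extract_courses and the title/href attributes are assumptions for illustration, so the real page will likely need different patterns.

```python
import re

# Hypothetical markup: only the class name zy_course_list is from the article.
SAMPLE_HTML = """
<ul class="zy_course_list">
  <li><a title="Python Basics" href="/course/1/">Python Basics</a><span>120 learners</span></li>
  <li><a title="Web Scraping" href="/course/2/">Web Scraping</a><span>85 learners</span></li>
</ul>
"""


def extract_courses(source):
    """Return one dict per <li> found inside the course list."""
    # re.S lets "." match newlines, so the pattern can span several lines.
    block = re.search(r'<ul class="zy_course_list">(.*?)</ul>', source, re.S)
    if block is None:
        return []
    courses = []
    for item in re.findall(r'<li>(.*?)</li>', block.group(1), re.S):
        title = re.search(r'title="(.*?)"', item)
        link = re.search(r'href="(.*?)"', item)
        courses.append({
            'title': title.group(1) if title else None,
            'link': link.group(1) if link else None,
        })
    return courses


courses = extract_courses(SAMPLE_HTML)
for course in courses:
    print(course['title'], course['link'])
```

Narrowing the search to the list first, then to each item, keeps the individual patterns short and makes a failed match easy to trace.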

Source code

    import re

    import requests


    class spider():
        def __init__(self):
            print("Start crawling...")

        def changePage(self, url, total_page):
            # Read the current page number from the URL, e.g. /0-1/ gives 1.
            nowpage = int(re.search(r'/0-(\d+)/', url, re.S).group(1))
            pagegroup = []
            for i in range(nowpage, total_page + 1):
                # flags must be passed by keyword; the fourth positional
                # argument of re.sub() is count, not flags.
                link = re.sub(r'/0-(\d+)/', '/0-%s/' % i, url, flags=re.S)
                pagegroup.append(link)
            return pagegroup

        def getsource(self, url):
            # Download one page and return its source as text.
            html = requests.get(url)
            return html.text

        def getclasses(self, source):
            # Cut out the <ul> block that holds the whole course list.
            classes = re.search('<ul class="zy_course_list">(.*?)</ul>', source, re.S).group(1)
            return classes

        def geteach(self, classes):
            # Split the course list into one chunk per <li> item.
            eachclasses = re.findall('<li>(.*?)</li>', classes, re.S)
            return eachclasses

        def getinfo(self, eachclass):
            info = {}
            info['title'] = re.search(
            # ... the listing breaks off here in the source

    import requests

    html = requests.get("http://gupowang.baijia.baidu.com/article/283878")
    html.encoding = 'utf-8'
    print(html.text)

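The first line imports the requests library, the next uses its get method to fetch the page source, then the encoding is set, and the last line prints the text.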
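
To see why setting the encoding matters, here is a small sketch using only the standard library. It is an illustration under assumptions: the sample string and the choice of latin-1 as the "wrong" codec are made up for the example. The same bytes produce readable text with the right codec and garbage with the wrong one, which is what happens to html.text if the encoding is guessed incorrectly.

```python
# The bytes a server might send for a short piece of Chinese text.
raw = "爬虫".encode("utf-8")

right = raw.decode("utf-8")    # decoded with the matching codec
wrong = raw.decode("latin-1")  # decoded with a mismatched codec (assumed for illustration)

print(right)
print(repr(wrong))
```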
Save the fetched page source to a text file:

    import requests

    html = requests.get("http://gupowang.baijia.baidu.com/article/283878")
    html.encoding = 'utf-8'
    # Write as UTF-8 explicitly; the with block closes the file when done.
    with open("news.txt", "w", encoding="utf-8") as html_file:
        print(html.text, file=html_file)
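
The saving step can also be tried without any network access. The sketch below is an assumption-laden illustration: save_text and load_text are hypothetical helper names, and the HTML string is a stand-in for a downloaded page. It writes to a temporary directory and reads the file back to check that non-ASCII text survives the round trip when the encoding is given explicitly.

```python
import os
import tempfile
from pathlib import Path


def save_text(text, path):
    """Write text to path as UTF-8 (hypothetical helper)."""
    Path(path).write_text(text, encoding="utf-8")
    return path


def load_text(path):
    """Read the file back as UTF-8 (hypothetical helper)."""
    return Path(path).read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "news.txt")
    save_text("<html>单线程爬虫</html>", target)  # stand-in for html.text
    restored = load_text(target)

print(restored)
```

Without an explicit encoding, Python falls back to a platform-dependent default, which on some systems cannot represent Chinese characters.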
