1 Problem Description
The starting page, the encyclopedia entry for Python, contains many links to other entries. Following the links between pages, visit 1000 encyclopedia entries.
For each entry, fetch its title and summary.
2 Discussion
The plan is simple: first fetch the page source, then parse out the data we want.
Here we fetch the page source with urllib (or the requests library), then parse it with BeautifulSoup.
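A minimal sketch of this fetch-then-parse pattern (the start URL matches the implementation below; a real crawl may also need request headers, which are omitted here):

from urllib import request
from bs4 import BeautifulSoup

# Fetch the raw HTML of one entry page
resp = request.urlopen('https://baike.baidu.com/item/python')
html = resp.read().decode('utf-8')

# Parse it into a searchable tree
soup = BeautifulSoup(html, 'html.parser')
print(soup.title)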
Inspecting the page source shows that the title lives under the <h1> tag,
and the summary lives under the div whose class is lemma-summary.
It also shows that links to other entries all follow the form https://baike.baidu.com/item/xxx.
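These three observations translate directly into BeautifulSoup lookups; a sketch, assuming soup was built as in the snippet above:

import re

# Title: text of the <h1> tag
title = soup.find('h1').text

# Summary: text of the div with class "lemma-summary"
summary = soup.find('div', class_='lemma-summary').text

# Links to other entries: every <a> whose href contains /item/
links = soup.find_all('a', href=re.compile(r'/item/'))
urls = {'https://baike.baidu.com' + a['href'] for a in links}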
3 Implementation
# coding=utf-8
from urllib import request
from bs4 import BeautifulSoup
import re
import traceback
import time

url_new = set()   # URLs discovered but not yet visited
url_old = set()   # URLs already visited
start_url = 'https://baike.baidu.com/item/python'
max_url = 1000

def add_url(url):
    # Stop collecting once the crawl budget is reached
    if len(url_new) + len(url_old) > max_url:
        return
    if url not in url_old and url not in url_new:
        url_new.add(url)

def get_url():
    url = url_new.pop()
    url_old.add(url)
    return url

def parse_title_summary(page):
    soup = BeautifulSoup(page, 'html.parser')
    title = soup.find('h1').text
    node = soup.find('div', class_='lemma-summary')
    summary = node.text
    return title, summary

def parse_url(page):
    # Collect every link that points at another entry (/item/...)
    soup = BeautifulSoup(page, 'html.parser')
    links = soup.find_all('a', href=re.compile(r'/item/'))
    res = set()
    baikeprefix = 'https://baike.baidu.com'
    for i in links:
        res.add(baikeprefix + i['href'])
    return res

def write2log(text, name='d:/baike.log'):  # original log path was truncated; filename is a placeholder
    with open(name, 'a+', encoding='utf-8') as fp:
        fp.write('\n')
        fp.write(text)

if __name__ == '__main__':
    add_url(start_url)
    print('working')
    time_begin = time.time()
    count = 1
    while url_new:
        url = get_url()
        try:
            resp = request.urlopen(url)
            text = resp.read().decode()
            write2log('.'.join(parse_title_summary(text)))
            urls = parse_url(text)
            for i in urls:
                add_url(i)
            print(str(count), 'ok')
            count += 1
        except Exception:
            traceback.print_exc()
            print(url)
    time_end = time.time()
    print('time elapsed: ', time_end - time_begin)
    print('the end.')
Output
working
1 ok
...(omitted)...
983 ok
984 ok
time elapsed: ...
the end.
Now replace urllib with the third-party requests library:
pip install requests
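Before swapping the main loop, here is a minimal side-by-side of the two fetch styles (a sketch; the URL is the crawl's start page):

from urllib import request
import requests

url = 'https://baike.baidu.com/item/python'

# urllib returns raw bytes; decode manually
text_urllib = request.urlopen(url).read().decode('utf-8')

# requests decodes to str automatically (based on the response headers)
text_requests = requests.get(url).text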
# ... (everything above the __main__ block is unchanged)
import requests

if __name__ == '__main__':
    add_url(start_url)
    print('working')
    time_begin = time.time()
    count = 1
    while url_new:
        url = get_url()
        try:
            with requests.Session() as s:
                resp = s.get(url)
            text = resp.text  # requests decodes for us; defaults to 'utf-8' here
            write2log('.'.join(parse_title_summary(text)))
            urls = parse_url(text)
            for i in urls:
                add_url(i)
            print(str(count), 'ok')
            count += 1
        except Exception:
            traceback.print_exc()
            print(url)
    time_end = time.time()
    print('time elapsed: ', time_end - time_begin)
    print('the end.')
Output
...(omitted)...
986 ok
987 ok
988 ok
989 ok
time elapsed: ...
the end.
Stepping back, a general crawler architecture consists of four parts, all present in the program above: a URL manager (the url_new/url_old sets), a downloader (urllib or requests), a parser (BeautifulSoup), and an outputter (write2log). A skeleton of this split follows.
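This skeleton is illustrative only; the class and method names are not from the original program:

from urllib import request

class Crawler:
    """Illustrative skeleton of the four-part crawler architecture."""

    def __init__(self, start_url, limit):
        # 1. URL manager: tracks what is pending and what is done
        self.todo, self.done = {start_url}, set()
        self.limit = limit

    def download(self, url):
        # 2. Downloader: fetch raw HTML
        return request.urlopen(url).read().decode('utf-8')

    def parse(self, page):
        # 3. Parser: return (record, set_of_new_urls); site-specific
        raise NotImplementedError

    def output(self, record):
        # 4. Outputter: persist or print the extracted record
        print(record)

    def run(self):
        while self.todo and len(self.done) < self.limit:
            url = self.todo.pop()
            self.done.add(url)
            try:
                record, new_urls = self.parse(self.download(url))
                self.output(record)
                self.todo |= new_urls - self.done
            except Exception:
                continue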