Searching the Internet Elegantly with the Google Search API

How do I keep up with the latest information? I've found search engines to be a great source, especially when you come up with a set of keywords and fetch results for them programmatically, on a regular, automated schedule, much like an RSS subscription.

Initially, I built a browser-automation tool on top of qutebrowser to scrape results for a long list of keywords on a schedule. But Google's anti-scraping measures are quite aggressive, and my traffic was quickly flagged as abnormal.

Later, I learned that Google offers a Search API that lets developers access search results programmatically.

It also comes with a free tier: 100 API calls per day are free, after which each additional 1,000 queries costs $5, up to a maximum of 10,000 queries per day. That quota is plenty for my personal use.

So I decided to switch to the official API. This post documents my process. If you have ideas or suggestions, feel free to share them in the comments!


Video Tutorial

Fortunately, I found a video tutorial that walks through exactly what this post sets out to do:

Below is a rough outline of the process as I noted it down; see the original video for the details:

Related documentation:


API

Documentation: 📑 Google Search API Documentation

Request parameters:
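As a quick illustration of the main parameters (`key`, `cx`, `q`, `num`, `start`, `dateRestrict`), here is a minimal sketch that assembles the request URL without sending it. The `key` and `cx` values below are placeholders; substitute your own API key and Programmable Search Engine ID.

```python
import requests

# Placeholder credentials; replace with your own.
params = {
    "key": "YOUR_API_KEY",
    "cx": "YOUR_SEARCH_ENGINE_ID",
    "q": "python web scraping",
    "num": 10,             # results per page (the API caps this at 10)
    "start": 1,            # 1-based index of the first result
    "dateRestrict": "m1",  # only results from the past month
}

# Prepare the request to inspect the final URL without hitting the API.
prepared = requests.Request(
    "GET", "https://www.googleapis.com/customsearch/v1", params=params
).prepare()
print(prepared.url)
```

Preparing the request this way is a convenient, quota-free way to check that your parameters are encoded as you expect before making real calls.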


Code Implementation

The video in the previous section provides a code implementation that depends only on requests (pandas is optional). I transcribed the code from the video below:

import re
import requests
import pandas as pd

# Replace with your own credentials from the Google Cloud Console
# and the Programmable Search Engine control panel.
API_KEY = 'YOUR_API_KEY'
SEARCH_ENGINE_ID = 'YOUR_SEARCH_ENGINE_ID'

def build_payload(query, start=1, num=10, date_restrict='m1', **params):
	"""
	Build the payload for the Google Search API request.

	:param query: Search term
	:param start: The index of the first result to return (1-based)
	:param num: Number of results to return per page (max 10)
	:param date_restrict: Restricts results based on recency (default is one month 'm1')
	:param params: Additional parameters for the API request,
		e.g. linkSite (require results to link to a particular URL)
		or searchType ('IMAGE' for image search)

	:return: Dictionary containing the API request parameters
	"""
	payload = {
		'key': API_KEY,
		'q': query,
		'cx': SEARCH_ENGINE_ID,
		'start': start,
		'num': num,
		'dateRestrict': date_restrict
	}
	payload.update(params)
	return payload

def make_request(payload):
	"""
	Send a GET request to the Google Search API and handle potential errors.

	:param payload: Dictionary containing the API request parameters
	:return: JSON response from the API
	"""
	response = requests.get('https://www.googleapis.com/customsearch/v1', params=payload)
	if response.status_code != 200:
		raise Exception(f'Request failed with status code {response.status_code}')
	return response.json()

def main(query, result_total=10):
	"""
	Main function to execute the script
	"""
	items = []
	remainder = result_total % 10
	if remainder > 0:
		pages = (result_total // 10) + 1
	else:
		pages = result_total // 10

	for i in range(pages):
		# The API's `start` parameter is 1-based: page i begins at i*10 + 1.
		if pages == i + 1 and remainder > 0:
			payload = build_payload(query, start=i*10+1, num=remainder)
		else:
			payload = build_payload(query, start=i*10+1)
		response = make_request(payload)
		items.extend(response.get('items', []))
	query_string_clean = clean_filename(query)
	# save items to file using pandas
	df = pd.DataFrame(items)
	df.to_csv(f'search_results_{query_string_clean}.csv', index=False)
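The code above calls a `clean_filename` helper that doesn't appear in the transcription. A minimal sketch of what it might look like, using the `re` module the code already imports (this is my guess at its behavior, not the video's original):

```python
import re

def clean_filename(query):
	# Replace any run of characters unsafe in filenames with a single
	# underscore, then trim leading/trailing underscores.
	return re.sub(r'[^\w\-]+', '_', query).strip('_')
```

For example, `clean_filename('google search api: "test"')` yields `'google_search_api_test'`, which is safe to embed in a CSV filename.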

Return Value

The API response contains an `items` field holding the search results. Each element has the following fields:
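For reference, each entry in `items` is a JSON object whose commonly used fields include `title`, `link`, `snippet`, and `displayLink`. The sample values below are made up for illustration:

```python
# A hypothetical sample of what one entry in `items` looks like
# (field names per the Custom Search JSON API; values invented).
sample_items = [
    {
        "title": "Example Domain",
        "link": "https://example.com/",
        "snippet": "This domain is for use in illustrative examples.",
        "displayLink": "example.com",
    },
]

# Flatten the entries into (title, link, snippet) tuples,
# ready to load into a pandas DataFrame or write to CSV.
rows = [(it["title"], it["link"], it["snippet"]) for it in sample_items]
```

This is the shape the `main` function's pandas export step operates on.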


Other Online Resources

Programmatically searching google in Python using custom search - Stack Overflow

What are the alternatives now that the Google web search API has been deprecated? - Stack Overflow

googleapis/google-api-python-client: 🐍 The official Python client library for Google's discovery based APIs.

bisohns/search-engine-parser: Lightweight package to query popular search engines and scrape for result titles, links and descriptions

web scraping - Is it ok to scrape data from Google results? - Stack Overflow


Author: Maeiee

Link to this post: Searching the Internet Elegantly with the Google Search API

Copyright notice: Unless otherwise stated, this is an original article. Copyright belongs to Maeiee; reproduction without permission is prohibited!


If you enjoy my articles, feel free to leave a tip to encourage me to create more and better work!