🗒️python sso爬虫登陆且并发获取数据
2023-10-29
| 2023-11-1
0  |  0 分钟
type
status
date
slug
summary
tags
category
icon
password
需求:需要获取多个应用appid接口的某json下的key和value,需要sso登陆
大致思路:
  • appid是一个列表,循环请求接口
  • appid如果过多,最好使用并发
  • sso登陆拿到cookie

1.下面是需求实现

💡
import gevent from gevent.pool import Pool from gevent import monkey monkey.patch_all()
使用多任务尽量把这些写在最前面,避免出现MonkeyPatchWarning: Monkey-patching ssl after ssl has already been imported may lead to errors, including RecursionError
💡
Gevent 是一个第三方库,可以轻松通过gevent实现并发同步或异步编程,在gevent中用到的主要模式是Greenlet, 它是以C扩展模块形式接入Python的轻量级协程。 Greenlet全部运行在主程序操作系统进程的内部,但它们被协作式地调度。
gevent可以检测io,实现遇到io自动切换另外一个任务
 
💡
sso通常只需要登陆一次,如果确认代码没问题,但是请求api接口一直提示nologin,可以考虑是否还需要登录另外的sso页面。通常这种是做了兼容处理:第一次的sso需要post登录,第二次sso直接使用get拿到cookie,第三次才使用cookie去请求接口
 
import gevent from gevent.pool import Pool from gevent import monkey monkey.patch_all() import json,time import requests #jsondeepfind 本地写的一个递归查询json的函数,最终得到的是一个字典。可以自行实现。 from jsondeepfind import JsonDeepFinder # 设置请求头 headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36' } appids =[] def sso_login(username, password, url1, url2): # 设置登录信息,包括用户名、密码和_csrf_token等 login_data = { 'username': username, 'password': password } # 创建Session对象 session = requests.Session() # 发送POST请求登录系统,保存Cookie response1 = session.post(url1, headers=headers, data=login_data) #print(response1.text) response1.raise_for_status() # 抛出异常,如果登录失败 # 访问url2并获取响应数据。通常有第一个sso登陆拿到cookie即可直接访问api接口,但是如果你一直遇到“nologin”报错,考虑是否还需要登录一次,可以在浏览器debug按请求顺序查看。 response2 = session.get(url2, headers=headers) #print(response2.text) response2.raise_for_status() # 抛出异常,如果访问失败 return session def get_api(session,url,appid,rs_list): tmpdict = {} tmpdict["appid"] = appid # 访问url3并获取响应数据,实际要访问的接口 response3 = session.get(url, headers=headers) response3.raise_for_status() # 抛出异常,如果访问失败 # 关闭Session对象,释放资源 session.close() jdata = json_read(response3) tmpdict.update(jdata) #print(tmpdict) rs_list.append(tmpdict) return rs_list def main(username,password, url1, url2,appids): s = sso_login(username, password, url1, url2) time_l = time.time() g_l = [] pool = Pool(60) rs_list = [] for appid in appids: url = generate_api_url(appid) #尽量把主要的请求、构造新数据等需求放在一个函数里,供pool直接调用并挂起 #pool把并发限制在60个 g1 = pool.spawn(get_api,session=s,url=url,appid=appid,rs_list=rs_list) g_l.append(g1) gevent.joinall(g_l) time_r = time.time() print("总耗时:{}".format(time_r-time_l)) print(rs_list) def json_read(response): j = json.loads(response.text) myjsondeepfinder = JsonDeepFinder() keys = ["yourkey"] out = myjsondeepfinder.extract_keys(j,keys) return out def generate_api_url(appid): api_url = 'http://realapi/' + appid +'/jiekou' print(api_url) return api_url if __name__ == '__main__': username = yourusername password = yourpwd appids = ["123456"] url1 = 'https://sso1.com/login/' #第二个sso登陆通常不存在,根据实际删减 url2 = 'https://sso2.com/login/' main(username,password,url1,url2,appids)

2.取出来的数据写入Excel

核心代码
import pandas as pd from openpyxl import Workbook from openpyxl.styles import Font, Border, Side, PatternFill, Alignment def excelCreate(values,tbname,columns): # 创建DataFrame df = pd.DataFrame(values, columns=columns) # 创建Excel工作簿 wb = Workbook() ws = wb.active # 自定义表格样式 header_font = Font(name='Calibri', size=12, bold=True, color='FFFFFF') header_fill = PatternFill(start_color='2F75B5', end_color='2F75B5', fill_type='solid') header_border = Border(left=Side(border_style='thin', color='FFFFFF'), right=Side(border_style='thin', color='FFFFFF'), top=Side(border_style='thin', color='FFFFFF'), bottom=Side(border_style='thin', color='FFFFFF')) cell_font = Font(name='Calibri', size=12, color='000000') cell_border = Border(left=Side(border_style='thin', color='000000'), right=Side(border_style='thin', color='000000'), top=Side(border_style='thin', color='000000'), bottom=Side(border_style='thin', color='000000')) cell_alignment = Alignment(horizontal='center', vertical='center') # 写入表头 for col_num, column_title in enumerate(df.columns, 1): cell = ws.cell(row=1, column=col_num) cell.value = column_title cell.font = header_font cell.fill = header_fill cell.border = header_border # 写入数据 for row_num, row_data in enumerate(df.values, 2): for col_num, cell_value in enumerate(row_data, 1): cell = ws.cell(row=row_num, column=col_num) cell.value = cell_value cell.font = cell_font cell.border = cell_border cell.alignment = cell_alignment # 自适应列宽 for col in ws.columns: max_length = 0 column = col[0].column_letter # Get the column name for cell in col: try: if len(str(cell.value)) > max_length: max_length = len(cell.value) except: pass adjusted_width = (max_length + 2) ws.column_dimensions[column].width = adjusted_width # 保存Excel文件 wb.save(tbname+'.xlsx') print("excel done") def values_create(clist,columns): values = [] for item in clist: tmp_list = [] for i in columns: tmp_list.append(item[i]) tup = tuple(tmp_list) values.append(tup) #print(values) return values
💡
# 创建DataFrame ,使用pd.DataFrame构造表格,values是元组组成的list,columns是表头list df = pd.DataFrame(values, columns=columns)
#如果要在创建表格同时改某些列的值类型,可以使用astype
df = pd.DataFrame(values, columns=columns).astype({'col_1': float, 'col_3': 'datetime64'})表示将col_1列的数据类型转换为float类型,将col_3列的数据类型转换为datetime类型。
需要注意的是,指定列名时需要确保列名存在,否则会抛出KeyError异常。如果希望对所有列的数据类型进行转换,可以省略字典的key,直接指定目标数据类型。例如,astype(float)表示将所有列的数据类型都转换为float类型。
完整代码实现:爬虫获取接口数据,写入Excel
import gevent from gevent.pool import Pool from gevent import monkey monkey.patch_all() import json,time import requests from jsondeepfind import JsonDeepFinder import pandas as pd from openpyxl import Workbook from openpyxl.styles import Font, Border, Side, PatternFill, Alignment # 设置请求头 headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36' } appids =[] def sso_login(username, password, url1, url2): # 设置登录信息,包括用户名、密码和_csrf_token等 login_data = { 'username': username, 'password': password } # 创建Session对象 session = requests.Session() # 发送POST请求登录系统,保存Cookie response1 = session.post(url1, headers=headers, data=login_data) #print(response1.text) response1.raise_for_status() # 抛出异常,如果登录失败 # 判断请求返回的状态码 try: response1.raise_for_status() except requests.exceptions.HTTPError as e: print('请求失败:%s,%s' % (url1,e)) # 访问url2并获取响应数据,qunar的登陆页 response2 = session.get(url2, headers=headers) # 判断请求返回的状态码 try: response2.raise_for_status() except requests.exceptions.HTTPError as e: print('请求失败:%s,%s' % (url2,e)) return session def get_api(session,url,appid,rs_list): # 访问url3并获取响应数据,实际要访问的接口 response3 = session.get(url, headers=headers) # 关闭Session对象,释放资源 session.close() # 判断请求返回的状态码 try: response3.raise_for_status() tmpdict = {} tmpdict["appid"] = appid jdata = json_read(response3) tmpdict.update(jdata) # print(tmpdict) rs_list.append(tmpdict) return rs_list except requests.exceptions.HTTPError as e: print('请求失败:%s,%s' % (url,e)) def main(username,password, url1, url2,appids): s = sso_login(username, password, url1, url2) g_l = [] pool = Pool(60) rs_list = [] time_l = time.time() for appid in appids: url = generate_api_url(appid) g1 = pool.spawn(get_api,session=s,url=url,appid=appid,rs_list=rs_list) g_l.append(g1) gevent.joinall(g_l) time_r = time.time() print("总耗时:{}".format(time_r-time_l)) #print(rs_list) return rs_list def json_read(response): j = json.loads(response.text) myjsondeepfinder = JsonDeepFinder() keys = ["yourkey"] out = myjsondeepfinder.extract_keys(j,keys) return out def generate_api_url(appid): api_url = 'http://realapi/' + appid +'/do' #print(api_url) return api_url #columns值的数量要和values值数量一致 def excelCreate(values,tbname,columns): # 创建DataFrame df = pd.DataFrame(values, columns=columns) # 创建Excel工作簿 wb = Workbook() ws = wb.active # 自定义表格样式 header_font = Font(name='Calibri', size=12, bold=True, color='FFFFFF') header_fill = PatternFill(start_color='2F75B5', end_color='2F75B5', fill_type='solid') header_border = Border(left=Side(border_style='thin', color='FFFFFF'), right=Side(border_style='thin', color='FFFFFF'), top=Side(border_style='thin', color='FFFFFF'), bottom=Side(border_style='thin', color='FFFFFF')) cell_font = Font(name='Calibri', size=12, color='000000') cell_border = Border(left=Side(border_style='thin', color='000000'), right=Side(border_style='thin', color='000000'), top=Side(border_style='thin', color='000000'), bottom=Side(border_style='thin', color='000000')) cell_alignment = Alignment(horizontal='center', vertical='center') # 写入表头 for col_num, column_title in enumerate(df.columns, 1): cell = ws.cell(row=1, column=col_num) cell.value = column_title cell.font = header_font cell.fill = header_fill cell.border = header_border # 写入数据 for row_num, row_data in enumerate(df.values, 2): for col_num, cell_value in enumerate(row_data, 1): cell = ws.cell(row=row_num, column=col_num) cell.value = cell_value cell.font = cell_font cell.border = cell_border cell.alignment = cell_alignment # 自适应列宽 for col in ws.columns: max_length = 0 column = col[0].column_letter # Get the column name for cell in col: try: if len(str(cell.value)) > max_length: max_length = len(cell.value) except: pass adjusted_width = (max_length + 2) ws.column_dimensions[column].width = adjusted_width # 保存Excel文件 wb.save(tbname+'.xlsx') print("excel done") def values_create(clist,columns): values = [] for item in clist: tmp_list = [] for i in columns: tmp_list.append(item[i]) tup = tuple(tmp_list) values.append(tup) #print(values) return values if __name__ == '__main__': username = yourusername password = yourpwd appids = ["123456"] url1 = 'https://sso1.com/login/' #第二个sso登陆通常不存在,根据实际删减 url2 = 'https://sso2.com/login/' rs_list = main(username,password,url1,url2,appids) #下面是写入Excel,column_title是人为指定的,需要哪个字段就写一个 column_title1 = 'appid' column_title2 = 'yourkey' columns = [column_title1, column_title2] tbname = 'newfile' # 提取数据,values_create提取的值是直取,故有的数字可能以str传递(因为在接口里就是str) values = values_create(rs_list,columns) excelCreate(values,tbname,columns)
  • Python
  • python实现VLOOKUP功能python递归查询深层嵌套的json
    目录