Life and Data: Meeting Minutes

Tuesday, October 18, 2022

Obtaining National Assembly Standing Committee Minutes

The data on National Assembly members' standing committee activity is described at the link below.

https://open.assembly.go.kr/portal/data/service/selectAPIServicePage.do/OVW2NU000937WK15521#none

Request URL

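Calling https://open.assembly.go.kr/portal/openapi/nuvypcdgahexhvrjt with parameters such as the Assembly term number, session, sitting number, and committee appended returns information about standing committee minutes. The result is XML, and the minutes information is in the BILL_URL tag.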

<BILL_URL>http://likms.assembly.go.kr/record/popup_list.do?classCode=1&daeNum=20&sesNum=378&degreeNum=1&commCode=ZA&conferNum=049999</BILL_URL>
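As a minimal sketch, the BILL_URL value can be pulled out of the XML and split into its query parameters using only the standard library. The `<row>` wrapper in the sample is an assumption for illustration; only the BILL_URL element and its value come from the text above, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, parse_qs

# Hypothetical response fragment: only <BILL_URL> is taken from the text;
# the surrounding <row> wrapper is assumed for illustration.
SAMPLE_XML = (
    "<row><BILL_URL>http://likms.assembly.go.kr/record/popup_list.do"
    "?classCode=1&amp;daeNum=20&amp;sesNum=378&amp;degreeNum=1"
    "&amp;commCode=ZA&amp;conferNum=049999</BILL_URL></row>"
)


def bill_urls(xml_text):
    """Return the text of every BILL_URL element in the XML."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("BILL_URL") if el.text]


def query_params(url):
    """Return the query string of url as a dict of single values."""
    return {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}


url = bill_urls(SAMPLE_XML)[0]
params = query_params(url)
print(params["conferNum"], params["commCode"])
```

The conferNum found here is the same identifier that the download step needs later.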

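Opening this BILL_URL shows only an outline of the meeting agenda; you have to click an agenda item to download the standing committee minutes. Fetching one or two sets of minutes by clicking in a web browser is no trouble, but collecting all of the more than a thousand standing committee minutes calls for automation.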

The source of the page above shows that the download is triggered by calling fn_fileDown().


fn_fileDown() is implemented in this file.

https://likms.assembly.go.kr/record/res/js/mhs/mhs40/mhs40.js

// Download a preserved-appendix file
function fn_fileDown(conferNum,fileId){
	//
	try{
		//
		var $form    = "";
//	    	alert(conferNum+"/"+fileId);
        $form = $("<form></form>");
        $form.attr("action", "mhs-10-040-0040.do");
        $form.attr("target", "I_TARGET");
        $form.attr("method", "post");
        
        $form.append("<input name='conferNum' type='hidden' value='"+conferNum+"'>");
        $form.append("<input name='fileId' type='hidden' value='"+fileId+"'>");
        $form.append("<input name='deviceGubun' type='hidden' value='P'>");
        //
        $form.appendTo('body');
        //
        $form.submit();
		//
	}catch(e){
        alert("fn_fileDown" + "(" +e.name + ") : " + e.message);
    }
	//
}

Sending a POST request with conferNum, fileId, and deviceGubun returns the minutes as a PDF file.
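The form that fn_fileDown() submits can be reproduced roughly as follows. This is a sketch that only builds the request and does not send it. The endpoint is assumed to be the form action "mhs-10-040-0040.do" resolved against the site's /record/ path, and `build_download_request` and the "FILE_ID" value are hypothetical names used for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint: the form action "mhs-10-040-0040.do" resolved
# against the /record/ path on the BILL_URL host.
DOWNLOAD_URL = "http://likms.assembly.go.kr/record/mhs-10-040-0040.do"


def build_download_request(confer_num, file_id, device_gubun="P"):
    """Build (but do not send) the POST request that fn_fileDown() submits."""
    body = urlencode({
        "conferNum": confer_num,
        "fileId": file_id,
        "deviceGubun": device_gubun,  # fn_fileDown() hard-codes 'P'
    }).encode("ascii")
    return Request(DOWNLOAD_URL, data=body, method="POST")


# "FILE_ID" is a placeholder; real values come from the page's onclick handlers.
req = build_download_request("049999", "FILE_ID")
print(req.get_method(), req.full_url)
print(req.data)
```

Passing `req` to `urllib.request.urlopen()` would send it; whether the server also expects cookies or a Referer header is not something the text confirms.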

BILL_URL values can also be obtained in bulk from the National Assembly information bulk download page.

https://open.assembly.go.kr/portal/infs/list/infsListDownPage.do

The result is an Excel file in which column E is the meeting type and column F is the BILL_URL.

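The script below reads this Excel file, opens the BILL_URL of every standing committee meeting, uses BeautifulSoup to extract conferNum and the other arguments, and puts them in the POST data to download the PDF file. It also extracts the file name from the content-disposition header of the POST response and saves the PDF under that name.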

import re
import urllib.parse
from urllib.request import urlopen

import openpyxl
import requests
from bs4 import BeautifulSoup

DOWNLOAD_URL = "http://likms.assembly.go.kr/record/mhs-10-040-0040.do"

excel_file = openpyxl.load_workbook("데이터_국회의원 상임위 활동.xlsx")
sheet = excel_file.active
kinds = sheet['E']     # meeting type
contents = sheet['F']  # BILL_URL
for i in range(1, len(contents)):  # index 0 is the header row
    if kinds[i].value == "국회본회의":  # skip plenary sessions
        continue
    url = contents[i].value
    print(url)
    resp = urlopen(url)
    htm = resp.read().decode('euc-kr')
    history = []  # argument lists already downloaded from this page
    bs = BeautifulSoup(htm, "html.parser")
    for x in bs.find_all("a", onclick=True):
        # Capture everything between the parentheses of fn_fileDown(...)
        m = re.match(r"javascript:fn_fileDown\((.*?)\)", x['onclick'])
        if not m:
            continue
        nums = [a.strip().strip("'\"") for a in m.group(1).split(",")]
        if nums in history:
            continue
        history.append(nums)
        payload = {"conferNum": nums[0],
                   "fileId": nums[1],
                   # fn_fileDown() itself always sends 'P'; fall back to
                   # that if the page passes no third argument
                   "deviceGubun": nums[2] if len(nums) > 2 else "P"}
        r = requests.post(DOWNLOAD_URL, data=payload)

        if r.status_code == 200:
            d = r.headers['content-disposition']
            flist = re.findall("filename=(.+)", d)
            fname = flist[0]
            # replace %xx escapes with their single-character equivalent
            fname_2 = urllib.parse.unquote(fname, encoding='utf-8').replace(";", "")
            print(fname_2)
            with open(fname_2, "wb") as f:
                f.write(r.content)
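The file-name step can be isolated into a small helper that is easy to check without touching the network. This is a sketch: `filename_from_disposition` is a hypothetical name, and the sample header value is made up, since the server's exact content-disposition format is not shown in the post.

```python
import re
from urllib.parse import unquote


def filename_from_disposition(header_value):
    """Return the decoded file name from a Content-Disposition value, or None."""
    m = re.search(r"filename=([^;]+)", header_value)
    if not m:
        return None
    raw = m.group(1).strip().strip('"')
    # Replace %xx escapes with the characters they encode (UTF-8 assumed).
    return unquote(raw, encoding="utf-8")


# Hypothetical header value; the real server's exact format may differ.
sample = "attachment; filename=%ED%9A%8C%EC%9D%98%EB%A1%9D.pdf;"
print(filename_from_disposition(sample))
```

Returning None when no file name is present lets the caller skip a response instead of failing on an index error.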

Monday, September 26, 2022

Open National Assembly Information (열린국회정보)

Open National Assembly Information offers about 150 datasets on how the Assembly operates; of these, I want to fetch the plenary session minutes.

As described in the detailed OpenAPI documentation for the plenary session minutes, requesting the URL below, which includes a date (September 21, 2022),

https://open.assembly.go.kr/portal/openapi/nzbyfwhwaoanttzje?DAE_NUM=21&CONF_DATE=2022-09-21

returns the XML file shown in the figure below. Of its contents, only <TITLE>, which holds the meeting title, and <PDF_LINK_URL>, which gives the location of the minutes PDF, are needed.

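This Python program looks up the minutes from the first plenary session of the 21st National Assembly through September 21, 2022, and saves each one as a PDF file named after the meeting title.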

import urllib.request
import xml.etree.ElementTree as elemTree

# Zero-padded month and day strings ("01".."12" and "01".."31")
months = [str(m).zfill(2) for m in range(1, 13)]
days = [str(d).zfill(2) for d in range(1, 32)]
years = ["2020", "2021", "2022"]

base_url = "https://open.assembly.go.kr/portal/openapi/nzbyfwhwaoanttzje?DAE_NUM=21&CONF_DATE="

for y in years:
    for m in months:
        print(y + " : " + m)
        for d in days:
            if y == "2020" and m in ["01","02", "03", "04", "05"]:
                continue
            url = base_url + y + "-" + m + "-" + d             
            resp = urllib.request.urlopen(url)
            xml = resp.read().decode('utf-8')
            #print(xml)

            titles = []
            tree = elemTree.fromstring(xml)
            for t in tree.iter('TITLE'):
                if not t.text in titles:
                    titles.append(t.text)
                    
            links = []
            for p in tree.iter('PDF_LINK_URL'):
                if not p.text in links:
                    links.append(p.text)

            if len(titles) != len(links):
                print(url + ": not equal")
                break
            for ind in range(len(links)):
                print(links[ind])
                pdf_url = links[ind]
                urllib.request.urlretrieve(pdf_url, titles[ind]+".pdf")
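The nested year, month, and day loops also request dates that do not exist, such as 2021-02-31, and run past September 21, 2022. One possible alternative, sketched here, is to step through real calendar days with datetime. The start date of June 1, 2020 is an assumption based on the script skipping January through May 2020, and `daily_urls` is a hypothetical helper name.

```python
from datetime import date, timedelta

BASE_URL = ("https://open.assembly.go.kr/portal/openapi/"
            "nzbyfwhwaoanttzje?DAE_NUM=21&CONF_DATE=")


def daily_urls(start, end):
    """Yield one request URL per calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield BASE_URL + day.isoformat()  # isoformat() gives YYYY-MM-DD
        day += timedelta(days=1)


# Assumed range: June 1, 2020 (the script above skips January-May 2020)
# through September 21, 2022.
urls = list(daily_urls(date(2020, 6, 1), date(2022, 9, 21)))
print(len(urls), urls[0], urls[-1])
```

Each URL can then be fetched and parsed exactly as in the loop body above.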

The part of the minutes I am interested in is the speakers. Because every speaker line in the minutes starts with "◯", the speaker lines can be picked out on their own.

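The Python program below uses pdfminer to convert each PDF file to text, keeps only the lines that start with "◯" and contain "의원" (Assembly member), and counts how often each one appears.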

#-*- coding: utf-8 -*-
import re
from pdfminer.high_level import extract_text
import glob

for file in glob.glob("*.pdf"):
    text = extract_text(pdf_file=file)
    #print(text)
    lines = text.split('\n')
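    pattern1 = "^◯"    # speaker lines start with this mark
    pattern2 = "의원"  # "Assembly member"
    speakers = {}
    for line in lines:
        if re.search(pattern1, line) and re.search(pattern2, line):
            # Drop the leading mark and all spaces so variants count together
            speaker = line[1:].replace(" ", "")
            if speaker not in speakers:
                speakers[speaker] = 1
            else:
                speakers[speaker] += 1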
    print(file)
    for key, val in speakers.items():
        print(key, val)
    print("")
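The counting step can also be written with collections.Counter and tried on plain text, without pdfminer. This is a sketch with the same matching rule as above; `count_speakers` is a hypothetical name and the sample lines are invented for illustration, not taken from real minutes.

```python
import re
from collections import Counter

# A line qualifies if it starts with ◯ and contains 의원 somewhere after it.
SPEAKER_LINE = re.compile(r"^◯.*의원")


def count_speakers(text):
    """Count lines that start with ◯ and contain 의원, ignoring spaces."""
    counts = Counter()
    for line in text.split("\n"):
        if SPEAKER_LINE.search(line):
            counts[line[1:].replace(" ", "")] += 1
    return counts


# Hypothetical lines for illustration, not taken from real minutes.
sample = "◯홍길동 의원\n본문\n◯의장 김철수\n◯홍길동  의원\n"
for name, n in count_speakers(sample).most_common():
    print(name, n)
```

Counter.most_common() lists the most frequent speakers first, which the plain dict version does not do.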