Life and Data: Open National Assembly Information (열린국회정보)

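The Open National Assembly Information (열린국회정보) portal offers about 150 kinds of National Assembly operations data. Of these, we will fetch the plenary session minutes.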

As described in the detailed description of the plenary session minutes OpenAPI, opening the URL below, which contains a date (September 21, 2022),

https://open.assembly.go.kr/portal/openapi/nzbyfwhwaoanttzje?DAE_NUM=21&CONF_DATE=2022-09-21

returns the XML file shown in the figure below. Of its elements, only <TITLE>, which holds the meeting title, and <PDF_LINK_URL>, which gives the location of the minutes PDF, are needed.
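
As a minimal sketch of pulling out just those two elements, the snippet below parses a small hand-written XML string. The sample is hypothetical: its wrapper element names, title, and URL are made up for illustration, and it assumes that TITLE and PDF_LINK_URL sit side by side under the same parent element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real response has more fields and real values.
SAMPLE_XML = """<response>
  <row>
    <TITLE>Sample plenary session minutes</TITLE>
    <PDF_LINK_URL>https://example.com/minutes.pdf</PDF_LINK_URL>
  </row>
</response>"""


def extract_minutes(xml_text):
    """Return (title, pdf_url) pairs found in the XML text."""
    root = ET.fromstring(xml_text)
    pairs = []
    # Visit every element and keep those that directly contain both fields,
    # so the code does not depend on the name of the wrapper element.
    for parent in root.iter():
        title = parent.findtext("TITLE")
        link = parent.findtext("PDF_LINK_URL")
        if title and link:
            pairs.append((title, link))
    return pairs


for title, link in extract_minutes(SAMPLE_XML):
    print(title, link)
```

Reading both fields from the same parent keeps each title attached to its own link, instead of relying on two separately built lists lining up by position.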

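This Python program looks up the minutes from the first plenary session of the 21st National Assembly through September 21, 2022, and saves each one as a PDF file named after its meeting title.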

import datetime
import urllib.request
import xml.etree.ElementTree as elemTree

# Zero-padded month ("01".."12") and day ("01".."31") strings
months = [str(m).zfill(2) for m in range(1, 13)]
days = [str(d).zfill(2) for d in range(1, 32)]
years = ["2020", "2021", "2022"]

base_url = "https://open.assembly.go.kr/portal/openapi/nzbyfwhwaoanttzje?DAE_NUM=21&CONF_DATE="

for y in years:
    for m in months:
        # Skip January through May 2020
        if y == "2020" and m in ["01", "02", "03", "04", "05"]:
            continue
        print(y + " : " + m)
        for d in days:
            # Skip dates that do not exist, such as 02-30
            try:
                datetime.date(int(y), int(m), int(d))
            except ValueError:
                continue
            url = base_url + y + "-" + m + "-" + d
            resp = urllib.request.urlopen(url)
            xml = resp.read().decode('utf-8')
            #print(xml)

            # Collect the unique meeting titles in the response
            titles = []
            tree = elemTree.fromstring(xml)
            for t in tree.iter('TITLE'):
                if t.text not in titles:
                    titles.append(t.text)

            # Collect the unique PDF links in the response
            links = []
            for p in tree.iter('PDF_LINK_URL'):
                if p.text not in links:
                    links.append(p.text)

            # Titles and links are paired by position, so the counts must match;
            # on a mismatch, skip this date only and keep going
            if len(titles) != len(links):
                print(url + ": not equal")
                continue
            # Save each PDF under its meeting title
            for ind in range(len(links)):
                print(links[ind])
                pdf_url = links[ind]
                urllib.request.urlretrieve(pdf_url, titles[ind] + ".pdf")
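
An alternative to looping over separate year, month, and day strings is to step through real calendar dates with `datetime`, which never produces a date that does not exist. Meeting titles may also contain characters that are not allowed in file names. The sketch below shows one way to handle both. The helper names `iter_dates` and `safe_filename` are hypothetical, and the start date of June 1, 2020 is an assumption based on the program skipping January through May 2020.

```python
import re
from datetime import date, timedelta


def iter_dates(start, end):
    """Yield every calendar date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def safe_filename(title):
    """Replace characters that are invalid in common file systems."""
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip()


# Assumed range: June 1, 2020 through September 21, 2022.
dates = [d.isoformat() for d in iter_dates(date(2020, 6, 1), date(2022, 9, 21))]
print(dates[0], dates[-1], len(dates))

# isoformat() gives the YYYY-MM-DD form that CONF_DATE takes in the URL above.
print(safe_filename("A/B: C") + ".pdf")
```

Each string in `dates` could be appended to `base_url` in place of the `y + "-" + m + "-" + d` concatenation.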

The part of the minutes that interests us is who spoke. Because every speaker line in the minutes starts with "◯", the speakers alone can be picked out.

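The following Python program uses pdfminer to convert each PDF file to text, extracts only the lines that start with "◯" and contain "의원" (Assembly member), and counts how often each one occurs.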

#-*- coding: utf-8 -*-
import glob
import re
from pdfminer.high_level import extract_text

pattern1 = "^◯"    # speaker lines start with this mark
pattern2 = "의원"   # "의원" means Assembly member

for file in glob.glob("*.pdf"):
    # Convert the PDF to plain text and split it into lines
    text = extract_text(pdf_file=file)
    #print(text)
    lines = text.split('\n')
    speakers = {}
    for line in lines:
        if re.search(pattern1, line) and re.search(pattern2, line):
            # Drop the leading mark and remove all spaces
            speaker = line[1:].replace(" ", "")
            if speaker not in speakers:
                speakers[speaker] = 1
            else:
                speakers[speaker] += 1
    # Print the counts for this file
    print(file)
    for key, val in speakers.items():
        print(key, val)
    print("")
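
The same counting can be tried without any PDF files. The sketch below applies the rule (line starts with "◯" and contains "의원") to a short hand-written string and uses `collections.Counter` in place of the manual dictionary. The sample text and the names in it are placeholders, not taken from real minutes.

```python
import re
from collections import Counter

# A speaker line starts with "◯" and contains "의원" somewhere after it.
SPEAKER_LINE = re.compile(r"^◯(.*의원.*)$")


def count_speakers(text):
    """Count the lines that start with the mark and contain '의원'."""
    counts = Counter()
    for line in text.splitlines():
        match = SPEAKER_LINE.match(line)
        if match:
            # Remove spaces so spacing differences do not split the counts.
            counts[match.group(1).replace(" ", "")] += 1
    return counts


# Hypothetical text standing in for the output of extract_text().
SAMPLE = "◯홍길동 의원\n발언 내용\n◯의장 김철수\n◯홍길동 의원\n"

for speaker, n in count_speakers(SAMPLE).most_common():
    print(speaker, n)
```

`most_common()` returns the entries sorted by count, which is convenient when a long set of minutes has many speakers.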

Monday, September 26, 2022