How to Crawl the DART Electronic Disclosure System for Quantitative Stock Investing

KissCuseMe

2025-03-11

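This post walks through, step by step, how to crawl financial statement data from the DART electronic disclosure system using Python's requests, BeautifulSoup, and pandas libraries. Before using it in practice, keep in mind that the site's structure may change, and that sending too many requests can put a load on the server.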
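
As a minimal sketch of how to avoid burdening the server, the helper below enforces a minimum delay between consecutive requests. The `Throttle` class name and the one-second default are assumptions chosen for illustration; they are not part of the original script or of any DART guideline.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls (hypothetical helper)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first one

    def wait(self):
        """Sleep just long enough that calls are at least min_interval apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each `requests.get(...)` keeps the crawl at or below one request per `min_interval` seconds.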

1. Install the required libraries

pip install requests beautifulsoup4 pandas openpyxl

2. Example: searching DART disclosures and crawling financial statements

import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urljoin

# Search Criteria (e.g., Samsung Electronics (005930) Annual Report)
COMPANY_CODE = "005930"  # Stock Code
START_DATE = "20230101"  # Search Start Date (YYYYMMDD)
END_DATE = "20231231"    # Search End Date (YYYYMMDD)
REPORT_TYPE = "A001"     # A001: Annual Report, A002: Semi-Annual Report, A003: Quarterly Report

# DART Disclosure Search URL
SEARCH_URL = "http://dart.fss.or.kr/dsab001/search.ax"

def get_report_list():
    """Function to fetch the list of DART disclosure reports"""
    params = {
        "currentPage": 1,
        "maxResults": 10,
        "businessCode": COMPANY_CODE,
        "startDate": START_DATE,
        "endDate": END_DATE,
        "reportName": REPORT_TYPE
    }
    response = requests.get(SEARCH_URL, params=params, timeout=10)
    response.raise_for_status()  # Fail early on HTTP errors
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup.select(".table_list tr")[1:]  # Extract rows excluding the header

def extract_excel_url(report_url):
    """Function to extract the Excel file URL from the report page"""
    response = requests.get(report_url, timeout=10)
    response.raise_for_status()  # Fail early on HTTP errors
    soup = BeautifulSoup(response.text, 'html.parser')
    excel_link = soup.select_one("a[href*='download.xbrl']")
    if excel_link:
        return urljoin(report_url, excel_link['href'])
    return None

def download_excel(url):
    """Function to download the Excel file and convert it into a DataFrame"""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Do not save an error page as an Excel file
    with open("temp.xlsx", "wb") as f:
        f.write(response.content)
    return pd.read_excel("temp.xlsx", engine='openpyxl')

# Main Execution
if __name__ == "__main__":
    reports = get_report_list()
    for idx, report in enumerate(reports[:3]):  # Process up to 3 reports
        # Extract report title and link (skip rows that have no link)
        link = report.select_one("td:nth-child(3) a")
        if link is None:
            continue
        title = link.text.strip()
        report_url = urljoin(SEARCH_URL, link['href'])
        
        print(f"[{idx+1}] Extracting data from {title}...")
        
        # Extract Excel file URL and download
        excel_url = extract_excel_url(report_url)
        if excel_url:
            df = download_excel(excel_url)
            print(df.head())  # Check the data
        else:
            print("Excel file not found.")

3. Code walkthrough

Search settings:
- COMPANY_CODE: the stock code (e.g., Samsung Electronics = 005930)
- REPORT_TYPE: A001 (annual), A002 (semi-annual), A003 (quarterly)
- START_DATE and END_DATE limit the search to a date range.
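
To make the search settings less error-prone, the dates can be checked before any request is sent. The sketch below is only an illustration: `build_search_params` is a hypothetical helper, and the parameter names simply mirror those used in `get_report_list()` rather than a documented DART interface.

```python
from datetime import datetime


def parse_yyyymmdd(value):
    """Parse a strict 8-digit YYYYMMDD string; raise ValueError otherwise."""
    if len(value) != 8 or not value.isdigit():
        raise ValueError(f"expected YYYYMMDD, got {value!r}")
    return datetime.strptime(value, "%Y%m%d")


def build_search_params(company_code, start_date, end_date, report_type, page=1):
    """Return the query parameters after basic date checks (hypothetical helper)."""
    if parse_yyyymmdd(start_date) > parse_yyyymmdd(end_date):
        raise ValueError("start_date must not be later than end_date")
    return {
        "currentPage": page,
        "maxResults": 10,
        "businessCode": company_code,
        "startDate": start_date,
        "endDate": end_date,
        "reportName": report_type,
    }
```

A call such as `build_search_params("005930", "20230101", "20231231", "A001")` returns the same dictionary that the script builds by hand, while a reversed or malformed date range raises `ValueError` before any network traffic occurs.
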
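Fetching the report list:
- Sends a request to the DART disclosure search endpoint to get the list of reports.
- Parses the returned HTML with BeautifulSoup and extracts each report's title and link.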
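
Links in the result table may be relative, which is why the script passes them through `urljoin`. The short sketch below shows how that resolution behaves; the `href` values are made-up placeholders, not real DART paths.

```python
from urllib.parse import urljoin

SEARCH_URL = "http://dart.fss.or.kr/dsab001/search.ax"

# A root-relative href replaces the whole path of the base URL.
root_relative = urljoin(SEARCH_URL, "/example/report?id=123")

# A plain relative href only replaces the last path segment.
sibling = urljoin(SEARCH_URL, "report?id=123")

# An absolute href is returned unchanged.
absolute = urljoin(SEARCH_URL, "https://example.com/report")

print(root_relative)  # http://dart.fss.or.kr/example/report?id=123
print(sibling)        # http://dart.fss.or.kr/dsab001/report?id=123
print(absolute)       # https://example.com/report
```
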
Extracting the Excel file:
- Finds the link to the XBRL-format Excel file on each report page and downloads it.
- Reads the Excel file into a DataFrame with pandas.
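
Because an `.xlsx` file is a ZIP archive, one cheap safeguard is to confirm that the downloaded bytes really form a ZIP file before handing them to pandas, since a server might return an HTML error page instead. This is a sketch under that assumption, and `looks_like_xlsx` is a hypothetical helper name.

```python
import io
import zipfile


def looks_like_xlsx(content: bytes) -> bool:
    """Return True if the bytes form a ZIP archive, as every .xlsx file does.

    This does not prove the archive is a valid workbook; it only filters out
    obvious non-Excel responses such as HTML error pages.
    """
    return zipfile.is_zipfile(io.BytesIO(content))
```

In `download_excel`, such a check could run on `response.content` before the file is written to disk.
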
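Notes:
- Dynamic content: some pages may load content dynamically with JavaScript. In that case, Selenium may be needed.
- Data consistency: the structure of the Excel file can differ between companies, so column-mapping logic needs to be added.
- Legal restrictions: web crawling must comply with DART's terms of use.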
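
One way to approach the column-mapping logic mentioned in the data consistency note is to normalize each header and look it up in an alias table. The aliases below are invented for illustration only; real DART exports would need their own mapping, built by inspecting the actual files.

```python
# Hypothetical alias table: normalized header -> canonical column name
COLUMN_ALIASES = {
    "revenue": "revenue",
    "sales": "revenue",
    "netincome": "net_income",
    "profitfortheperiod": "net_income",
}


def normalize(header):
    """Lowercase a header and drop everything except letters and digits."""
    return "".join(ch for ch in str(header).lower() if ch.isalnum())


def build_rename_map(columns):
    """Map each known column to its canonical name; unknown columns are skipped."""
    rename_map = {}
    for column in columns:
        canonical = COLUMN_ALIASES.get(normalize(column))
        if canonical is not None:
            rename_map[column] = canonical
    return rename_map
```

The resulting dictionary can be passed to pandas as `df.rename(columns=build_rename_map(df.columns))`, which leaves unrecognized columns untouched so they can be reviewed manually.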

You can build further data preprocessing and quantitative analysis logic on top of this code.

