In today’s data-driven world, extracting meaningful insights from real-time financial news is a powerful skill. This guide walks you through a practical Python project: scraping live financial headlines from Jin10 (a popular financial news platform), storing them in a MySQL database, and generating a visually compelling word cloud to identify trending topics.
We'll cover the full workflow—from environment setup and API analysis to database integration and natural language processing—using core Python libraries like requests, pymysql, jieba, and wordcloud.
Environment Setup
Before diving into data extraction, ensure your Python environment is properly configured. The following libraries are essential for this project:
- requests: to send HTTP requests and fetch data from the Jin10 API.
- pymysql: for connecting Python to a MySQL database.
- jieba: a Chinese text segmentation tool, essential for analyzing non-English content.
- wordcloud: to generate visual word clouds from textual data.
- PIL (Pillow): for image processing, especially when using a background mask.
💡 Note: This project was developed with Python 3.6 on Windows. Some packages, like wordcloud, may fail to install via pip install wordcloud due to missing compiled binaries.
Installing Packages on Windows
If standard pip installation fails:
- Visit a trusted third-party repository such as Christoph Gohlke's Python Extension Packages.
- Download the .whl file matching your Python version and system architecture (e.g., wordcloud-1.8.1-cp36-cp36m-win32.whl for Python 3.6, 32-bit), then install it locally:
pip install path/to/downloaded/wordcloud-1.8.1-cp36-cp36m-win32.whl
This method reliably resolves dependency and compilation issues common in Windows environments.
API Parameter Analysis
To scrape dynamic content from Jin10's homepage, we analyze its AJAX-based "Load More" feature.
Step-by-Step Inspection:
- Open Jin10.com in your browser.
- Navigate to the homepage news feed.
- Click "Load More" while monitoring the Network tab in Developer Tools (F12).
- Identify the XHR request that is triggered, which typically points to an endpoint like:
https://flash-api.jin10.com/get_flash_list
Key Request Details:
- Method: GET
- Headers:
  - x-app-id: required authentication token (e.g., SO1EJGmNgCtmpcPF)
  - x-version: API version (e.g., 1.0.0)
- Query Parameters:
  - max_time: timestamp used for pagination (e.g., 2019-08-12 14:18:48)
  - channel: news category filter (e.g., -8200 for general financial updates)
🔍 Insight: Each new request uses the timestamp (time) of the last retrieved item as the new max_time, enabling backward pagination through historical data.
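Before writing the full scraper, you can sanity-check the pagination with a minimal sketch like the one below (assuming the endpoint, headers, and parameters captured above are still valid; the x-app-id token is the one observed at the time of writing and may have changed since):

import requests

# One paginated request against the endpoint captured in DevTools.
url = "https://flash-api.jin10.com/get_flash_list"
headers = {"x-app-id": "SO1EJGmNgCtmpcPF", "x-version": "1.0.0"}
params = {"max_time": "2019-08-12 14:18:48", "channel": "-8200"}

items = requests.get(url, params=params, headers=headers).json().get("data", [])
if items:
    # The last item's timestamp becomes the next request's max_time.
    params["max_time"] = items[-1]["time"]
    print("Next max_time:", params["max_time"])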
Data Scraping and Database Storage
Now that we understand the API structure, let’s build a script to scrape data and store it in MySQL.
1. Create the MySQL Table
Run this SQL command to set up a table named jin10_data:
DROP TABLE IF EXISTS `jin10_data`;
CREATE TABLE `jin10_data` (
`id` varchar(50) DEFAULT NULL,
`time` varchar(20) DEFAULT NULL,
`content` longtext
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

This schema stores unique identifiers, timestamps, and news content.
2. Parse JSON Response
The API returns data in JSON format. The actual news items reside under the data key:
{
"status": 200,
"message": "OK",
"data": [
{
"id": "20190812082759520100",
"time": "2019-08-12 08:27:59",
"data": {
"content": "New Zealand Treasury: Asset purchase program is a less effective tool."
}
}
]
}

We extract id, time, and content from each entry.
3. Full Python Script
import requests
import pymysql
def save(conn, cur, id_val, time_val, content_val):
sql = 'INSERT INTO jin10_data(id, time, content) VALUES (%s, %s, %s)'
try:
cur.execute(sql, (id_val, time_val, content_val))
conn.commit()
except Exception as e:
print(f"Insert error: {e}")
# Setup connection
conn = pymysql.connect(
host='127.0.0.1',
user='root',
password='123456',
db='python_data',
    charset='utf8mb4'  # match the table's utf8mb4 charset so 4-byte characters insert cleanly
)
cur = conn.cursor()
# API configuration
url = "https://flash-api.jin10.com/get_flash_list"
headers = {
"x-app-id": "SO1EJGmNgCtmpcPF",
"x-version": "1.0.0"
}
params = {
"max_time": "2019-08-12 14:18:48",
"channel": "-8200"
}
total_count = 0
while True:
response = requests.get(url, params=params, headers=headers)
data = response.json().get('data', [])
length = len(data)
if length == 0:
break
for i in range(length):
try:
temp_id = data[i]['id']
temp_time = data[i]['time']
temp_content = data[i]['data']['content']
save(conn, cur, temp_id, temp_time, temp_content)
except Exception as e:
print(f"Error processing item {i}: {e}")
total_count += length
params['max_time'] = data[-1]['time']
print(f"Next query time: {params['max_time']}")
cur.close()
conn.close()
print(f"Scraping complete. Total records saved: {total_count}")This script continuously fetches and stores data until no more results are returned.
Generating a Word Cloud
With financial text now in our database, we can analyze keyword frequency and visualize trends.
1. Preparation
You’ll need:
- A TrueType font supporting Chinese characters (e.g., simsun.ttc).
- A mask image (mask.png) to shape the word cloud; common choices include globes, dollar signs, or stock charts.
2. Code Implementation
import jieba.analyse
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
import requests
# Fetch data
news_content = ""
url = "https://flash-api.jin10.com/get_flash_list"
headers = {"x-app-id": "SO1EJGmNgCtmpcPF", "x-version": "1.0.0"}
params = {"max_time": "2019-08-12 14:18:48", "channel": "-8200"}
count = 0
while count < 500:
response = requests.get(url, params=params, headers=headers).json()
data = response.get('data', [])
if not data:
break
for item in data:
news_content += item['data']['content']
params['max_time'] = data[-1]['time']
count += len(data)
# Extract keywords using TextRank algorithm
result = jieba.analyse.textrank(news_content, topK=50, withWeight=True)
keywords = {item[0]: item[1] for item in result}
print("Top Keywords:", keywords)
# Generate word cloud
mask_img = np.array(Image.open('./mask.png'))
wc = WordCloud(
font_path='./simsun.ttc',
background_color='white',
max_words=50,
mask=mask_img
).generate_from_frequencies(keywords)
# Recolor based on image
image_colors = ImageColorGenerator(mask_img)
plt.imshow(wc.recolor(color_func=image_colors), interpolation='bilinear')
plt.axis("off")
plt.show()
wc.to_file('financial_wordcloud.png')

This generates a professional-looking financial keyword cloud, highlighting terms like inflation, GDP, interest rate, and monetary policy.
Programming Best Practices & Lessons Learned
This project highlights several key coding principles:
✅ Key Takeaways:
- Use type(variable) to inspect data types during development.
- Get a list's length with len(list_name).
- Iterate with for i in range(len(list)).
- Modularize code: save functions in separate .py files and import them (import module_name).
- Define functions before calling them to avoid NameError.
- Manually install .whl files when pip install fails on Windows.
- Database operations in Python are significantly simpler than in Java, often just three or four lines.
- Adjust max_time slightly (e.g., subtract a few seconds) to avoid duplicate entries caused by overlapping timestamps, as sketched below.
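For the last point, here is a minimal sketch of nudging max_time back before the next run (the two-second offset is illustrative, not a tested value):

from datetime import datetime, timedelta

last_time = "2019-08-12 14:18:48"  # time of the last item retrieved
dt = datetime.strptime(last_time, "%Y-%m-%d %H:%M:%S")
next_max_time = (dt - timedelta(seconds=2)).strftime("%Y-%m-%d %H:%M:%S")
print(next_max_time)  # 2019-08-12 14:18:46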
Frequently Asked Questions (FAQ)
Q: Can this script run continuously for real-time updates?
Yes! Schedule it using tools like cron (Linux) or Task Scheduler (Windows) to run every few minutes and capture live market sentiment.
Q: Why use jieba.analyse.textrank() instead of TF-IDF?
TextRank better identifies contextually important terms in short news snippets, whereas TF-IDF favors frequent but potentially less relevant words.
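To compare the two on your own data, both extractors share the same calling convention. A minimal sketch with an illustrative headline:

import jieba.analyse

# Sample headline: "Fed cuts rates by 25bp; inflation and GDP worries ease."
text = "美联储宣布降息25个基点，市场对通胀和GDP增长的担忧有所缓解。"

# TF-IDF ranks terms by frequency weighted against a reference corpus;
# TextRank ranks terms by centrality in a word co-occurrence graph.
print("TF-IDF:", jieba.analyse.extract_tags(text, topK=10, withWeight=True))
print("TextRank:", jieba.analyse.textrank(text, topK=10, withWeight=True))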
Q: Is web scraping legal for financial news?
Scraping public-facing data for personal analysis is generally acceptable, but avoid high-frequency requests and always review the site’s robots.txt and Terms of Service.
Q: How can I improve word cloud readability?
Choose high-contrast masks, adjust max_words, and filter out common stop words using jieba.analyse.set_stop_words().
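A minimal sketch, assuming a stop-word file named stopwords.txt with one word per line (the filename is hypothetical):

import jieba.analyse

# Sample text: "Central bank to keep monetary policy steady, watching inflation and rates."
news_content = "央行表示将维持稳健的货币政策，关注通胀与利率走势。"

# Words listed in the file are excluded from keyword extraction.
jieba.analyse.set_stop_words('./stopwords.txt')
print(jieba.analyse.textrank(news_content, topK=50, withWeight=True))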
Q: Can I store structured metadata (like importance level)?
Yes—the original JSON includes fields like important, tags, and channel. Extend your MySQL table to include these for richer analysis.
Q: What if the API blocks my requests?
Add delays (time.sleep(2)) between calls and consider rotating headers or using session objects to mimic human behavior.
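For example, a throttled variant of the fetch loop (a sketch; the shared session and the two-second pause are the only changes from the scraping script above):

import time
import requests

url = "https://flash-api.jin10.com/get_flash_list"
params = {"max_time": "2019-08-12 14:18:48", "channel": "-8200"}

session = requests.Session()  # reuses the underlying TCP connection across calls
session.headers.update({"x-app-id": "SO1EJGmNgCtmpcPF", "x-version": "1.0.0"})

while True:
    data = session.get(url, params=params).json().get("data", [])
    if not data:
        break
    params["max_time"] = data[-1]["time"]
    time.sleep(2)  # pause between calls to reduce server load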