In today’s data-driven world, extracting meaningful insights from real-time financial news is a powerful skill. This guide walks you through a practical Python project: scraping live financial headlines from Jin10 (a popular financial news platform), storing them in a MySQL database, and generating a visually compelling word cloud to identify trending topics.
We'll cover the full workflow—from environment setup and API analysis to database integration and natural language processing—using core Python libraries like requests, pymysql, jieba, and wordcloud.
Environment Setup
Before diving into data extraction, ensure your Python environment is properly configured. The following libraries are essential for this project:
- requests: to send HTTP requests and fetch data from the Jin10 API.
- pymysql: for connecting Python to a MySQL database.
- jieba: a Chinese text segmentation tool, essential for analyzing non-English content.
- wordcloud: to generate visual word clouds from textual data.
- PIL (Pillow): for image processing, especially when using a background mask.
💡 Note: This project was developed with Python 3.6 on Windows. Some packages, like wordcloud, may fail to install via pip install wordcloud due to missing compiled binaries.
Installing Packages on Windows
If standard pip installation fails:
- Visit a trusted third-party repository such as Christoph Gohlke's Python Extension Packages.
- Download the .whl file matching your Python version and system architecture (e.g., wordcloud-1.8.1-cp36-cp36m-win32.whl for Python 3.6, 32-bit), then install it locally:
pip install path/to/downloaded/wordcloud-1.8.1-cp36-cp36m-win32.whl
This method reliably resolves dependency and compilation issues common in Windows environments.
API Parameter Analysis
To scrape dynamic content from Jin10's homepage, we analyze its AJAX-based "Load More" feature.
Step-by-Step Inspection:
- Open Jin10.com in your browser.
- Navigate to the homepage news feed.
- Click "Load More" while monitoring the Network tab in Developer Tools (F12).
- Identify the XHR request that is triggered, which typically points to an endpoint like:
https://flash-api.jin10.com/get_flash_list
Key Request Details:
- Method: GET
- Headers:
  - x-app-id: required authentication token (e.g., SO1EJGmNgCtmpcPF)
  - x-version: API version (e.g., 1.0.0)
- Query Parameters:
  - max_time: timestamp used for pagination (e.g., 2019-08-12 14:18:48)
  - channel: news category filter (e.g., -8200 for general financial updates)
🔍 Insight: Each new request uses the timestamp (time) of the last retrieved item as the new max_time, enabling backward pagination through historical data.
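Before writing the full scraper, you can sanity-check the pagination with a minimal sketch like the one below (assuming the endpoint, headers, and parameters captured above are still valid; the x-app-id token is the one observed at the time of writing and may have changed since):

import requests

# One paginated request against the endpoint captured in DevTools.
url = "https://flash-api.jin10.com/get_flash_list"
headers = {"x-app-id": "SO1EJGmNgCtmpcPF", "x-version": "1.0.0"}
params = {"max_time": "2019-08-12 14:18:48", "channel": "-8200"}

items = requests.get(url, params=params, headers=headers).json().get("data", [])
if items:
    # The last item's timestamp becomes the next request's max_time.
    params["max_time"] = items[-1]["time"]
    print("Next max_time:", params["max_time"])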
Data Scraping and Database Storage
Now that we understand the API structure, let’s build a script to scrape data and store it in MySQL.
1. Create the MySQL Table
Run this SQL command to set up a table named jin10_data:
DROP TABLE IF EXISTS `jin10_data`;
CREATE TABLE `jin10_data` (
`id` varchar(50) DEFAULT NULL,
`time` varchar(20) DEFAULT NULL,
`content` longtext
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

This schema stores unique identifiers, timestamps, and news content.
2. Parse JSON Response
The API returns data in JSON format. The actual news items reside under the data key:
{
"status": 200,
"message": "OK",
"data": [
{
"id": "20190812082759520100",
"time": "2019-08-12 08:27:59",
"data": {
"content": "New Zealand Treasury: Asset purchase program is a less effective tool."
}
}
]
}

We extract id, time, and content from each entry.
3. Full Python Script
import requests
import pymysql
def save(conn, cur, id_val, time_val, content_val):
sql = 'INSERT INTO jin10_data(id, time, content) VALUES (%s, %s, %s)'
try:
cur.execute(sql, (id_val, time_val, content_val))
conn.commit()
except Exception as e:
print(f"Insert error: {e}")
# Setup connection
conn = pymysql.connect(
host='127.0.0.1',
user='root',
password='123456',
db='python_data',
    charset='utf8mb4'  # match the table's utf8mb4 charset so 4-byte characters insert cleanly
)
cur = conn.cursor()
# API configuration
url = "https://flash-api.jin10.com/get_flash_list"
headers = {
"x-app-id": "SO1EJGmNgCtmpcPF",
"x-version": "1.0.0"
}
params = {
"max_time": "2019-08-12 14:18:48",
"channel": "-8200"
}
total_count = 0
while True:
response = requests.get(url, params=params, headers=headers)
data = response.json().get('data', [])
length = len(data)
if length == 0:
break
for i in range(length):
try:
temp_id = data[i]['id']
temp_time = data[i]['time']
temp_content = data[i]['data']['content']
save(conn, cur, temp_id, temp_time, temp_content)
except Exception as e:
print(f"Error processing item {i}: {e}")
total_count += length
params['max_time'] = data[-1]['time']
print(f"Next query time: {params['max_time']}")
cur.close()
conn.close()
print(f"Scraping complete. Total records saved: {total_count}")This script continuously fetches and stores data until no more results are returned.
Generating a Word Cloud
With financial text now in our database, we can analyze keyword frequency and visualize trends.
1. Preparation
You’ll need:
- A TrueType font supporting Chinese characters (e.g., simsun.ttc).
- A mask image (mask.png) to shape the word cloud; common choices include globes, dollar signs, or stock charts.
2. Code Implementation
import jieba.analyse
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
import requests
# Fetch data
news_content = ""
url = "https://flash-api.jin10.com/get_flash_list"
headers = {"x-app-id": "SO1EJGmNgCtmpcPF", "x-version": "1.0.0"}
params = {"max_time": "2019-08-12 14:18:48", "channel": "-8200"}
count = 0
while count < 500:
response = requests.get(url, params=params, headers=headers).json()
data = response.get('data', [])
if not data:
break
for item in data:
news_content += item['data']['content']
params['max_time'] = data[-1]['time']
count += len(data)
# Extract keywords using TextRank algorithm
result = jieba.analyse.textrank(news_content, topK=50, withWeight=True)
keywords = {item[0]: item[1] for item in result}
print("Top Keywords:", keywords)
# Generate word cloud
mask_img = np.array(Image.open('./mask.png'))
wc = WordCloud(
font_path='./simsun.ttc',
background_color='white',
max_words=50,
mask=mask_img
).generate_from_frequencies(keywords)
# Recolor based on image
image_colors = ImageColorGenerator(mask_img)
plt.imshow(wc.recolor(color_func=image_colors), interpolation='bilinear')
plt.axis("off")
plt.show()
wc.to_file('financial_wordcloud.png')

This generates a professional-looking financial keyword cloud, highlighting terms like inflation, GDP, interest rate, and monetary policy.
Programming Best Practices & Lessons Learned
This project highlights several key coding principles:
✅ Key Takeaways:
- Use type(variable) to inspect data types during development.
- Get a list's length with len(list_name).
- Iterate with for i in range(len(list)).
- Modularize code: save functions in separate .py files and import them (import module_name).
- Define functions before calling them to avoid NameError.
- Manually install .whl files when pip install fails on Windows.
- Database operations in Python are significantly simpler than in Java, often just three or four lines.
- Adjust max_time slightly (e.g., subtract a few seconds) to avoid duplicate entries caused by overlapping timestamps, as sketched below.
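For the last point, here is a minimal sketch of nudging max_time back before the next run (the two-second offset is illustrative, not a tested value):

from datetime import datetime, timedelta

last_time = "2019-08-12 14:18:48"  # time of the last item retrieved
dt = datetime.strptime(last_time, "%Y-%m-%d %H:%M:%S")
next_max_time = (dt - timedelta(seconds=2)).strftime("%Y-%m-%d %H:%M:%S")
print(next_max_time)  # 2019-08-12 14:18:46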
Frequently Asked Questions (FAQ)
Q: Can this script run continuously for real-time updates?
Yes! Schedule it using tools like cron (Linux) or Task Scheduler (Windows) to run every few minutes and capture live market sentiment.
Q: Why use jieba.analyse.textrank() instead of TF-IDF?
TextRank better identifies contextually important terms in short news snippets, whereas TF-IDF favors frequent but potentially less relevant words.
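To compare the two on your own data, both extractors share the same calling convention. A minimal sketch with an illustrative headline:

import jieba.analyse

# Sample headline: "Fed cuts rates by 25bp; inflation and GDP worries ease."
text = "美联储宣布降息25个基点，市场对通胀和GDP增长的担忧有所缓解。"

# TF-IDF ranks terms by frequency weighted against a reference corpus;
# TextRank ranks terms by centrality in a word co-occurrence graph.
print("TF-IDF:", jieba.analyse.extract_tags(text, topK=10, withWeight=True))
print("TextRank:", jieba.analyse.textrank(text, topK=10, withWeight=True))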
Q: Is web scraping legal for financial news?
Scraping public-facing data for personal analysis is generally acceptable, but avoid high-frequency requests and always review the site’s robots.txt and Terms of Service.
Q: How can I improve word cloud readability?
Choose high-contrast masks, adjust max_words, and filter out common stop words using jieba.analyse.set_stop_words().
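A minimal sketch, assuming a stop-word file named stopwords.txt with one word per line (the filename is hypothetical):

import jieba.analyse

# Sample text: "Central bank to keep monetary policy steady, watching inflation and rates."
news_content = "央行表示将维持稳健的货币政策，关注通胀与利率走势。"

# Words listed in the file are excluded from keyword extraction.
jieba.analyse.set_stop_words('./stopwords.txt')
print(jieba.analyse.textrank(news_content, topK=50, withWeight=True))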
Q: Can I store structured metadata (like importance level)?
Yes—the original JSON includes fields like important, tags, and channel. Extend your MySQL table to include these for richer analysis.
Q: What if the API blocks my requests?
Add delays (time.sleep(2)) between calls and consider rotating headers or using session objects to mimic human behavior.
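For example, a throttled variant of the fetch loop (a sketch; the shared session and the two-second pause are the only changes from the scraping script above):

import time
import requests

url = "https://flash-api.jin10.com/get_flash_list"
params = {"max_time": "2019-08-12 14:18:48", "channel": "-8200"}

session = requests.Session()  # reuses the underlying TCP connection across calls
session.headers.update({"x-app-id": "SO1EJGmNgCtmpcPF", "x-version": "1.0.0"})

while True:
    data = session.get(url, params=params).json().get("data", [])
    if not data:
        break
    params["max_time"] = data[-1]["time"]
    time.sleep(2)  # pause between calls to reduce server load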