PubMed-20k-rct EDA

A text dataset of sentences from PubMed medical research abstracts, each labeled with one of five structural categories (Background, Objective, Methods, Results, Conclusions).

Modality 2: Text Data

1. Analysis Methodology


This report uses two types of text processing approaches depending on the analysis goal:

Analysis Type	Raw Data	Stopwords Removed	Reason
Basic Statistics (Word/Character counts)	✓	-	Measure actual document size
Stop Words Analysis	✓	-	Analyze stopword frequency
Word Frequency (Top overall words)	-	✓	Find meaningful keywords
Category Keywords	-	✓	Category-specific terms
Vocabulary Richness	-	✓	Content vocabulary diversity
TF-IDF Terms	-	✓	Most distinctive terms
N-grams (Bigrams)	-	✓	Meaningful phrase patterns
Distributions (Word/Char histograms)	✓	-	Show actual text lengths
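The split in the table above can be sketched as two views of the same sentence: one keeps every token for size statistics, the other drops stop words for keyword-style analyses. This is a minimal illustration only. The tokenizer regex, the tiny `STOP_WORDS` set, and the function names are assumptions; the report does not say which tokenizer or stop word list it used.

```python
import re

# Assumption: a tiny illustrative stop word list, not the list used in the report.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "with", "is", "for", "was", "were"}

# Assumption: lowercase alphanumeric tokens, keeping "@" (the dataset's digit
# placeholder) and internal hyphens/apostrophes such as "low-dose".
TOKEN_RE = re.compile(r"[a-z0-9@]+(?:[-'][a-z0-9@]+)*")


def raw_tokens(text):
    """Raw view: every token is kept, used for word counts and distributions."""
    return TOKEN_RE.findall(text.lower())


def content_tokens(text):
    """Filtered view: stop words removed, used for keywords, TF-IDF, and bigrams."""
    return [t for t in raw_tokens(text) if t not in STOP_WORDS]


sentence = "Emotional eating is associated with overeating and the development of obesity."
raw = raw_tokens(sentence)
content = content_tokens(sentence)
stop_word_share = (len(raw) - len(content)) / len(raw)
print(len(raw), len(content), f"{stop_word_share:.2%}")
```

Keeping both views separate avoids the common mistake of reporting document lengths from text that has already been filtered.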

2. Dataset Overview

Note: Statistics include all words (before stop word removal)

Metric	Value
Total Documents	176,642
Categories	5
Avg Words/Doc	26.7
Avg Chars/Doc	151.3

Category Distribution

Category	Count	Percentage
METHODS	59,281	33.56%
RESULTS	57,953	32.81%
CONCLUSIONS	27,168	15.38%
BACKGROUND	18,402	10.42%
OBJECTIVE	13,838	7.83%
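The percentages in the table can be re-derived from the counts, which also gives the combined share of the two largest classes and the ratio between the largest and smallest class. The sketch below uses only the counts reported above; the variable names are illustrative.

```python
# Counts copied from the Category Distribution table.
counts = {
    "METHODS": 59281,
    "RESULTS": 57953,
    "CONCLUSIONS": 27168,
    "BACKGROUND": 18402,
    "OBJECTIVE": 13838,
}

total = sum(counts.values())
shares = {label: 100 * n / total for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label:<12}{counts[label]:>8,}{pct:>8.2f}%")

# Combined share of the two largest classes and the largest/smallest ratio.
top_two_share = shares["METHODS"] + shares["RESULTS"]
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"Methods + Results: {top_two_share:.2f}%")
print(f"Largest / smallest class: {imbalance_ratio:.2f}x")
```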

Stop Words Analysis

🔴 Raw Data - Analyzing dataset noise

Metric	Value
Total Stop Words	1,507,773
Share of All Words	31.97%

Word Count Distribution

🔴 Raw Data - Distribution of actual document sizes

Character Count Distribution

🔴 Raw Data - Distribution of actual document sizes

Vocabulary Richness (Type-Token Ratio) by Category

🟢 Stopwords Removed - Measuring content vocabulary diversity

Category	Total Words	Unique Words	Total Texts	Richness Score
OBJECTIVE	196,061	23,275	13,838	11.87%
METHODS	716,277	42,974	59,281	6.00%
RESULTS	692,795	34,538	57,953	4.99%
CONCLUSIONS	329,612	28,838	27,168	8.75%
BACKGROUND	233,129	22,964	18,402	9.85%
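The Richness Score column is consistent with a plain type-token ratio, that is, unique words divided by total words. The sketch below recomputes it from the table's own numbers; the helper names are illustrative, and the token-level function assumes stop words were already removed upstream.

```python
def type_token_ratio(unique_words, total_words):
    """Type-token ratio from precomputed counts; 0.0 for an empty category."""
    return unique_words / total_words if total_words else 0.0


def type_token_ratio_from_tokens(tokens):
    """Same ratio computed directly from a token list."""
    return type_token_ratio(len(set(tokens)), len(tokens))


# (total words, unique words) copied from the Vocabulary Richness table.
table = {
    "OBJECTIVE": (196061, 23275),
    "METHODS": (716277, 42974),
    "RESULTS": (692795, 34538),
    "CONCLUSIONS": (329612, 28838),
    "BACKGROUND": (233129, 22964),
}

richness = {
    label: round(100 * type_token_ratio(unique, total), 2)
    for label, (total, unique) in table.items()
}
for label, score in richness.items():
    print(f"{label:<12}{score:>6.2f}%")
```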

Top Words by Category

🟢 Stopwords Removed - Finding category-specific terms

TF-IDF Weighted Terms by Category

🟢 Stopwords Removed - Finding the most distinctive terms
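One way to obtain distinctive terms per group is to treat each group's pooled tokens as a single document and weight term frequency by a smoothed inverse document frequency. The sketch below is an assumption about the approach, not the report's actual pipeline: the function name, the smoothing formula, and the toy tokens are all hypothetical.

```python
import math
from collections import Counter


def tfidf_by_group(tokens_by_group):
    """TF-IDF per group, treating each group's pooled tokens as one document.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, which is one common
    variant; the report does not state which weighting it used.
    """
    n_groups = len(tokens_by_group)
    doc_freq = Counter()
    for tokens in tokens_by_group.values():
        doc_freq.update(set(tokens))  # count each term once per group

    scores = {}
    for label, tokens in tokens_by_group.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[label] = {
            term: (n / total) * (math.log((1 + n_groups) / (1 + doc_freq[term])) + 1)
            for term, n in counts.items()
        } if total else {}
    return scores


# Hypothetical toy tokens (stop words already removed).
toy = {
    "METHODS": ["patients", "randomized", "placebo", "randomized"],
    "RESULTS": ["patients", "reduction", "significant", "reduction"],
    "OBJECTIVE": ["evaluate", "efficacy", "patients"],
}
scores = tfidf_by_group(toy)
top_methods = max(scores["METHODS"], key=scores["METHODS"].get)
print(top_methods)
```

In the toy data "patients" occurs in every group, so its weight is pulled down relative to terms that appear in only one group.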

Bigram Analysis

🟢 Stopwords Removed - Finding meaningful phrase patterns
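Bigram counting over filtered tokens can be sketched as below. Bigrams are formed within each sentence so that pairs never span a sentence boundary. The function name and toy sentences are hypothetical, and the report does not describe its exact procedure.

```python
from collections import Counter


def bigram_counts(token_lists):
    """Count adjacent token pairs, one sentence (token list) at a time."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Hypothetical sentences, already tokenized with stop words removed.
sentences = [
    ["knee", "pain", "physical", "function"],
    ["reduced", "knee", "pain"],
]
counts = bigram_counts(sentences)
print(counts.most_common(1))
```

One caveat of this design: after stop word removal, two tokens that form a bigram may not have been adjacent in the original sentence, so some pairs reflect "X of Y" style phrases rather than literal neighbors.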

Category Similarity Matrix

🟢 Stopwords Removed - Measuring shared vocabulary between categories
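The report does not say how the similarity matrix was computed. A common choice, assumed here purely for illustration, is cosine similarity between per-category term-count vectors. The function names and toy counts below are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(vectors):
    """Pairwise similarities as a nested dict keyed by category label."""
    labels = list(vectors)
    return {a: {b: cosine_similarity(vectors[a], vectors[b]) for b in labels} for a in labels}


# Hypothetical term counts per category.
vectors = {
    "BACKGROUND": Counter({"obesity": 2, "associated": 1, "study": 1}),
    "OBJECTIVE": Counter({"study": 2, "investigate": 1, "obesity": 1}),
    "RESULTS": Counter({"reduction": 2, "significant": 1}),
}
matrix = similarity_matrix(vectors)
print(round(matrix["BACKGROUND"]["OBJECTIVE"], 3))
```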

Category Distribution by Sentence Position

🟢 Stopwords Removed - Tracking structural patterns
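A relative position in [0, 1] can be derived from a sentence's index within its abstract. The formula below, index divided by (sentence count minus one), is an assumption: it reproduces position values such as 0.091, 0.182, 0.545, and 0.727 for a 12-sentence abstract, but the report does not state its formula. The function names, bin count, and example abstract are hypothetical.

```python
from collections import Counter, defaultdict


def relative_position(line_index, total_lines):
    """Map a 0-based sentence index to [0, 1]; single-sentence abstracts get 0.0."""
    if total_lines <= 1:
        return 0.0
    return line_index / (total_lines - 1)


def position_profile(sentences, n_bins=10):
    """Count labels per position bin; `sentences` holds (label, line_index, total_lines)."""
    profile = defaultdict(Counter)
    for label, line_index, total_lines in sentences:
        pos = relative_position(line_index, total_lines)
        bin_index = min(int(pos * n_bins), n_bins - 1)  # pos == 1.0 goes in the last bin
        profile[bin_index][label] += 1
    return profile


# Hypothetical 12-sentence abstract with a few labeled sentences.
abstract = [
    ("BACKGROUND", 0, 12),
    ("METHODS", 1, 12),
    ("METHODS", 2, 12),
    ("RESULTS", 6, 12),
    ("RESULTS", 8, 12),
    ("CONCLUSIONS", 11, 12),
]
profile = position_profile(abstract)
for bin_index in sorted(profile):
    print(bin_index, dict(profile[bin_index]))
```

Aggregating these per-bin counts over all abstracts gives the rows of a position-by-category heatmap.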

Most Frequent Words Overall

🟢 Stopwords Removed - Finding meaningful keywords

3. Text Statistics

Note: All statistics are based on raw text (before stop word removal)

Metric	Value
Min Words	1
Max Words	296
Median Words	24
Std Dev	15.28
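Summary statistics of this kind can be computed from per-document word counts with the standard library. The sketch below uses hypothetical lengths, not the dataset, and assumes the population standard deviation; the report does not say whether it used the population or sample form.

```python
import statistics


def length_summary(word_counts):
    """Min, max, median, mean, and standard deviation of per-document word counts."""
    return {
        "min": min(word_counts),
        "max": max(word_counts),
        "median": statistics.median(word_counts),
        "mean": statistics.fmean(word_counts),
        # Assumption: population standard deviation; use statistics.stdev for the sample form.
        "std": statistics.pstdev(word_counts),
    }


# Hypothetical word counts for five sentences.
toy_lengths = [1, 18, 24, 24, 67]
summary = length_summary(toy_lengths)
print(summary)
```

In the report's own figures the mean (26.7 words) sits above the median (24) while the maximum reaches 296, which is the pattern expected from a right-skewed length distribution with a long tail of very long sentences.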

4. Structural Label Samples

Representative sentences for each section of a medical abstract

Background

[Pos: 0.0]

Emotional eating is associated with overeating and the development of obesity.

[Pos: 0.077]

We tested whether theory-based education increases alarm operability.

Objective

[Pos: 0.0]

To investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain, mobility, and systemic low-grade inflammation...

[Pos: 0.2]

The aim of this study was to test if attention bias for food moderates the effect of self-reported emotional eating during sad mood...

Methods

[Pos: 0.091]

A total of @ patients with primary knee OA were randomized @:@; @ received @ mg/day of prednisolone and @ received placebo for @ weeks.

[Pos: 0.182]

Outcome measures included pain reduction and improvement in function scores and systemic inflammation markers.

Results

[Pos: 0.545]

There was a clinically relevant reduction in the intervention group compared to the placebo group for knee pain, physical function, PGA...

[Pos: 0.727]

Further, there was a clinically relevant reduction in the serum levels of IL-@, IL-@, TNF-, and hsCRP at @ weeks in the intervention group...

Conclusions

[Pos: 1.0]

Low-dose oral prednisolone had both a short-term and a longer sustained effect resulting in less knee pain, better physical function...

[Pos: 1.0]

Results further suggest that attention maintenance on food relates to eating motivation when in a neutral affective state...

5. Key Insights from EDA

✓
Strong Sequential Positioning: The position heatmap indicates that sentence position is a strong predictor of the label. Background and Objective concentrate at the beginning (roughly 0.0–0.15), Methods in the early-middle portion (0.2–0.5), Results in the latter half (0.5–0.9), and Conclusions at the end (0.9–1.0). These ranges are approximate rather than strict: the samples in Section 4 include an Objective sentence at 0.2 and Methods sentences at 0.091 and 0.182.
✓
Significant Class Imbalance: The dataset is skewed toward Methods (59,281) and Results (57,953), which together make up about 66% of the data. Objective (7.83%) and Background (10.42%) are minority classes, so evaluation should report per-class or macro-averaged F1 rather than raw accuracy alone.
✓
Distinct Lexical Signatures: Each category utilizes specific "trigger words." Results are highly quantitative (p-value, confidence interval), Methods are procedural (randomized, participants), and Objectives are dominated by action-oriented verbs (evaluate, compare, assess).
✓
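Variability in Vocabulary Richness: Objective (11.87%) and Background (9.85%) have the highest vocabulary richness, suggesting more diverse phrasing, while Results (4.99%) has the lowest, suggesting formulaic and repetitive wording. Type-token ratio falls as the number of tokens grows, and Objective and Background are also the smallest categories, so part of this gap reflects category size rather than phrasing alone.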
✓
High Semantic Overlap (Confusion Risk): The similarity matrix shows very high similarity between Background & Objective (0.87) and between Background & Conclusions (0.82). These pairs carry the highest risk of misclassification because they share much of their vocabulary.
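The class-imbalance insight above can be made concrete with a trivial baseline that always predicts the largest class. The sketch uses the category counts from Section 2 and assumes macro-averaged F1 as the metric; the report only says "F1-Score" without naming the averaging, and the function name is illustrative.

```python
# Counts copied from the Category Distribution table.
counts = {
    "METHODS": 59281,
    "RESULTS": 57953,
    "CONCLUSIONS": 27168,
    "BACKGROUND": 18402,
    "OBJECTIVE": 13838,
}


def majority_baseline(counts):
    """Accuracy and macro F1 of a classifier that always predicts the largest class."""
    total = sum(counts.values())
    majority = max(counts, key=counts.get)
    accuracy = counts[majority] / total

    # For the majority class: every true instance is recovered (recall = 1),
    # and precision equals its share of the data. All other classes score F1 = 0.
    precision, recall = accuracy, 1.0
    f1_majority = 2 * precision * recall / (precision + recall)
    macro_f1 = f1_majority / len(counts)
    return majority, accuracy, macro_f1


majority, accuracy, macro_f1 = majority_baseline(counts)
print(f"Always predict {majority}: accuracy={accuracy:.3f}, macro F1={macro_f1:.3f}")
```

The baseline reaches roughly one third accuracy without learning anything, while its macro F1 stays near 0.10, which is why a class-averaged metric is the more informative yardstick here.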