PubMed-20k-rct EDA

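A text dataset of PubMed medical research abstracts in which each sentence is labeled with one of five structural categories (Background, Objective, Methods, Results, Conclusions). Each labeled sentence is counted as one document in the statistics below.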

Modality 2: Text Data

1. Analysis Methodology

This report uses two types of text processing approaches depending on the analysis goal:

Analysis Type                              Raw Data   Stopwords Removed   Reason
Basic Statistics (Word/Character counts)   Yes        -                   Measure actual document size
Stop Words Analysis                        Yes        -                   Analyze stopword frequency
Word Frequency (Top overall words)         -          Yes                 Find meaningful keywords
Category Keywords                          -          Yes                 Category-specific terms
Vocabulary Richness                        -          Yes                 Content vocabulary diversity
TF-IDF Terms                               -          Yes                 Most distinctive terms
N-grams (Bigrams)                          -          Yes                 Meaningful phrase patterns
Distributions (Word/Char histograms)       Yes        -                   Show actual text lengths
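
The two processing paths can be sketched as a single tokenizer with a switch. This is a minimal illustration, not the report's actual pipeline: the regular expression, the function name tokenize, and the tiny STOPWORDS set are all assumptions, since the report does not state which tokenizer or stop list it used.

```python
import re

# Illustrative stop list only (assumption); the report does not name its stop list.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "with", "is", "was", "for"}

def tokenize(text, remove_stopwords=False):
    """Lowercase the text and split it into word tokens.

    Hyphenated and apostrophized words such as "theory-based" stay whole.
    With remove_stopwords=True, tokens found in STOPWORDS are dropped.
    """
    tokens = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

sentence = "Emotional eating is associated with overeating"
raw_tokens = tokenize(sentence)                             # raw path
content_tokens = tokenize(sentence, remove_stopwords=True)  # stopwords-removed path
```

Size-oriented analyses would use raw_tokens, while keyword-oriented analyses would use content_tokens.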

2. Dataset Overview

Note: Statistics include all words (before stopword removal).

Total Documents: 176,642
Categories: 5
Avg Words/Doc: 26.7
Avg Chars/Doc: 151.3

Category Distribution

Category Count Percentage
METHODS 59,281 33.56%
RESULTS 57,953 32.81%
CONCLUSIONS 27,168 15.38%
BACKGROUND 18,402 10.42%
OBJECTIVE 13,838 7.83%
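
As a quick check, the percentages can be recomputed from the counts in the table. The sketch below uses only the reported counts; the variable names are illustrative.

```python
# Counts copied from the Category Distribution table.
counts = {
    "METHODS": 59_281,
    "RESULTS": 57_953,
    "CONCLUSIONS": 27_168,
    "BACKGROUND": 18_402,
    "OBJECTIVE": 13_838,
}

total = sum(counts.values())

# Share of each category as a percentage, rounded to two decimals.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

# Ratio between the largest and smallest category, a simple imbalance measure.
imbalance = max(counts.values()) / min(counts.values())
```

The counts sum to the reported total of 176,642 documents, and the largest category (METHODS) is roughly 4.3 times the size of the smallest (OBJECTIVE).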

Stop Words Analysis

🔴 Raw Data - Analyzing dataset noise

Total Stop Words: 1,507,773
Of All Words: 31.97%
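
The stopword share is the number of stopword tokens divided by the number of all tokens. The sketch below shows that calculation on a token list and then cross-checks the reported figures against the overview numbers. The function name stopword_share and the example stop set are assumptions for illustration.

```python
def stopword_share(tokens, stopwords):
    """Fraction of tokens that are stopwords (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return sum(t in stopwords for t in tokens) / len(tokens)

# Toy example with an illustrative stop set.
example_share = stopword_share(["the", "trial", "of", "drug"], {"the", "of"})

# Consistency check using figures reported in this section and the overview.
total_stopwords = 1_507_773
reported_share = 0.3197
implied_total_words = total_stopwords / reported_share  # about 4.72 million
words_from_average = 176_642 * 26.7                     # documents x avg words/doc
relative_gap = abs(implied_total_words - words_from_average) / words_from_average
```

The two estimates of the total word count agree to well within 0.1%, which suggests the stopword share was computed over the same raw token counts as the overview statistics.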

Word Count Distribution

🔴 Raw Data - Distribution of actual document sizes

Character Count Distribution

🔴 Raw Data - Distribution of actual document sizes

Vocabulary Richness (Type-Token Ratio) by Category

🟢 Stopwords Removed - Measuring content vocabulary diversity

Category Total Words Unique Words Total Texts Richness Score
OBJECTIVE 196,061 23,275 13,838 11.87%
METHODS 716,277 42,974 59,281 6.00%
RESULTS 692,795 34,538 57,953 4.99%
CONCLUSIONS 329,612 28,838 27,168 8.75%
BACKGROUND 233,129 22,964 18,402 9.85%
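
The Richness Score in the table matches the type-token ratio, that is, unique words divided by total words. The sketch below defines the ratio for a token list and recomputes the scores from the reported totals; the names are illustrative.

```python
def type_token_ratio(tokens):
    """Unique tokens divided by total tokens (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# (total words, unique words) copied from the table above.
reported = {
    "OBJECTIVE": (196_061, 23_275),
    "METHODS": (716_277, 42_974),
    "RESULTS": (692_795, 34_538),
    "CONCLUSIONS": (329_612, 28_838),
    "BACKGROUND": (233_129, 22_964),
}

# Richness as a percentage, rounded to two decimals.
richness = {
    label: round(100 * unique / total, 2)
    for label, (total, unique) in reported.items()
}
```

Because the type-token ratio tends to fall as a corpus grows, the ordering here largely follows category size: the smallest category (OBJECTIVE) scores highest and the two largest (METHODS, RESULTS) score lowest. The scores are therefore best compared with that size effect in mind.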

Top Words by Category

🟢 Stopwords Removed - Finding category-specific terms

TF-IDF Weighted Terms by Category

🟢 Stopwords Removed - Finding the most distinctive terms
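
One way to obtain category-level TF-IDF scores is to treat all tokens of a category as a single document. The sketch below is an assumption about the approach, not the report's confirmed method: the function name tfidf_by_category, the category-as-document grouping, and the smoothed IDF formula are illustrative choices.

```python
import math
from collections import Counter

def tfidf_by_category(tokens_by_category):
    """Score each term per category with TF x smoothed IDF.

    TF  = term count / total tokens in the category
    IDF = ln((1 + N) / (1 + df)) + 1, where N is the number of categories
          and df is the number of categories containing the term.
    """
    n_categories = len(tokens_by_category)
    doc_freq = Counter()
    for tokens in tokens_by_category.values():
        doc_freq.update(set(tokens))

    scores = {}
    for category, tokens in tokens_by_category.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[category] = {
            term: (count / total)
            * (math.log((1 + n_categories) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return scores

# Toy input; real input would be the stopword-filtered tokens of each category.
toy = {
    "METHODS": ["patients", "randomized", "patients"],
    "RESULTS": ["patients", "significant"],
}
toy_scores = tfidf_by_category(toy)
```

In the toy data, "significant" outranks "patients" within RESULTS because "patients" appears in both categories and so receives a lower IDF.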

Bigram Analysis

🟢 Stopwords Removed - Finding meaningful phrase patterns
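
Bigram counting pairs each token with its successor inside a sentence. The sketch below is a minimal version under the assumption that bigrams are formed per sentence after stopword removal; the function name top_bigrams is illustrative.

```python
from collections import Counter

def top_bigrams(token_lists, k=10):
    """Return the k most frequent adjacent token pairs across all sentences."""
    counts = Counter()
    for tokens in token_lists:
        # zip pairs each token with the next one; pairs never cross sentences.
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(k)

# Toy input; real input would be one filtered token list per sentence.
toy_sentences = [
    ["placebo", "group", "placebo", "group"],
    ["control", "group"],
]
most_common = top_bigrams(toy_sentences, k=1)
```

Note that forming bigrams after stopword removal can join words that were not adjacent in the original sentence, so some pairs describe co-occurrence rather than literal phrases.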

Category Similarity Matrix

🟢 Stopwords Removed - Comparing category vocabularies

Category Distribution by Sentence Position

🟢 Stopwords Removed - Tracking structural patterns

Most Frequent Words Overall

🟢 Stopwords Removed - Finding meaningful keywords

3. Text Statistics

Note: All statistics are based on raw text (before stopword removal).

Min Words: 1
Max Words: 296
Median Words: 24
Std Dev: 15.28
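
Summary statistics of this kind can be computed from per-document word counts. The sketch below assumes whitespace tokenization and the population standard deviation; the report states neither, so both are assumptions, as is the function name word_count_stats.

```python
import statistics

def word_count_stats(texts):
    """Min, max, median, mean, and std dev of whitespace-split word counts."""
    counts = [len(text.split()) for text in texts]
    return {
        "min": min(counts),
        "max": max(counts),
        "median": statistics.median(counts),
        "mean": statistics.mean(counts),
        # Population std dev (assumption); statistics.stdev gives the sample version.
        "std": statistics.pstdev(counts),
    }

# Toy input with word counts 3, 1, and 2.
toy_stats = word_count_stats(["a b c", "a", "a b"])
```

With a reported median of 24, a mean of 26.7, and a maximum of 296 words, the distribution is right-skewed: a small number of very long sentences pull the mean above the median.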

4. Structural Label Samples

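Representative sentences for each section of a medical abstract. Pos is the sentence's relative position within its abstract (0.0 = first sentence, 1.0 = last sentence).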

Background
[Pos: 0.0]

Emotional eating is associated with overeating and the development of obesity.

[Pos: 0.077]

We tested whether theory-based education increases alarm operability.

Objective
[Pos: 0.0]

To investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain, mobility, and systemic low-grade inflammation...

[Pos: 0.2]

The aim of this study was to test if attention bias for food moderates the effect of self-reported emotional eating during sad mood...

Methods
[Pos: 0.091]

A total of @ patients with primary knee OA were randomized @:@; @ received @ mg/day of prednisolone and @ received placebo for @ weeks.

[Pos: 0.182]

Outcome measures included pain reduction and improvement in function scores and systemic inflammation markers.

Results
[Pos: 0.545]

There was a clinically relevant reduction in the intervention group compared to the placebo group for knee pain, physical function, PGA...

[Pos: 0.727]

Further, there was a clinically relevant reduction in the serum levels of IL-@, IL-@, TNF-, and hsCRP at @ weeks in the intervention group...

Conclusions
[Pos: 1.0]

Low-dose oral prednisolone had both a short-term and a longer sustained effect resulting in less knee pain, better physical function...

[Pos: 1.0]

Results further suggest that attention maintenance on food relates to eating motivation when in a neutral affective state...
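
The Pos values shown with the samples are consistent with a zero-based sentence index divided by the index of the last sentence. For example, 0.091, 0.182, 0.545, and 0.727 match indices 1, 2, 6, and 8 in a 12-sentence abstract. The sketch below implements that formula as an assumption; the report does not state how Pos was computed, and the function name relative_positions is illustrative.

```python
def relative_positions(sentences):
    """Relative position of each sentence: index / (n - 1), rounded to 3 decimals.

    The first sentence gets 0.0 and the last gets 1.0. An abstract with a
    single sentence gets position 0.0.
    """
    n = len(sentences)
    if n <= 1:
        return [0.0] * n
    return [round(i / (n - 1), 3) for i in range(n)]

# A 12-sentence abstract, using placeholder strings for the sentences.
positions = relative_positions(["sentence"] * 12)
```

A position feature like this captures the structural ordering visible in the samples: Background and Objective sentences sit near 0.0, and Conclusions sit at or near 1.0.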

5. Key Insights from EDA