1. Analysis Methodology
This report applies one of two text-processing approaches, depending on the analysis goal:
| Analysis Type | Raw Data | Stopwords Removed | Reason |
|---|---|---|---|
| Basic Statistics (Word/Character counts) | ✓ | - | Measure actual document size |
| Stop Words Analysis | ✓ | - | Analyze stopword frequency |
| Word Frequency (Top overall words) | - | ✓ | Find meaningful keywords |
| Category Keywords | - | ✓ | Category-specific terms |
| Vocabulary Richness | - | ✓ | Content vocabulary diversity |
| TF-IDF Terms | - | ✓ | Most distinctive terms |
| N-grams (Bigrams) | - | ✓ | Meaningful phrase patterns |
| Distributions (Word/Char histograms) | ✓ | - | Show actual text lengths |
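
The two processing paths in the table can be sketched as follows. This is a minimal illustration, not the report's actual pipeline: the tokenizer regex, the tiny `STOPWORDS` set, and the sample sentence are all assumptions, since the report does not say which tokenizer or stopword list it used.

```python
import re

# Hypothetical, deliberately tiny stopword list; the report does not name the one it used.
STOPWORDS = {"the", "of", "a", "an", "in", "to", "and", "was", "were", "with"}


def tokenize(text):
    """Raw view: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def content_tokens(text):
    """Stopwords-removed view: the raw tokens minus anything in STOPWORDS."""
    return [token for token in tokenize(text) if token not in STOPWORDS]


# Made-up sentence, for illustration only.
sentence = "The patients were randomized to a treatment group."

raw = tokenize(sentence)
content = content_tokens(sentence)

# Raw view feeds size statistics (word and character counts, distributions).
print("raw word count:", len(raw))
print("character count:", len(sentence))

# Stopwords-removed view feeds keyword, richness, TF-IDF and n-gram analyses.
print("content tokens:", content)
```

Keeping both views side by side avoids understating document length in the size statistics while keeping function words out of the keyword-oriented analyses.
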
2. Dataset Overview
Note: Statistics include all words (before stopword removal)
Category Distribution
| Category | Count | Percentage |
|---|---|---|
| METHODS | 59,281 | 33.56% |
| RESULTS | 57,953 | 32.81% |
| CONCLUSIONS | 27,168 | 15.38% |
| BACKGROUND | 18,402 | 10.42% |
| OBJECTIVE | 13,838 | 7.83% |
Stop Words Analysis
🔴 Raw Data - Analyzing dataset noise
Word Count Distribution
🔴 Raw Data - Distribution of actual document sizes
Character Count Distribution
🔴 Raw Data - Distribution of actual document sizes
Vocabulary Richness (Type-Token Ratio) by Category
🟢 Stopwords Removed - Measuring content vocabulary diversity
| Category | Total Words | Unique Words | Total Texts | Richness Score |
|---|---|---|---|---|
| OBJECTIVE | 196,061 | 23,275 | 13,838 | 11.87% |
| METHODS | 716,277 | 42,974 | 59,281 | 6.00% |
| RESULTS | 692,795 | 34,538 | 57,953 | 4.99% |
| CONCLUSIONS | 329,612 | 28,838 | 27,168 | 8.75% |
| BACKGROUND | 233,129 | 22,964 | 18,402 | 9.85% |
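
The Richness Score column matches the type-token ratio, that is, unique words divided by total words. The sketch below recomputes it from the table's own counts as a cross-check; the function name `type_token_ratio` and the dictionary layout are illustrative choices, not taken from the report's code.

```python
def type_token_ratio(total_words, unique_words):
    """Unique words divided by total words; 0.0 when there are no words."""
    return unique_words / total_words if total_words else 0.0


# (total words, unique words) per category, copied from the table above.
counts = {
    "OBJECTIVE": (196_061, 23_275),
    "METHODS": (716_277, 42_974),
    "RESULTS": (692_795, 34_538),
    "CONCLUSIONS": (329_612, 28_838),
    "BACKGROUND": (233_129, 22_964),
}

richness = {
    category: type_token_ratio(total, unique)
    for category, (total, unique) in counts.items()
}

# Print categories from richest to least rich vocabulary.
for category, score in sorted(richness.items(), key=lambda item: -item[1]):
    print(f"{category:<12} {score:.2%}")
```

Because the ratio tends to shrink as the number of tokens grows, scores for categories of very different sizes are best compared with that in mind.
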
Top Words by Category
🟢 Stopwords Removed - Category-specific terms
TF-IDF Weighted Terms by Category
🟢 Stopwords Removed - Most distinctive terms
Bigram Analysis
🟢 Stopwords Removed - Meaningful phrase patterns
Category Similarity Matrix
🟢 Stopwords Removed - Measuring content vocabulary diversity
Category Distribution by Sentence Position
🟢 Stopwords Removed - Tracking structural patterns
Most Frequent Words Overall
🟢 Stopwords Removed - Finding meaningful keywords
3. Text Statistics
Note: All statistics are based on raw text (before stopword removal)
4. Structural Label Samples
Representative sentences for each section of a medical abstract
5. Key Insights from EDA
- ✓ Strict Sequential Positioning: The position heatmap indicates that sentence position is the strongest predictor. Background and Objective sit strictly at the beginning (0.0–0.15), Methods occupy the early-to-middle phase (0.2–0.5), Results dominate the latter half (0.5–0.9), and Conclusions appear exclusively at the end (0.9–1.0).
- ✓ Significant Class Imbalance: The dataset is heavily skewed toward Methods (~59k) and Results (~58k), which together make up about 66% of the data. Objective (7.83%) and Background (10.42%) are minority classes, so evaluation should focus on F1-score rather than raw accuracy.
- ✓ Distinct Lexical Signatures: Each category uses specific "trigger words." Results are highly quantitative (p-value, confidence interval), Methods are procedural (randomized, participants), and Objectives are dominated by action-oriented verbs (evaluate, compare, assess).
- ✓ Variability in Vocabulary Richness: Objective (11.87%) and Background (9.85%) have the highest vocabulary richness, indicating diverse phrasing. In contrast, Results (4.99%) has the lowest, suggesting formulaic, repetitive wording. Because type-token ratio tends to fall as the token count grows, part of this gap may reflect category size rather than style.
- ✓ High Semantic Overlap (Confusion Risk): The similarity matrix shows very high similarity between Background and Objective (0.87) and between Background and Conclusions (0.82). These pairs carry the highest risk of misclassification because they share much of their vocabulary.
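
To illustrate why the class-imbalance insight favors F1 over accuracy, the sketch below compares the two on toy labels. Everything here is an assumption made for illustration: the 100 labels only roughly echo the reported class proportions, the "classifier" is a hypothetical one that labels every Objective sentence as Background, and macro-averaging is one reasonable reading of "F1-score", not something the report specifies.

```python
def per_class_f1(y_true, y_pred):
    """Return a dict mapping each label to its F1 score."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so small classes count as much as large ones."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy labels roughly echoing the reported class proportions (not real data).
y_true = (
    ["METHODS"] * 34
    + ["RESULTS"] * 33
    + ["CONCLUSIONS"] * 15
    + ["BACKGROUND"] * 10
    + ["OBJECTIVE"] * 8
)

# Hypothetical classifier that confuses every OBJECTIVE sentence with BACKGROUND.
y_pred = ["BACKGROUND" if label == "OBJECTIVE" else label for label in y_true]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"accuracy: {accuracy:.2f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.2f}")
```

In this toy setup accuracy stays high even though one whole class is never predicted, while macro F1 drops noticeably, which is the behavior the imbalance and confusion-risk insights warn about.
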