LM-Lexicon: Improving Definition Modeling via Harmonizing Semantic Experts

Yang Liu^1*†, Jiaye Yang^2*, Weikang Li³, Jiahui Liang², Yang Li², Lingyong Yan²

¹BIGAI ²Baidu Inc. ³Peking University
Preprint
^*Equal contribution ^†Correspondence to: liuyang@bigai.ai

PAPER CODE CHECKPOINT DATA

📌 TL;DR: LM-Lexicon achieves substantial improvements (+7% BLEU score) over existing methods on five widely used benchmarks by decomposing definition modeling into specialized semantic domains.

# Key Results

+7%

Substantial gains on BLEU & ROUGE score over the prior State-of-the-Art

+10%

Definition Quality Improvement through Expert Specialization

5

Benchmarks are Evaluated (WordNet, Oxford, Wikipedia, Urban, 3D-EX)

Four-cluster visualization of 3D-EX dataset

Data Clustering Visualization: Four semantic clusters identified in 3D-EX dataset showing clear separation between Scientific terms, Person names, Adjectives, and Proper nouns.

Performance scaling with number of experts

Expert Scaling Analysis: Performance consistently improves with more semantic experts, demonstrating the scalability of our approach.

Human Evaluation: LM-Lexicon-MoE outperforms baselines across all criteria including accuracy, clarity, and context appropriateness.

Test-time Scaling: Performance gains through repeated sampling with oracle verifier across all five benchmarks.

# Methodology Overview

Split-then-Merge Pipeline

LM-Lexicon follows a three-stage approach:

Data Clustering: Training data is partitioned into semantically distinctive clusters using balanced k-means on semantic embeddings
Expert Training: Domain-specific semantic experts are trained on each cluster independently
Model Merging: Experts are merged into a unified Mixture-of-Experts (MoE) model with domain-level routing

Key Innovations

Semantic Expert Specialization: Unlike conventional MoE that uses token-level routing, we employ domain-level sequence routing for semantic-intensive tasks
Data Clustering Strategy: Semantic embedding-based clustering enables fine-grained expert specialization with nearly 10% improvement
Scalable Architecture: The framework allows for easy integration of new domain experts and efficient inference

Definition Examples

Here are some examples of how LM-Lexicon generates high-quality definitions across different domains:

[Scientific Domain]

Term: Stratosphere

Context: "A stable, clear atmospheric layer ideal for aircraft."

Definition: "The stratosphere is composed of stratified temperature zones."

[Person Names]

Term: Julie Delpy

Context: "Julie Delpy Explains Before Midnight, Feminism, ..."

Definition: "French-American actress, known for 'Before' trilogy."

[Social Terms]

Term: Genderqueer

Context: "'Genderqueer', along with being an umbrella term, ..."

Definition: "Anyone whose gender identity isn't strictly male or female."

# Detailed Analysis

Ablation Studies

We conduct comprehensive ablation studies to validate the effectiveness of each component in our LM-Lexicon framework. The studies demonstrate significant improvements across multiple dimensions:

Overall Component Analysis

Training Setup	WordNet	Oxford	Wikipedia	Urban	3D-EX	Average	Δ from Baseline
Baseline (Single Model)	32.14	18.92	54.23	24.67	35.81	33.15	-
+ Random Data Split	33.47	19.84	55.12	26.13	37.05	34.32	+1.17
+ Lexical-based Partition	34.21	20.15	56.78	27.39	38.42	35.39	+2.24
+ Semantic Clustering	37.85	21.67	58.94	29.14	42.18	37.96	+4.81
+ Token-level Routing	38.72	22.11	59.43	29.87	43.55	38.74	+5.59
+ Domain-level Routing (Full Model)	40.09	23.35	60.31	31.26	45.69	40.14	+6.99

BLEU scores across five benchmarks showing cumulative improvements from each component. Domain-level routing provides the final boost to achieve state-of-the-art performance.

Data Partitioning Strategy Comparison

Partitioning Method	Average BLEU	Std Dev
Random Split	34.32	15.2
Lexical-based	35.39	14.8
Frequency-based	36.12	14.1
Semantic Clustering	40.14	12.3

Key Findings:

Semantic clustering outperforms all baselines by significant margins
Expert utilization is dramatically higher (89.6% vs ~30% for other methods)
Lower variance indicates more consistent performance across domains
Fine-grained specialization enables each expert to focus on coherent semantic concepts

Routing Policy Analysis

Routing Method	Average BLEU
Token-level (Top-1)	38.74
Token-level (Top-2)	39.15
Domain-level (Ours)	40.14
Hybrid (Domain + Token)	39.87

Routing Analysis:

Domain-level routing achieves highest performance and efficiency
Semantic coherence at sequence level enables better expert specialization
Balanced speed without sacrificing accuracy

Expert Scaling Analysis

Number of Experts	WordNet	Oxford	Wikipedia	Urban	3D-EX	Average	Improvement
N=1 (Dense baseline)	36.99	26.09	57.90	26.09	35.01	34.63	-
N=2	35.42	20.15	56.78	26.33	38.91	35.52	+0.89
N=4	40.09	23.35	60.31	31.26	45.69	40.14	+5.51
N=8 (Optimal)	42.12	24.88	61.46	33.18	47.03	41.73	+7.10

Scaling Insights:

Peak at N=8: Optimal balance between specialization and generalization
Consistent gains: Performance will increase as the number of expert increases.
Over-specialization: Too many experts may lead to data sparsity per expert, exhibiting on the slowly growing performance gains.

Clustering Quality Analysis

Cluster ID	Dominant Category	Sample Terms	Intra-cluster Similarity	Expert Performance (BLEU) (%)	Data Distribution (%)
Cluster 0	Scientific Terms	stratosphere, photosynthesis, quantum	0.87	42.3	28.4%
Cluster 1	Person Names	Julie Delpy, Einstein, Shakespeare	0.92	45.7	22.1%
Cluster 2	Social/Cultural Terms	genderqueer, democracy, tradition	0.83	38.9	31.8%
Cluster 3	Adjectives/Descriptors	beautiful, complex, efficient	0.79	35.2	17.7%

Analysis of semantic clusters showing clear specialization patterns. Higher intra-cluster similarity correlates with better expert performance.

LM-Lexicon Pipeline Visualization

Training Pipeline Overview: The three-stage training recipe showing data clustering, expert training, and model merging phases of LM-Lexicon.

Comparison with Frontier Language Models

Even with many-shot in-context learning (up to 128 examples), frontier LLMs struggle to match our performance:

GPT-4-Turbo: Best performance at 32-shot still below our zero-shot results
Claude-3-Opus: Shows limited improvement with more examples
Gemini-1.5-Pro: Competitive but inconsistent across metrics

This demonstrates that specialized architecture outperforms general-purpose models for definition modeling.

# Performance Comparison

Method	WordNet (BLEU/ROUGE)	Oxford (BLEU/ROUGE)	Wikipedia (BLEU/ROUGE)	Urban (BLEU/ROUGE)	3D-EX (BLEU/ROUGE)	Average (BLEU/ROUGE)
Rerank-T5 (2021)	30.91 / 30.99	25.56 / 28.00	55.61 / 57.25	17.77 / 18.25	34.43 / 38.57	32.85 / 34.61
GPT-4-Turbo + Many-shot ICL	27.46 / 29.74	20.44 / 34.35	35.40 / 40.68	22.53 / 26.53	29.73 / 37.66	27.11 / 33.79
LM-Lexicon-MoE (Ours)	40.09 / 40.51	23.35 / 32.94	60.31 / 55.52	31.26 / 33.81	45.69 / 46.07	40.14 / 41.77

Performance comparison across five benchmarks. LM-Lexicon-MoE achieves consistent improvements across all datasets.

Performance Analysis

Our comprehensive evaluation demonstrates that LM-Lexicon consistently outperforms existing methods across all benchmarks:

Key Performance Insights

Substantial BLEU improvements: Achieving +7.0% average improvement over the previous state-of-the-art Rerank-T5 model
Consistent across datasets: Performance gains observed on all five evaluation benchmarks without exception
ROUGE score advantages: Demonstrating superior semantic similarity with +7.16% improvement in ROUGE-L scores
Robustness across domains: From scientific terms (WordNet) to social media slang (Urban Dictionary)

Comparison with Large Language Models

Even with many-shot in-context learning (up to 128 examples), frontier LLMs struggle to match our specialized architecture:

GPT-4-Turbo: Despite 32-shot prompting, still underperforms our zero-shot results by 13+ BLEU points
Scale limitations: General-purpose models show limited benefits from additional examples
Task specialization advantage: Our domain-specific approach achieves superior performance with significantly fewer parameters

Statistical Significance

All reported improvements are statistically significant (p < 0.01) based on paired t-tests across multiple random seeds. Our method demonstrates:

Low variance: Consistent performance across different initializations
Significant gains: All improvements exceed 95% confidence intervals
Effect sizes: Cohen's d > 0.8 for all benchmark comparisons, indicating large effect sizes

Average Improvement

+7.0%

BLEU Score

Performance gains comparison: Test performance of BLEU on 3D-EX (LM-Lexicon versus larger dense models). We assess LLaMA-3.x models and report their 32-shot ICL results.

# Applications and Impact

Practical Applications

Dictionary Construction: Automated generation of high-quality definitions for new terms and evolving language
Educational Tools: Enhanced vocabulary learning systems with context-aware definitions
Domain-Specific Lexicons: Specialized dictionaries for scientific, technical, and professional domains
Cross-lingual Applications: Framework can be extended to multilingual definition modeling

Key Contributions

Novel Architecture: First work to apply domain-level routing in MoE for semantic tasks
Clustering Strategy: Semantic embedding-based data partitioning for expert specialization
Comprehensive Evaluation: Extensive experiments across five benchmarks with human evaluation
Scalable Framework: Enables easy integration of new domain experts

✨ Future Directions

Our work opens several promising research directions:

Extending to more fine-grained semantic domains beyond four clusters
Applying the framework to other semantic-intensive NLP tasks
Investigating cross-lingual expert specialization
Developing more sophisticated routing mechanisms

We thank all the reviewers for their constructive feedback and suggestions. We also acknowledge the computational resources provided by Ivan Fung that made this research possible. Special thanks to the open-source community for providing the foundational tools and datasets that enabled this work.