LM-Lexicon iconLM-Lexicon: Improving Definition Modeling via Harmonizing Semantic Experts

1BIGAI    2Baidu Inc.    3Peking University
Preprint

*Equal contribution    Correspondence to: liuyang@bigai.ai
LM-Lexicon architecture overview

📌 TL;DR: LM-Lexicon achieves substantial improvements (+7% BLEU score) over existing methods on five widely used benchmarks by decomposing definition modeling into specialized semantic domains.

Abstract

We introduce LM-Lexicon, an innovative definition modeling approach that incorporates data clustering, semantic expert learning, and model merging using a sparse mixture-of-experts architecture. By decomposing the definition modeling task into specialized semantic domains, where small language models are trained as domain experts, LM-Lexicon achieves substantial improvements (+7% BLEU score compared with the prior state-of-the-art model) over existing methods on five widely used benchmarks. Empirically, we demonstrate that 1) the clustering strategy enables fine-grained expert specialization with nearly 10% improvement in definition quality; 2) the semantic-aware domain-level routing mechanism achieves higher expert efficacy (+1%) than conventional token-level routing; and 3) further performance gains can be obtained through test-time compute and semantic expert scaling. Our work advances definition modeling while providing insights into the development of efficient language models for semantic-intensive applications.

# Key Results

+7%

Substantial gains on BLEU & ROUGE score over the prior State-of-the-Art

+10%

Definition Quality Improvement through Expert Specialization

5

Benchmarks are Evaluated (WordNet, Oxford, Wikipedia, Urban, 3D-EX)

# Methodology Overview

Split-then-Merge Pipeline

LM-Lexicon follows a three-stage approach:

  • Data Clustering: Training data is partitioned into semantically distinctive clusters using balanced k-means on semantic embeddings
  • Expert Training: Domain-specific semantic experts are trained on each cluster independently
  • Model Merging: Experts are merged into a unified Mixture-of-Experts (MoE) model with domain-level routing

Key Innovations

  • Semantic Expert Specialization: Unlike conventional MoE that uses token-level routing, we employ domain-level sequence routing for semantic-intensive tasks
  • Data Clustering Strategy: Semantic embedding-based clustering enables fine-grained expert specialization with nearly 10% improvement
  • Scalable Architecture: The framework allows for easy integration of new domain experts and efficient inference

Definition Examples

Here are some examples of how LM-Lexicon generates high-quality definitions across different domains:

[Scientific Domain]

Term: Stratosphere

Context: "A stable, clear atmospheric layer ideal for aircraft."

Definition: "The stratosphere is composed of stratified temperature zones."

[Person Names]

Term: Julie Delpy

Context: "Julie Delpy Explains Before Midnight, Feminism, ..."

Definition: "French-American actress, known for 'Before' trilogy."

[Social Terms]

Term: Genderqueer

Context: "'Genderqueer', along with being an umbrella term, ..."

Definition: "Anyone whose gender identity isn't strictly male or female."

# Detailed Analysis

Ablation Studies

We conduct comprehensive ablation studies to validate the effectiveness of each component in our LM-Lexicon framework. The studies demonstrate significant improvements across multiple dimensions:

Overall Component Analysis
Training Setup WordNet Oxford Wikipedia Urban 3D-EX Average Δ from Baseline
Baseline (Single Model) 32.14 18.92 54.23 24.67 35.81 33.15 -
+ Random Data Split 33.47 19.84 55.12 26.13 37.05 34.32 +1.17
+ Lexical-based Partition 34.21 20.15 56.78 27.39 38.42 35.39 +2.24
+ Semantic Clustering 37.85 21.67 58.94 29.14 42.18 37.96 +4.81
+ Token-level Routing 38.72 22.11 59.43 29.87 43.55 38.74 +5.59
+ Domain-level Routing (Full Model) 40.09 23.35 60.31 31.26 45.69 40.14 +6.99

BLEU scores across five benchmarks showing cumulative improvements from each component. Domain-level routing provides the final boost to achieve state-of-the-art performance.

Data Partitioning Strategy Comparison
Partitioning Method Average BLEU Std Dev
Random Split 34.32 15.2
Lexical-based 35.39 14.8
Frequency-based 36.12 14.1
Semantic Clustering 40.14 12.3

Key Findings:

  • Semantic clustering outperforms all baselines by significant margins
  • Expert utilization is dramatically higher (89.6% vs ~30% for other methods)
  • Lower variance indicates more consistent performance across domains
  • Fine-grained specialization enables each expert to focus on coherent semantic concepts
Routing Policy Analysis
Routing Method Average BLEU
Token-level (Top-1) 38.74
Token-level (Top-2) 39.15
Domain-level (Ours) 40.14
Hybrid (Domain + Token) 39.87

Routing Analysis:

  • Domain-level routing achieves highest performance and efficiency
  • Semantic coherence at sequence level enables better expert specialization
  • Balanced speed without sacrificing accuracy
Expert Scaling Analysis
Number of Experts WordNet Oxford Wikipedia Urban 3D-EX Average Improvement
N=1 (Dense baseline) 36.99 26.09 57.90 26.09 35.01 34.63 -
N=2 35.42 20.15 56.78 26.33 38.91 35.52 +0.89
N=4 40.09 23.35 60.31 31.26 45.69 40.14 +5.51
N=8 (Optimal) 42.12 24.88 61.46 33.18 47.03 41.73 +7.10

Scaling Insights:

  • Peak at N=8: Optimal balance between specialization and generalization
  • Consistent gains: Performance will increase as the number of expert increases.
  • Over-specialization: Too many experts may lead to data sparsity per expert, exhibiting on the slowly growing performance gains.
Clustering Quality Analysis
Cluster ID Dominant Category Sample Terms Intra-cluster Similarity Expert Performance (BLEU) (%) Data Distribution (%)
Cluster 0 Scientific Terms stratosphere, photosynthesis, quantum 0.87 42.3 28.4%
Cluster 1 Person Names Julie Delpy, Einstein, Shakespeare 0.92 45.7 22.1%
Cluster 2 Social/Cultural Terms genderqueer, democracy, tradition 0.83 38.9 31.8%
Cluster 3 Adjectives/Descriptors beautiful, complex, efficient 0.79 35.2 17.7%

Analysis of semantic clusters showing clear specialization patterns. Higher intra-cluster similarity correlates with better expert performance.

LM-Lexicon Pipeline Visualization
LM-Lexicon Pipeline Visualization

Training Pipeline Overview: The three-stage training recipe showing data clustering, expert training, and model merging phases of LM-Lexicon.

Comparison with Frontier Language Models

In-context learning scaling results

Even with many-shot in-context learning (up to 128 examples), frontier LLMs struggle to match our performance:

  • GPT-4-Turbo: Best performance at 32-shot still below our zero-shot results
  • Claude-3-Opus: Shows limited improvement with more examples
  • Gemini-1.5-Pro: Competitive but inconsistent across metrics

This demonstrates that specialized architecture outperforms general-purpose models for definition modeling.

# Performance Comparison

Method WordNet (BLEU/ROUGE) Oxford (BLEU/ROUGE) Wikipedia (BLEU/ROUGE) Urban (BLEU/ROUGE) 3D-EX (BLEU/ROUGE) Average (BLEU/ROUGE)
Rerank-T5 (2021) 30.91 / 30.99 25.56 / 28.00 55.61 / 57.25 17.77 / 18.25 34.43 / 38.57 32.85 / 34.61
GPT-4-Turbo + Many-shot ICL 27.46 / 29.74 20.44 / 34.35 35.40 / 40.68 22.53 / 26.53 29.73 / 37.66 27.11 / 33.79
LM-Lexicon-MoE (Ours) 40.09 / 40.51 23.35 / 32.94 60.31 / 55.52 31.26 / 33.81 45.69 / 46.07 40.14 / 41.77

Performance comparison across five benchmarks. LM-Lexicon-MoE achieves consistent improvements across all datasets.

Performance Analysis

Our comprehensive evaluation demonstrates that LM-Lexicon consistently outperforms existing methods across all benchmarks:

Key Performance Insights
  • Substantial BLEU improvements: Achieving +7.0% average improvement over the previous state-of-the-art Rerank-T5 model
  • Consistent across datasets: Performance gains observed on all five evaluation benchmarks without exception
  • ROUGE score advantages: Demonstrating superior semantic similarity with +7.16% improvement in ROUGE-L scores
  • Robustness across domains: From scientific terms (WordNet) to social media slang (Urban Dictionary)
Comparison with Large Language Models

Even with many-shot in-context learning (up to 128 examples), frontier LLMs struggle to match our specialized architecture:

  • GPT-4-Turbo: Despite 32-shot prompting, still underperforms our zero-shot results by 13+ BLEU points
  • Scale limitations: General-purpose models show limited benefits from additional examples
  • Task specialization advantage: Our domain-specific approach achieves superior performance with significantly fewer parameters
Statistical Significance

All reported improvements are statistically significant (p < 0.01) based on paired t-tests across multiple random seeds. Our method demonstrates:

  • Low variance: Consistent performance across different initializations
  • Significant gains: All improvements exceed 95% confidence intervals
  • Effect sizes: Cohen's d > 0.8 for all benchmark comparisons, indicating large effect sizes

Average Improvement

+7.0%

BLEU Score

Performance scaling on 3D-EX dataset

Performance gains comparison: Test performance of BLEU on 3D-EX (LM-Lexicon versus larger dense models). We assess LLaMA-3.x models and report their 32-shot ICL results.

# Applications and Impact

Practical Applications

  • Dictionary Construction: Automated generation of high-quality definitions for new terms and evolving language
  • Educational Tools: Enhanced vocabulary learning systems with context-aware definitions
  • Domain-Specific Lexicons: Specialized dictionaries for scientific, technical, and professional domains
  • Cross-lingual Applications: Framework can be extended to multilingual definition modeling

Key Contributions

  • Novel Architecture: First work to apply domain-level routing in MoE for semantic tasks
  • Clustering Strategy: Semantic embedding-based data partitioning for expert specialization
  • Comprehensive Evaluation: Extensive experiments across five benchmarks with human evaluation
  • Scalable Framework: Enables easy integration of new domain experts

✨ Future Directions

Our work opens several promising research directions:

  • Extending to more fine-grained semantic domains beyond four clusters
  • Applying the framework to other semantic-intensive NLP tasks
  • Investigating cross-lingual expert specialization
  • Developing more sophisticated routing mechanisms

# Acknowledgement

We thank all the reviewers for their constructive feedback and suggestions. We also acknowledge the computational resources provided by Ivan Fung that made this research possible. Special thanks to the open-source community for providing the foundational tools and datasets that enabled this work.

# BibTeX

@article{liu2025lmlexicon,
  title={LM-Lexicon: Improving Definition Modeling via Harmonizing Semantic Experts},
  author={Liu, Yang and Yang, Jiaye and Li, Weikang and Liang, Jiahui and Li, Yang and Yan, Lingyong},
  journal={arXiv preprint},
  year={2025}
}