Himabindu Lakkaraju
~Himabindu_Lakkaraju1
17
Total papers
8.5
Average submissions per year
Average rating
Acceptance: 9/17
Venue distribution
ICLR
11
NeurIPS
4
COLM
2
Papers (17)
2025: 11 papers
4
More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness
ICLR 2025 Oral
4
Generalized Group Data Attribution
ICLR 2025 Rejected
4
Towards Unifying Interpretability and Control: Evaluation via Intervention
ICLR 2025 Rejected
4
Inference-Time Reward Hacking in Large Language Models
NeurIPS 2025 Spotlight
4
Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models
NeurIPS 2025 Poster
6
On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models
ICLR 2025 Rejected
4
Weak-to-Strong Trustworthiness: Eliciting Trustworthiness with Weak Supervision
ICLR 2025 Rejected
4
EvoLM: In Search of Lost Language Model Training Dynamics
NeurIPS 2025 Oral
4
Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems
ICLR 2025 Poster
4
How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence
COLM 2025 Poster
4
Quantifying Generalization Complexity for Large Language Models
ICLR 2025 Poster
2024: 6 papers
3
In-Context Unlearning: Language Models as Few Shot Unlearners
ICLR 2024 Rejected
6
Investigating the Fairness of Large Language Models for Predictions on Tabular Data
ICLR 2024 Withdrawn
3
Are Large Language Models Post Hoc Explainers?
ICLR 2024 Rejected
5
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
NeurIPS 2024 Poster
4
Certifying LLM Safety against Adversarial Prompting
ICLR 2024 Rejected
4
Certifying LLM Safety against Adversarial Prompting
COLM 2024 Poster