📄 Paper information
🔑 Key message of this paper
- (Summarize in one or two sentences)
🎓 What problem does the paper address?
- Built a benchmark dataset for evaluating methods that intervene on the internal representations of LLMs
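🧑‍🎓 Why is it important to work on this problem?
- Many methods for intervening on the internal representations of LLMs have been proposed.
- However, no unified benchmark exists, so these methods have not been evaluated fairly against each other.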
💡 What is the key idea for solving the problem?
- Built a dataset for evaluating methods along two axes: Concept Detection and Model Steering
- Concept Detection is a simple classification problem
- Model Steering is evaluated by having an LLM judge the generated text
- To prepare the data, data augmentation with GPT-4o is used
- Concept Dataset Generation
- The dataset has the same format as a preference dataset
- The instructions and the positive examples are generated by an LLM
- The negative examples are responses that belong to a different concept
- The task metric takes the maximum, over tokens, of the probability that a classifier predicts from each token's intermediate representation at a specific layer
- The classifier's prediction is a one-dimensional output in [0, 1]
- Model Steering
- Evaluation metrics
- An LLM rates each response as 0, 1, or 2
- Three scores are used: Concept, Instruction, and Fluency
- The final score is their harmonic mean
- Evaluation metrics
- The scores reported in the paper are those for specific layers
- For Model Steering, the scores are those obtained when intervening at a specific layer
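
The Model Steering score described above (three judge ratings in {0, 1, 2}, combined with a harmonic mean) can be sketched as follows. This is a minimal illustration rather than the paper's code: the function name `steering_score` and its argument names are hypothetical, and treating any rating of 0 as an overall score of 0 is the usual convention for a harmonic mean, assumed here.

```python
def steering_score(concept: int, instruction: int, fluency: int) -> float:
    """Combine three LLM-judge ratings (each 0, 1, or 2) with a harmonic mean.

    Hypothetical helper for illustration; names are not taken from the paper.
    """
    ratings = (concept, instruction, fluency)
    for r in ratings:
        if r not in (0, 1, 2):
            raise ValueError(f"rating must be 0, 1, or 2, got {r!r}")
    # Assumption: a harmonic mean with any zero term is defined as 0,
    # so a response that fails one criterion gets no credit overall.
    if 0 in ratings:
        return 0.0
    return len(ratings) / sum(1.0 / r for r in ratings)


if __name__ == "__main__":
    print(steering_score(2, 2, 2))  # 2.0: perfect on all three criteria
    print(steering_score(2, 1, 2))  # 1.5: one weaker criterion pulls the score down
    print(steering_score(2, 0, 2))  # 0.0: failing one criterion zeroes the score
```

One consequence of the harmonic mean is that a method cannot compensate for ignoring the instruction or producing disfluent text by expressing the concept very strongly, since the lowest rating dominates.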
👀 What was newly learned?
- In Concept Detection, probe-based methods performed better than SAE-based methods
- The evaluation metric is AUROC
- In particular, SAEs tend to lose performance when the data is imbalanced
- In Model Steering, SAEs did comparatively better, but still performed worse than LoRA and SFT
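
To make the Concept Detection evaluation concrete, the sketch below scores one example by taking the maximum per-token classifier probability and then computes AUROC over a set of examples. It is only an illustration under assumptions: the classifier is taken to be a linear probe followed by a sigmoid, and the names `detection_score` and `auroc` are hypothetical. The notes only state that the classifier outputs a value in [0, 1] per token, that the maximum is used, and that AUROC is the metric.

```python
import math
from typing import Sequence


def _sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def detection_score(token_states: Sequence[Sequence[float]],
                    weight: Sequence[float],
                    bias: float = 0.0) -> float:
    """Max over tokens of the per-token probability from a linear probe.

    Assumption: the classifier is sigmoid(w . h + b); other classifiers
    with a [0, 1] output would plug in the same way.
    """
    return max(
        _sigmoid(sum(w * h for w, h in zip(weight, state)) + bias)
        for state in token_states
    )


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy 2-dimensional "hidden states" for two short examples.
    probe_w = [1.0, 0.0]
    positive = [[0.0, 0.0], [3.0, 1.0]]   # one token activates the probe
    negative = [[-2.0, 0.5], [-1.0, 0.0]]
    scores = [detection_score(positive, probe_w),
              detection_score(negative, probe_w)]
    print(auroc([1, 0], scores))  # 1.0: the positive example is ranked higher
```

Because AUROC depends only on how positives are ranked relative to negatives, it does not require picking a decision threshold for the detection score.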
❓ What questions remain?
- It is concerning that the Model Steering score does not include any quantitative measure
- It is very doubtful whether evaluation by an LLM alone is sufficient
- How do models other than Gemma perform?