New research reveals LLMs struggle to create useful skills before solving tasks, but developers find success by generating skills after problem-solving to capture learned insights.

A recent study from Carnegie Mellon University and Microsoft Research reveals a counterintuitive truth about LLM-generated "skills": reusable documents of procedural instructions that help language models perform specific tasks. While these skills are valuable when consumed by LLMs, the research found that LLMs cannot reliably author their own skills before attempting a task. According to the paper:
"Self-generated skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming."
The Flawed Pre-Task Approach
The study's protocol asked models to generate skills before solving a task, using a prompt with this structure:
- Analyze task requirements
- Write 1–5 modular skill documents
- Save skills as markdown files
- Solve the task using those skills
This approach mirrors common prompting strategies like "think step by step" but fails because LLMs already engage in internal reasoning before responding. Forcing them to formalize skills upfront bakes in flawed assumptions without real-world validation.
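A minimal sketch of that pre-task flow, assuming a generic `llm(prompt)` completion helper; the function names, prompt wording, and skill-delimiter format are illustrative, not the paper's exact harness:

```python
import re
from pathlib import Path


def llm(prompt: str) -> str:
    """Placeholder for any chat-completion call (OpenAI, Anthropic, local model, ...)."""
    raise NotImplementedError


def pre_task_skills(task: str, skill_dir: Path = Path("skills")) -> str:
    """Pre-task flow: author skills first, then solve the task with them."""
    skill_dir.mkdir(exist_ok=True)

    # 1. Ask the model to analyze the task and write 1-5 modular skill documents.
    authored = llm(
        "Analyze the requirements of the following task, then write 1-5 modular "
        'skill documents in markdown, each wrapped in <skill name="NAME"> ... </skill> tags.\n\n'
        f"Task: {task}"
    )

    # 2. Save each skill as a markdown file.
    skills = re.findall(r'<skill name="([^"]+)">(.*?)</skill>', authored, re.DOTALL)
    for name, body in skills:
        (skill_dir / f"{name}.md").write_text(body.strip())

    # 3. Solve the task with the freshly authored skills in context.
    context = "\n\n".join(body.strip() for _, body in skills)
    return llm(f"Use these skills to solve the task.\n\nSkills:\n{context}\n\nTask: {task}")
```

The weakness the study identifies sits in step 1: the skill content comes entirely from the model's priors, before any attempt at the task has tested them.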
The Post-Task Alternative That Works
Developers report success by reversing the sequence: Solve the task first, then generate the skill. This captures hard-won insights gained through trial and error. For example, when experimenting with Sparse Autoencoders (SAEs) to manipulate model features (like Anthropic's Golden Gate Claude technique), several critical lessons emerged only after iterative attempts:
- Extracting features from the final layer normalization is ineffective
- Optimal feature extraction occurs mid-network
- SAEs require orders of magnitude more training data than initially assumed
Only after solving these challenges could an LLM distill the process into a reusable skill (feature_extraction.md). When applied to new models, this post-task skill worked immediately.
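For a sense of the procedural detail such a skill ends up encoding, here is a minimal sketch of mid-network activation extraction feeding a sparse autoencoder, using Hugging Face transformers and PyTorch; the model name, layer-index heuristic, and SAE width are illustrative assumptions rather than the exact setup described above:

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer


class SparseAutoencoder(nn.Module):
    """Tiny SAE: overcomplete dictionary with a sparsity-inducing ReLU encoder."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(features), features


model_name = "gpt2"  # illustrative; any model that exposes hidden_states works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)

inputs = tok("The Golden Gate Bridge spans the bay.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Lesson from the list above: take activations mid-network, not after the final
# layer norm. hidden_states[0] is the embedding output, so the middle block sits
# roughly at len(hidden_states) // 2.
mid_layer = len(out.hidden_states) // 2
acts = out.hidden_states[mid_layer]  # shape: (batch, seq_len, d_model)

sae = SparseAutoencoder(d_model=acts.shape[-1], d_hidden=8 * acts.shape[-1])
recon, features = sae(acts)  # untrained here; training it well takes far more
                             # data than one might initially assume
```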
Why Timing Matters
The key distinction lies in the origin of knowledge:
- Pre-task skills rely on the model's pretrained biases, often containing inaccuracies
- Post-task skills encode newly discovered procedural knowledge from actual problem-solving
As one developer notes: "Skills should distill lessons from millions of tokens of iteration, not textbook knowledge." This makes them valuable for recurring workflows like data transformation, API integrations, or debugging procedures.
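A minimal sketch of that post-task pattern, reusing the placeholder `llm` helper from the earlier sketch; the prompt wording, helper names, and file handling are illustrative:

```python
from pathlib import Path


def distill_skill(task: str, transcript: str, skill_path: Path) -> None:
    """Post-task pattern: solve first, then distill the trial-and-error
    transcript into a reusable markdown skill."""
    skill_md = llm(
        "You just finished the task below. Distill what actually worked, "
        "including dead ends to avoid, into a concise reusable skill document "
        "in markdown. Capture procedural steps, not textbook background.\n\n"
        f"Task: {task}\n\nFull working transcript:\n{transcript}"
    )
    skill_path.write_text(skill_md)


def solve_with_skill(task: str, skill_path: Path) -> str:
    """Reuse the captured procedure on a new instance of the same workflow."""
    skill = skill_path.read_text()
    return llm(f"Follow this skill to solve the task.\n\nSkill:\n{skill}\n\nTask: {task}")
```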
Community Response
The findings sparked discussion among practitioners:
- Many confirm anecdotal success with post-task skill generation
- Some question if task complexity determines effectiveness
- Others explore hybrid approaches combining human oversight with automated distillation
For now, the consensus is clear: Treat LLM-generated skills as documentation of conquered challenges, not speculative roadmaps. As one Hacker News commenter put it: "First climb the mountain, then draw the map."
Further Reading: CMU Paper on Self-Generated Skills
