Table of Contents
Current long-context large language models (LLMs) can process inputs up to 100,000 tokens, yet they struggle to generate outputs exceeding even a modest length of 2,000 words. Controlled experiments reveal that the model’s effective generation length is inherently limited by the examples seen during supervised fine-tuning (SFT). In other words, this output limitation stems from the scarcity of long-output examples in existing SFT datasets.
Recent advancements in long-context LLMs have led to the development of models with significantly expanded memory capacities, capable of processing history exceeding 100,000 tokens in length. However, despite their ability to handle extensive inputs, current long-context LLMs struggle to generate equally lengthy outputs.
To explore this limitation, LongWriter probes the maximum output length of state-of-the-art long-context models with multiple queries that require responses of varying lengths, such as “Write a 10,000-word article on the history of the Roman Empire.” The results show that all models consistently fail to produce outputs beyond 2,000 words in length. Meanwhile, analysis of user interaction logs reveals that over 1% of user prompts explicitly request outputs exceeding this limit, highlighting a pressing need in current research to overcome this limitation.
To address this, LongWriter introduces AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks, enabling off-the-shelf LLMs to generate coherent outputs exceeding 20,000 words. Leveraging AgentWrite, LongWriter constructs LongWriter-6k, a dataset containing 6,000 SFT data samples with output lengths ranging from 2k to 32k words. By incorporating this dataset into model training, LongWriter successfully scales the output length of existing models to over 10,000 words while maintaining output quality.
LongWriter also develops LongBench-Write, a comprehensive benchmark for evaluating ultra-long generation capabilities. The 9B parameter model, further improved through DPO, achieves state-of-the-art performance on this benchmark, surpassing even much larger proprietary models.
In this article, we will discuss the LongWriter framework, explore its architecture, and compare its performance against state-of-the-art long-context large language models. Let’s get started.
Recent advancements in long context large language models (LLMs) have led to the creation of models with significantly increased memory capacities, capable of processing histories that exceed 100,000 tokens. Despite this ability to handle extensive inputs, current long-context LLMs struggle to generate outputs of comparable length. To investigate this limitation, LongWriter examines the maximum output length of state-of-the-art long-context models through various queries that require different response lengths, such as “Write a 10,000-word article on the history of the Roman Empire.” Based on the findings, LongWriter observes that all models consistently fail to generate outputs longer than 2,000 words. Furthermore, an analysis of user interaction logs indicates that over 1% of user prompts specifically request outputs beyond this limit, highlighting an urgent need in current research to address this issue.
LongWriter’s study reveals a key insight: the constraint on output length is primarily rooted in the characteristics of the Supervised Fine-Tuning (SFT) datasets. Specifically, LongWriter finds that a model’s maximum generation length is effectively capped by the upper limit of output lengths present in its SFT dataset, despite its exposure to much longer sequences during the pretraining phase. This finding explains the ubiquitous 2,000-word generation limit across current models, as existing SFT datasets rarely contain examples exceeding this length. Furthermore, as many datasets are distilled from state-of-the-art LLMs, they also inherit the output length limitation from their source models.
To address this limitation, LongWriter introduces AgentWrite, a novel agent-based pipeline designed to leverage off-the-shelf LLMs to automatically construct extended, coherent outputs. AgentWrite operates in two stages: First, it crafts a detailed writing plan outlining the structure and target word count for each paragraph based on the user’s input. Then, following this plan, it prompts the model to generate content for each paragraph in a sequential manner. LongWriter’s experiments validate that AgentWrite can produce high-quality and coherent outputs of up to 20,000 words.
Building upon the AgentWrite pipeline, LongWriter leverages GPT-4o to generate 6,000 long-output SFT data, named LongWriter-6k, and adds this data to train existing models. Notably, LongWriter-6k successfully unlocks the model’s ability to generate well-structured outputs exceeding 10,000 words in length. To rigorously evaluate the effectiveness of this approach, LongWriter develops the LongBench-Write benchmark, which contains a diverse set of user writing instructions, with output length specifications ranging from 0-500 words, 500-2,000 words, 2,000-4,000 words, and beyond 4,000 words. Evaluation on LongBench-Write shows that LongWriter’s 9B size model achieves state-of-the-art performance, even compared to larger proprietary models. LongWriter further constructs preference data and uses DPO to help the model better follow long writing instructions and generate higher quality written content, which has also been proven effective through experiments.
To summarize, LongWriter’s work makes the following novel contributions:
- Analysis of Generation Length Limits: LongWriter identifies the primary factor limiting the output length of current long-context LLMs, which is the constraint on the output length in the SFT data.
- AgentWrite: To overcome this limitation, LongWriter proposes AgentWrite, which uses a divide-and-conquer approach with off-the-shelf LLMs to automatically construct SFT data with ultra-long outputs. Using this method, LongWriter constructs the LongWriter-6k dataset.
- Scaling Output Window Size of Current LLMs: LongWriter incorporates the LongWriter-6k dataset into its SFT data, successfully scaling the output window size of existing models to 10,000+ words without compromising output quality. LongWriter shows that DPO further enhances the model’s long-text writing capabilities.
AgentWrite: Automatic Data Construction
To utilize off-the-shelf LLMs for automatically generating SFT data with longer outputs, LongWriter designs AgentWrite, a divide-and-conquer style agent pipeline. AgentWrite first breaks down long writing tasks into multiple subtasks, with each subtask requiring the model to write only one paragraph. The model then executes these subtasks sequentially, and LongWriter concatenates the subtask outputs to obtain the final long output. Such an approach of breaking down a complex task into multiple subtasks using LLM agents has already been applied in various fields, such as problem-solving, software development, and model evaluation. LongWriter’s work is the first to explore integrating planning to enable models to complete complex long-form writing tasks. Each step of AgentWrite is introduced in detail below.
Step I: Plan
Inspired by the thought process of human writers, who typically start by making an overall plan for long writing tasks, LongWriter utilizes the planning capabilities of LLMs to output such a writing outline given a writing instruction. This plan includes the main content and word count requirements for each paragraph. The prompt used by LongWriter is as follows:
“I need you to help me break down the following long-form writing instruction into multiple subtasks. Each subtask will guide the writing of one paragraph in the essay and should include the main points and word count requirements for that paragraph. The writing instruction is as follows: {User Instruction}. Please break it down in the following format, with each subtask taking up one line:
Paragraph 1 – Main Point: [Describe the main point of the paragraph, in detail] – Word Count: [Word count requirement, e.g., 400 words]
Paragraph 2 – Main Point: [Describe the main point of the paragraph, in detail] – Word Count: [Word count requirement, e.g. 1000 words].Make sure that each subtask is clear and specific, and that all subtasks cover the entire content of the writing instruction. Do not split the subtasks too finely; each subtask’s paragraph should be no less than 200 words and no more than 1000 words. Do not output any other content.”
Step II: Write
After obtaining the writing plan from Step I, LongWriter calls the LLM serially to complete each subtask, generating the writing content section by section. To ensure the coherence of the output, when LongWriter calls the model to generate the n-th section, the previously generated n−1 sections are also input, allowing the model to continue writing the next section based on the existing writing history. Although this serial manner prevents parallel calls to the model to complete multiple subtasks simultaneously, and the input length becomes longer, LongWriter shows in validation that the overall coherence and quality of the writing obtained this way are far superior to the output generated in parallel. The prompt in use by LongWriter is:
“You are an excellent writing assistant. I will give you an original writing instruction and my planned writing steps. I will also provide you with the text I have already written. Please help me continue writing the next paragraph based on the writing instruction, writing steps, and the already written text.
Writing instruction:
{User Instruction}
Writing steps:
{The writing plan generated in Step I}
Already written text:
{Previous generated (n-1) paragraphs}
Please integrate the original writing instruction, writing steps, and the already written text, and now continue writing {The plan for the n-th paragraph, i.e., the n-th line in the writing plan}.”
Validation
LongWriter tests the generation length and quality of the proposed AgentWrite method on two long-form writing datasets. The first one, LongWrite-Ruler, is used to measure exactly how long of an output the method can provide. The second, LongBench-Write, is mainly used to evaluate how well the model-generated content aligns with user instructions in terms of length and writing quality.
LongBench-Write: To evaluate the model’s performance on a more diverse range of long-form writing instructions, LongWriter collects 120 varied user writing prompts, with 60 in Chinese and 60 in English. To better assess whether the model’s output length meets user requirements, LongWriter ensures that all these instructions include explicit word count requirements. These instructions are divided into four subsets based on the word count requirements: 0-500 words, 500-2,000 words, 2,000-4,000 words, and over 4,000 words. Additionally, the instructions are categorized into seven types based on the output type: Literature and Creative Writing, Academic and Monograph, Popular Science, Functional Writing, News Report, Community Forum, and Education and Training.
During evaluation, LongWriter adopts two metrics: one for scoring the output length and another for scoring the output quality. The model’s output length is scored based on how close it is to the requirements specified in the instructions. For output quality, LongWriter uses the LLM-as-a-judge approach, selecting the state-of-the-art GPT-4o model to score the output across six dimensions: Relevance, Accuracy, Coherence, Clarity, Breadth and Depth, and Reading Experience. The final score is computed by averaging the length score and the quality score.
Validation results: LongWriter presents the output length measurement on LongWrite-Ruler and finds that AgentWrite successfully extends the output length of GPT-4o from a maximum of 2k words to approximately 20k words. LongWriter also assesses both the output quality and adherence to the required output length on LongBench-Write, showing that GPT-4o can successfully complete tasks with outputs under 2,000 words in length when evaluating AgentWrite’s performance.
Supervised Fine-Tuning
LongWriter conducts training based on two of the latest open-source models, namely GLM-4-9B and Llama-3.1-8B. Both of these are base models and support a context window of up to 128k tokens, making them naturally suitable for training on long outputs. To make the training more efficient, LongWriter adopts packing training with loss weighting. The training on the two models results in two models: LongWriter-9B (abbreviated for GLM-4-9B-LongWriter) and LongWriter-8B (abbreviated for Llama-3.1-8B-LongWriter).
At the same time, LongWriter notices that if the loss is averaged by sequence, i.e., taking the mean of each sequence’s average loss within a batch, the contribution of each target token to the loss in long output data would be significantly less than those with shorter outputs. In LongWriter’s experiments, it is also found that this leads to suboptimal model performance on tasks with long outputs. Therefore, LongWriter chooses a loss weighting strategy that averages the loss by token, where the loss is computed as the mean of losses across all target tokens within that batch.
All models are trained using a node with 8xH800 80G GPUs and DeepSpeed+ZeRO3+CPU offloading. LongWriter uses a batch size of 8, a learning rate of 1e-5, and a packing length of 32k. The models are trained for 4 epochs, which takes approximately 2,500-3,000 steps.
Alignment (DPO)
To further improve the model’s output quality and enhance its ability to follow length constraints in instructions, LongWriter performs direct preference optimization (DPO) on the supervised fine-tuned LongWriter-9B model. The DPO data comes from GLM-4’s chat DPO data (approximately 50k entries). Additionally, LongWriter constructs 4k pairs of data specifically targeting long-form writing instructions. For each writing instruction, LongWriter samples 4 outputs from LongWriter-9B and scores these outputs following a specific method. A length-following score is also combined as computed. The highest-scoring output is then selected as the positive sample, and one of the remaining three outputs is randomly chosen as the negative sample.
The resulting model, LongWriter-9B-DPO, is trained for 250 steps on the above data mixture. LongWriter follows a specific recipe for DPO training.
LongWriter: Experiments and Results
LongWriter evaluates 4 proprietary models and 5 open-source models on LongBench-Write, along with the trained LongWriter models. To the best of LongWriter’s knowledge, Suri-IORPO is the only prior model that is also aligned for long-form text generation. It is trained based on Mistral-7B-Instruct-v0.2 using LoRA. Consistent with the evaluation setup on LongWrite-Ruler, LongWriter sets the output temperature to 0.5 and configures the model’s generation max tokens parameter to the maximum allowed by its API call. For open-source models, it is set to 32,768.
Most previous models are unable to meet the length requirement of over 2,000 words, while LongWriter models consistently provide longer and richer responses to such prompts.
Observing the output length score SlS_lSl for prompts in each required length range, LongWriter finds that previous models generally perform poorly (scoring below 70) on prompts in the [2k, 4k) range, with only Claude 3.5 Sonnet achieving a decent score. For prompts in the [4k, 20k) range, almost all previous models are completely unable to reach the target output length, even scoring 0 (meaning all output lengths are less than one-third of the required length). By adding training data from LongWriter-6k, LongWriter’s trained model can effectively reach the required output length while maintaining good quality, as suggested by the scores in the [2k, 20k) range and the scatter plots.
DPO effectively improves both the model’s output quality and its ability to follow length requirements in long generation.
By comparing the scores of LongWriter-9B and LongWriter9B-DPO, we find that DPO significantly improves both Sl (+4%) and Sq (+3%) scores, and the improvement is consistent across all ranges. This shows that in long generation scenario, DPO still helps to improve the model’s output quality and can better align the model’s output length with 8 Preprint Figure 7: Cumulative average NLL loss of GLM4-9B and Llama-3.1-8B at different positions of LongWriter models’ outputs. Figure 8: LongWrite-Ruler test results of LongWriter models, showing their maximum generation lengths between 10k-20k words. the requested length. The latter conclusion has also been recently observed in Yuan et al. (2024) in shorter generations. We also manually annotate pairwise wins and losses for GPT-4o and three longwriter models on their outputs in LongBench-Write and visualize the results in Figure 9. We can see that humans prefer the DPO-trained model over LongWriter-9B in 58% of the cases. Moreover, despite having fewer parameters, LongWriter-9B-DPO achieves a tie with GPT-4o.
The output length limit of the LongWriter models is extended to between 10k and 20k words, while more data with long outputs is required to support even longer outputs.
Following the LongWrite-Ruler tes,we also present the LongWrite-Ruler test results of LongWriter models. The results suggest that their maximum generation lengths are between 10k-20k words. The lack of SFT data with longer outputs is likely the primary reason preventing the model from achieving longer output lengths.
Final Thoughts
In this work, we have talked about LongWriter, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks, identifies a 2,000-word generation limit for current LLMs and proposes increasing their output window size by adding long-output data during alignment. To automatically construct long-output data, LongWriter develops AgentWrite, an agent-based pipeline that uses off-the-shelf LLMs to create extended, coherent outputs. LongWriter successfully scales the output window size of current LLMs to over 10,000 words with the constructed LongWriter-6k. Extensive ablation studies on the training data demonstrate the effectiveness of this approach. For future work, LongWriter suggests the following three directions: 1. Expand the AgentWrite framework to construct data with longer outputs to further extend LLMs’ output window size. 2. Refine the AgentWrite framework to achieve higher quality long-output data. 3. Longer model outputs bring challenges to inference efficiency. Several methods have been proposed to improve inference efficiency. It is worth investigating how these methods can ensure improved model efficiency without compromising generation quality.