Qwen 2.5 72B Instruct Turbo and Claude 3.5 Sonnet 20241022 are two leading AI models in the field of programming and code generation. Qwen 2.5 is recognized for its precision and code reliability, while Claude 3.5 Sonnet shines in its nuanced understanding and contextual adaptability.
In this article, we will compare the code generated by these two models, focusing on syntax quality, structural coherence, and overall performance. Our in-depth look at their coding capabilities will help you decide which model best suits your programming needs.
Benchmarks and specs
Specs
This section compares the specifications of the two newest language models from QwenLM and Anthropic.
| Specification | Qwen2.5-72B | Claude 3.5 Sonnet |
| --- | --- | --- |
| Input Context Window | 128K | 200K |
| Maximum Output Tokens | 8K | 8K |
| Number of parameters in the LLM | 72.7B | - |
| Knowledge cutoff | September 2024 | April 2024 |
| Release Date | September 19, 2024 | October 22, 2024 |
| Output tokens per second | 38.4 | 80 |
The primary distinctions between Qwen 2.5 and Claude 3.5 Sonnet lie in their input context windows and output token rates. Qwen 2.5 provides a 128K token input context window, while Claude 3.5 Sonnet extends this to a larger 200K tokens. Both models have a maximum output token capacity of 8K. In terms of processing speed, Claude 3.5 Sonnet leads with an output rate of 80 tokens per second, compared to Qwen 2.5’s rate of 38.4 tokens per second.
The models also differ in terms of knowledge updates: Qwen 2.5 has a cutoff date of September 2024, whereas Claude 3.5 Sonnet's knowledge is up to date as of April 2024. The release dates for these models are close, with Qwen 2.5 launching on September 19, 2024, and Claude 3.5 Sonnet following on October 22, 2024. Additionally, Qwen 2.5 has 72.7 billion parameters in its language model, while the parameter count for Claude 3.5 Sonnet is not specified.
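To put the throughput figures in perspective, the time needed to stream a response is roughly the token count divided by the output rate. The sketch below is a back-of-the-envelope estimate only: it assumes "8K" means 8,000 tokens and ignores time-to-first-token and network latency, neither of which is given in the table.

```python
# Output rates (tokens per second) taken from the specs table above.
OUTPUT_RATE = {"Qwen2.5-72B": 38.4, "Claude 3.5 Sonnet": 80.0}


def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Rough streaming time: number of tokens divided by the output rate."""
    return tokens / tokens_per_second


if __name__ == "__main__":
    max_output = 8_000  # assumption: "8K" read as 8,000 tokens
    for model, rate in OUTPUT_RATE.items():
        print(f"{model}: {generation_seconds(max_output, rate):.1f} s")
```

Under these assumptions a maximum-length response takes about twice as long to stream from Qwen 2.5 as from Claude 3.5 Sonnet.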
Benchmarks
The scores below combine figures from the official release notes for both models (Qwen2.5 and Claude 3.5 Sonnet) with results from multiple open benchmarks.
| Benchmark | Qwen2.5-72B | Claude 3.5 Sonnet |
| --- | --- | --- |
| Undergraduate level knowledge (MMLU) | 71.1 | 78.0 |
| Graduate level reasoning (GPQA) | 49.0 | 65.0 |
| Code (HumanEval) | 86.6 | 93.7 |
| Math problem-solving (MATH) | 83.1 | 78.3 |
| Reasoning (DROP, F1) | - | 88.3 |
Benchmark results indicate that Claude 3.5 Sonnet outperforms Qwen 2.5 in undergraduate-level knowledge, graduate-level reasoning, and code generation, while Qwen 2.5 holds an advantage in math problem-solving. No DROP score is listed for Qwen 2.5, so the two models cannot be compared directly on that reasoning benchmark.
In this article, we’ll focus on each model's performance in coding tasks to help determine which is better suited for specific use cases.
Practical tests
1. Algorithms
Convert a non-negative integer num to its English words representation
Example:
Input: num = 123
Output: "One Hundred Twenty Three"
Qwen2.5-72B output
def number_to_words(num):
    if num == 0:
        return "Zero"
    def one(num):
        switcher = {
            1: 'One', 2: 'Two', 3: 'Three', 4: 'Four', 5: 'Five',
            6: 'Six', 7: 'Seven', 8: 'Eight', 9: 'Nine'
        }
        return switcher.get(num)
    def two_less_20(num):
        switcher = {
            10: 'Ten', 11: 'Eleven', 12: 'Twelve', 13: 'Thirteen', 14: 'Fourteen',
            15: 'Fifteen', 16: 'Sixteen', 17: 'Seventeen', 18: 'Eighteen', 19: 'Nineteen'
        }
        return switcher.get(num)
    def ten(num):
        switcher = {
            2: 'Twenty', 3: 'Thirty', 4: 'Forty', 5: 'Fifty',
            6: 'Sixty', 7: 'Seventy', 8: 'Eighty', 9: 'Ninety'
        }
        return switcher.get(num)
    def two(num):
        if not num:
            return ''
        elif num < 10:
            return one(num)
        elif num < 20:
            return two_less_20(num)
        else:
            tenner = num // 10
            rest = num - tenner * 10
            return ten(tenner) + (' ' + one(rest) if rest else '')
    def three(num):
        hundred = num // 100
        rest = num - hundred * 100
        if hundred and rest:
            return one(hundred) + ' Hundred ' + two(rest)
        elif hundred and not rest:
            return one(hundred) + ' Hundred'
        else:
            return two(rest)
    billion = num // 1000000000
    million = (num - billion * 1000000000) // 1000000
    thousand = (num - billion * 1000000000 - million * 1000000) // 1000
    remainder = num - billion * 1000000000 - million * 1000000 - thousand * 1000
    result = ''
    if billion:
        result += three(billion) + ' Billion'
    if million:
        result += ' ' if result else ''
        result += three(million) + ' Million'
    if thousand:
        result += ' ' if result else ''
        result += three(thousand) + ' Thousand'
    if remainder:
        result += ' ' if result else ''
        result += three(remainder)
    return result
Claude 3.5 output
def numberToWords(num: int) -> str:
    # Handle zero case separately
    if num == 0:
        return "Zero"
    # Helper function to convert numbers less than 1000
    def helper(n):
        under20 = ["", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine",
                   "Ten", "Eleven", "Twelve", "Thirteen", "Fourteen", "Fifteen", "Sixteen",
                   "Seventeen", "Eighteen", "Nineteen"]
        tens = ["", "", "Twenty", "Thirty", "Forty", "Fifty", "Sixty", "Seventy", "Eighty", "Ninety"]
        if n == 0:
            return []
        elif n < 20:
            return [under20[n]]
        elif n < 100:
            return [tens[n//10]] + helper(n%10)
        else:
            return [under20[n//100], "Hundred"] + helper(n%100)
    # Split number into groups of three digits
    billion = num // 1000000000
    million = (num % 1000000000) // 1000000
    thousand = (num % 1000000) // 1000
    rest = num % 1000
    result = []
    if billion:
        result.extend(helper(billion))
        result.append("Billion")
    if million:
        result.extend(helper(million))
        result.append("Million")
    if thousand:
        result.extend(helper(thousand))
        result.append("Thousand")
    if rest:
        result.extend(helper(rest))
    return " ".join(result)
Qwen2.5-72B
Correct Answer: ✅ (5/5)
Qwen provided the best possible solution.
Claude 3.5
Correct Answer: ✅ (4/5)
Claude provided a solid solution, but it falls short compared to Qwen's.
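Both solutions can be checked against the same edge cases. The sketch below is not either model's code: `to_words` is a compact reference implementation written for illustration, and the chosen cases (zero, the example from the prompt, a number with empty groups, and the largest 32-bit signed integer) are assumptions about what is worth testing. Either `number_to_words` or `numberToWords` can be substituted for it.

```python
UNDER_20 = ["", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight",
            "Nine", "Ten", "Eleven", "Twelve", "Thirteen", "Fourteen", "Fifteen",
            "Sixteen", "Seventeen", "Eighteen", "Nineteen"]
TENS = ["", "", "Twenty", "Thirty", "Forty", "Fifty", "Sixty", "Seventy",
        "Eighty", "Ninety"]
SCALES = [(1_000_000_000, "Billion"), (1_000_000, "Million"), (1_000, "Thousand")]


def _under_1000(n: int) -> list[str]:
    """Words for 0 <= n < 1000; returns an empty list for 0."""
    words = []
    if n >= 100:
        words += [UNDER_20[n // 100], "Hundred"]
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10])
        n %= 10
    if n:
        words.append(UNDER_20[n])
    return words


def to_words(num: int) -> str:
    """Illustrative reference; assumes 0 <= num < 10**12."""
    if num == 0:
        return "Zero"
    words = []
    for value, name in SCALES:
        if num >= value:
            words += _under_1000(num // value) + [name]
            num %= value
    words += _under_1000(num)
    return " ".join(words)


CASES = {
    0: "Zero",
    123: "One Hundred Twenty Three",
    1_000_010: "One Million Ten",
    2_147_483_647: ("Two Billion One Hundred Forty Seven Million "
                    "Four Hundred Eighty Three Thousand Six Hundred Forty Seven"),
}

for number, expected in CASES.items():
    assert to_words(number) == expected, (number, to_words(number))
```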
2. Debugging
Find all the bugs that are in this code and that can be encountered and write a fully working code
products = {"Bread": 2.0, "Tea": 2.5, "Cookie": 5.2}
for item in products.keys()
    income = 0
    qty = input(f"How many {item}s have you sold? ")
    income = income + qty * products[item]
print(f"\nThe income today was ${income:0.2f}")
Qwen2.5-72B output
products = {"Bread": 2.0, "Tea": 2.5, "Cookie": 5.2}
income = 0
for item in products.keys():
    qty = input(f"How many {item}s have you sold? ")
    try:
        qty = int(qty)  # Convert input to integer
        income += qty * products[item]  # Accumulate income
    except ValueError:
        print(f"Invalid input for {item}. Please enter a valid number.")
        continue
print(f"The income today was ${income:0.2f}")
Claude 3.5 output
products = {"Bread": 2.0, "Tea": 2.5, "Cookie": 5.2}
income = 0
for item in products.keys():
    while True:
        try:
            qty = float(input(f"How many {item}s have you sold? "))
            if qty >= 0:  # Check for negative numbers
                income = income + qty * products[item]
                break
            else:
                print("Please enter a positive number.")
        except ValueError:
            print("Please enter a valid number.")
print(f"The income today was ${income:.2f}")
Qwen2.5-72B
Correct Answer: ✅ (4/5)
Advantages of Qwen:
- Qwen's code is simpler and may be sufficient for scenarios where basic error handling is acceptable. However, this simplicity comes at the cost of robustness: an invalid entry is reported and that product is skipped instead of being asked for again.
Claude 3.5
Correct Answer: ✅ (4/5)
Advantages of Claude:
- Uses a while True loop to repeatedly prompt the user until a valid, non-negative number is entered. This ensures the program doesn't continue with invalid input.
- Checks that the input is a non-negative number (qty >= 0), which prevents the entry of negative quantities that don't make sense in the context of selling products.
- Gives feedback when the input is invalid or negative, guiding the user to enter a correct value. This improves the user experience by stating what went wrong.
- Converts the input to a float rather than an int, allowing for fractional quantities. This is useful if items are sold in units smaller than 1 (e.g., 1.5 kilograms of a product), making the program more flexible.
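The re-prompting strategy described above can be pulled into a small helper that is testable without a keyboard. This is a sketch, not either model's code: the name `read_quantity` and the injectable `read` parameter are assumptions made for illustration, and the `math.isfinite` check is an extra guard because `float()` also accepts strings such as "nan" and "inf".

```python
import math
from typing import Callable


def read_quantity(prompt: str, read: Callable[[str], str] = input) -> float:
    """Keep asking until the reply parses as a finite, non-negative number."""
    while True:
        raw = read(prompt)
        try:
            qty = float(raw)
        except ValueError:
            print("Please enter a valid number.")
            continue
        # float() accepts "nan" and "inf", so reject those explicitly.
        if not math.isfinite(qty) or qty < 0:
            print("Please enter a non-negative number.")
            continue
        return qty


# Scripted replies stand in for a user: two bad entries, then a good one.
replies = iter(["abc", "-2", "3"])
assert read_quantity("How many? ", read=lambda _prompt: next(replies)) == 3.0
```

Passing the reader as a parameter keeps the default behaviour (`input`) while letting a test feed in canned answers.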
3. Understanding of libraries and frameworks
Generate synthetic data for points in 2D space
and classify them using a machine learning model.
Steps:
1. Using numpy, generate 100 points for each class with
different means and standard deviations for x and y coordinates
2. Using pandas DataFrame, create a dataset with x, y,
and class columns, then shuffle it
3. Visualize the points using matplotlib
4. Using scikit-learn, train the model
5. Calculate evaluation metrics such as precision,
accuracy, recall, and F1-score
Qwen2.5-72B output
Correct Answer: ✅ (4/5)
- Used Logistic Regression for binary classification.
- Created two classes with Gaussian distributions.
- Computed metrics such as accuracy, precision, recall, and F1-score.
- Constructed a simple scatterplot to visualize the distribution of points.
Claude 3.5 output
Correct Answer: ✅ (5/5)
- Used k-nearest neighbors (k-NN) for classification.
- Created two classes with Gaussian distributions.
- Scaled features using StandardScaler, which is important for k-NN to work well since it relies on distance calculations.
- Used a confusion matrix and classification report to evaluate the model.
- Included a visualization of the confusion matrix using seaborn, which provides a better understanding of the model performance.
Both models offer code suitable for various scenarios, but Claude demonstrates a more skillful use of the libraries.
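The evaluation metrics requested in step 5 all derive from the four confusion-matrix counts. The pure-Python sketch below uses made-up labels (illustrative data, not results from either model's run) to show how they relate; scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` compute the same quantities.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up example: 4 positives and 4 negatives, with 3 mistakes.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```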
4. Refactoring and optimization
Optimize and refactor this code:
async def __get_query_sort_items(self, query, type_sort):
    try:
        if type_sort == 'desc_price':
            query = query.order_by(self.model.price.desc())
        elif type_sort == 'asc_price':
            query = query.order_by(self.model.price)
        elif type_sort == 'best':
            query = query.order_by(func.avg(Reviews.rate).desc())
        elif type_sort == 'popular':
            query = query.order_by(func.count(purchase_item_user.c.item_id).desc())
    except (SQLAlchemyError, Exception) as e:
        msg = 'Database' if isinstance(e, SQLAlchemyError) else 'Unknown'
        msg += ' Exc. Cannot get query sort items'
        extra = {
            'type_sort': type_sort,
        }
        logger.error(msg, extra=extra, exc_info=True)
    return query
Qwen2.5-72B output
from sqlalchemy import func
from sqlalchemy.exc import SQLAlchemyError
async def __get_query_sort_items(self, query, type_sort):
    try:
        sort_map = {
            'desc_price': self.model.price.desc(),
            'asc_price': self.model.price,
            'best': func.avg(Reviews.rate).desc(),
            'popular': func.count(purchase_item_user.c.item_id).desc()
        }
        if type_sort in sort_map:
            query = query.order_by(sort_map[type_sort])
        else:
            logger.warning(f"Unknown sort type: {type_sort}")
    except SQLAlchemyError as e:
        logger.error('Database Exc. Cannot get query sort items', extra={'type_sort': type_sort}, exc_info=True)
    except Exception as e:
        logger.error('Unknown Exc. Cannot get query sort items', extra={'type_sort': type_sort}, exc_info=True)
    return query
Claude 3.5 output
from enum import Enum
from typing import Optional
from sqlalchemy import func
from sqlalchemy.exc import SQLAlchemyError
class SortType(Enum):
    DESC_PRICE = 'desc_price'
    ASC_PRICE = 'asc_price'
    BEST = 'best'
    POPULAR = 'popular'
class QuerySorter:
    SORT_MAPPINGS = {
        SortType.DESC_PRICE: lambda model: model.price.desc(),
        SortType.ASC_PRICE: lambda model: model.price,
        SortType.BEST: lambda _: func.avg(Reviews.rate).desc(),
        SortType.POPULAR: lambda _: func.count(purchase_item_user.c.item_id).desc()
    }
    @staticmethod
    async def sort_query(query, type_sort: str, model) -> Optional[Query]:
        """
        Sort a query based on the specified sort type.
        Args:
            query: The base query to sort
            type_sort: The type of sorting to apply
            model: The model class containing the sorting fields
        Returns:
            Sorted query or None if an error occurs
        """
        try:
            sort_type = SortType(type_sort)
            sort_function = QuerySorter.SORT_MAPPINGS.get(sort_type)
            if sort_function:
                return query.order_by(sort_function(model))
            return query
        except (SQLAlchemyError, ValueError) as e:
            error_type = 'Database' if isinstance(e, SQLAlchemyError) else 'Invalid sort type'
            logger.error(
                f'{error_type} error: Cannot sort items',
                extra={'type_sort': type_sort},
                exc_info=True
            )
            return query  # Return original query on error
Qwen2.5-72B
Correct Answer: ✅ (4/5)
Changes and Improvements:
- Dictionary Mapping: Used a dictionary `sort_map` to map `type_sort` values to their corresponding sorting expressions. This makes the code more readable and maintainable.
- Error Handling: Simplified the error handling by separating the `SQLAlchemyError` and `Exception` cases.
- Logging: Added a warning log for unknown sort types to provide feedback if an invalid `type_sort` is provided.
- Code Readability: Reduced the number of `if-elif` statements, making the code cleaner and easier to understand.
Claude 3.5
Correct Answer: ✅ (5/5)
Key improvements made:
- Enum Class: Added `SortType` enum for better type safety and maintainability.
- Mapping Dictionary: Created a `SORT_MAPPINGS` dictionary to map sort types to their corresponding sort functions, eliminating the need for multiple if-elif statements.
- Type Hints: Added type hints for better code documentation and IDE support.
- Better Error Handling:
  - Separated database errors from invalid sort type errors
  - More specific error messages
  - Returns the original query instead of potentially undefined behavior
- Documentation: Added docstring explaining the method's purpose and parameters
- Static Method: Made the method static since it doesn't use instance attributes
See how Claude handled another refactoring compared to GPT-o1 mini: GPT-o1 mini vs Claude 3.5 Sonnet
Claude provided better code, as it incorporates several improvements that enhance the code's maintainability, readability, and error handling.
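The enum-plus-mapping pattern can be tried in isolation. In the sketch below the SQLAlchemy expressions are replaced by plain strings so that it runs without a database; `ORDER_BY` and `resolve_order` are hypothetical names introduced for illustration, not part of either model's answer. The point it demonstrates is that constructing an `Enum` member from an unknown value raises `ValueError`, which is what lets the invalid-sort-type case be handled separately.

```python
from enum import Enum
from typing import Optional


class SortType(Enum):
    DESC_PRICE = "desc_price"
    ASC_PRICE = "asc_price"
    BEST = "best"
    POPULAR = "popular"


# Hypothetical stand-ins for the real ordering expressions.
ORDER_BY = {
    SortType.DESC_PRICE: "price DESC",
    SortType.ASC_PRICE: "price ASC",
    SortType.BEST: "AVG(rate) DESC",
    SortType.POPULAR: "COUNT(item_id) DESC",
}


def resolve_order(type_sort: str) -> Optional[str]:
    """Return the ordering for a known sort type, or None if it is unknown."""
    try:
        return ORDER_BY[SortType(type_sort)]  # SortType("x") raises ValueError
    except ValueError:
        return None


print(resolve_order("best"))
print(resolve_order("cheapest"))
```

Adding a new sort option then means adding one enum member and one mapping entry, with no change to the lookup logic.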
5. Fullstack web application
Develop a fullstack web application where users can play a word-guessing game.
The game involves guessing a word letter by letter, similar to "Hangman."
The application should have a React-based frontend and a FastAPI backend,
ensuring a smooth and interactive user experience.
Frontend Requirements (React):
- A simple and intuitive interface
- A display showing the number of attempts remaining
- A button to start a new game.
- Track the game state (ongoing, won, lost) and display appropriate messages.
- Reset the game state when starting a new game.
API Endpoints:
- Start a new game
- Check a letter guess
- Get game state
Game Logic on the Server Side:
- Maintain the game state for each session,
including the randomly chosen word and guesses.
- Ensure game state consistency across API calls.
Qwen2.5-72B output
Correct Answer: ❌ (2/5)
Unfortunately, the model was unable to produce working code, even after multiple requests to fix the non-functional parts. An error consistently appears when clicking on any of the letters, which Qwen was unable to resolve.
Claude 3.5 output
Correct Answer: ✅ (5/5)
Claude successfully wrote working code on the first attempt.
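The server-side requirement (one game state per session, consistent across calls) can be illustrated without any web framework. The sketch below is not taken from either model's answer: the `Game` class, the in-memory `GAMES` store, the word list, and the six-attempt limit are all assumptions made for illustration. In a FastAPI app, the three endpoints from the prompt would be thin wrappers around these functions.

```python
import random
import uuid
from dataclasses import dataclass, field
from typing import Optional

WORDS = ["python", "react", "fastapi"]  # hypothetical word list
MAX_ATTEMPTS = 6                        # assumption: six wrong guesses allowed


@dataclass
class Game:
    word: str
    guessed: set = field(default_factory=set)
    attempts_left: int = MAX_ATTEMPTS

    @property
    def status(self) -> str:
        if all(ch in self.guessed for ch in self.word):
            return "won"
        return "lost" if self.attempts_left <= 0 else "ongoing"

    def guess(self, letter: str) -> None:
        letter = letter.lower()
        # Ignore guesses after the game has ended and repeated letters.
        if self.status != "ongoing" or letter in self.guessed:
            return
        self.guessed.add(letter)
        if letter not in self.word:
            self.attempts_left -= 1

    def masked(self) -> str:
        return " ".join(ch if ch in self.guessed else "_" for ch in self.word)


GAMES: dict[str, Game] = {}  # session id -> game state, kept in memory


def start_game(word: Optional[str] = None) -> str:
    """Create a game and return its session id."""
    game_id = uuid.uuid4().hex
    GAMES[game_id] = Game(word or random.choice(WORDS))
    return game_id
```

Keying the store by a session id returned from the "start" call is one simple way to keep state consistent between the "guess" and "get state" calls; an in-memory dict is lost on restart, so a real deployment would likely need persistent storage.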
6. Writing documentation
I will provide the code written by Claude in the previous test and ask both models to create documentation for it.
Write detailed documentation for my application:
main.py
...
App.js
...
Qwen2.5-72B output
# Application Documentation
## Overview
...
## Backend (FastAPI)
### File: `main.py`
#### Dependencies
- `FastAPI`: A modern, fast (high-performance) web framework for building APIs with Python 3.7+ based on standard Python type hints.
...
### Code Explanation
...
## Frontend (React)
### File: `App.js`
...
### Code Explanation
...
## Running the Application
### Backend
...
### Frontend
...
## Conclusion
...
Claude 3.5 output
# Word Guessing Game Documentation
## Overview
...
## Backend (FastAPI)
### Setup and Configuration
...
### Models
...
### API Endpoints
...
### Game Logic
...
## Frontend (React)
### Components
...
### State Management
...
### Key Functions
...
### User Interface Elements
...
### Error Handling
...
## Setup Instructions
### Backend
...
### Frontend
...
## Technical Requirements
- Python 3.6+
...
## Notes
...
Qwen2.5-72B
Correct Answer: ✅ (5/5)
The documentation provided adheres fully to the instructions and includes a detailed description of each point.
Claude 3.5
Correct Answer: ✅ (4/5)
Claude provided good documentation; however, it lists Python 3.6+ as the technical requirement, while the minimum Python version required to run this app is 3.7.
7. 3D game
Create a first-person game where the player stands still
and shoots at targets that appear one at a time.
- Use the mouse to control the player.
- The player earns points for each successful hit.
- If he misses, he loses one of his three lives.
- For every 15 successful hits, the player gets an extra life.
- Use Python and the Ursina library
Qwen2.5-72B output
Correct Answer: ✅ (3/5)
A good result, but it was only achieved after several rounds of improvements.
Disadvantages:
- The game closes immediately after all lives are lost
- Text is difficult to see
- Targets appear in the same places
Claude 3.5 output
Correct Answer: ✅ (4/5)
A good result, but it was only achieved after several rounds of improvements.
Disadvantages:
- Difficult to navigate in space
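The scoring rules from the prompt are independent of the rendering library and can be expressed and tested on their own. The sketch below is not either model's code: the `Scoreboard` class is a hypothetical name, and awarding one point per hit is an assumption, since the prompt does not say how many points a hit is worth.

```python
from dataclasses import dataclass


@dataclass
class Scoreboard:
    score: int = 0
    lives: int = 3   # the player starts with three lives
    hits: int = 0

    @property
    def game_over(self) -> bool:
        return self.lives <= 0

    def register_shot(self, hit: bool) -> None:
        """Apply one shot: a hit scores, a miss costs a life."""
        if self.game_over:
            return  # no further changes once all lives are gone
        if hit:
            self.hits += 1
            self.score += 1          # assumption: one point per hit
            if self.hits % 15 == 0:  # every 15th hit grants an extra life
                self.lives += 1
        else:
            self.lives -= 1


board = Scoreboard()
for _ in range(15):
    board.register_shot(hit=True)
print(board)
```

Keeping these rules in a plain class makes it possible to show a "game over" screen from `game_over` instead of closing the window, which is the first disadvantage noted for Qwen's version.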
See how Claude 3.5 Sonnet performed in other games compared to ChatGPT 4o: Claude Sonnet 3.5 VS ChatGPT 4o
Pricing
| Price per 1K tokens | Qwen2.5-72B | Claude 3.5 Sonnet |
| --- | --- | --- |
| Input price | $0.00126 | $0.00315 |
| Output price | $0.00126 | $0.01575 |
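Because the output price differs far more than the input price, the cost gap depends on how much text a request generates. A minimal sketch of the arithmetic, using the per-1K prices from the table; the request size (10,000 input tokens and 2,000 output tokens) is an arbitrary example, not a measured workload.

```python
# USD per 1K tokens, copied from the pricing table above.
PRICE_PER_1K = {
    "Qwen2.5-72B": {"input": 0.00126, "output": 0.00126},
    "Claude 3.5 Sonnet": {"input": 0.00315, "output": 0.01575},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request, billed per 1K tokens."""
    price = PRICE_PER_1K[model]
    return (input_tokens / 1000 * price["input"]
            + output_tokens / 1000 * price["output"])


if __name__ == "__main__":
    for model in PRICE_PER_1K:
        cost = request_cost(model, input_tokens=10_000, output_tokens=2_000)
        print(f"{model}: ${cost:.5f}")
```

For this example request, Claude 3.5 Sonnet costs roughly four times as much as Qwen 2.5, and the ratio grows as the share of output tokens increases.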
Conclusion
Strengths and Weaknesses of Each Model
Qwen 2.5
Strengths:
- Algorithms: Excels in providing optimized solutions for algorithmic problems, often delivering the best possible answers.
- Documentation: Delivers comprehensive, well-structured documentation.
Weaknesses:
- Dynamic Applications: Struggles with complex, interactive projects like full-stack applications and 3D games, occasionally unable to resolve functionality issues.
- Advanced Error Handling: Simplified debugging lacks the robustness of Claude's approach, making it less suited for complex, user-centered error handling.
- Library Utilization: Does not demonstrate the same level of skillful library use as Claude, particularly in machine learning and data science applications.
Claude 3.5 Sonnet
Strengths:
- Dynamic Coding Applications: Excels in handling complex projects: it produced a working full-stack application on the first attempt and the better 3D game of the two, although the game needed several rounds of improvements.
- Debugging and Error Handling: Employs advanced error handling strategies that offer more robust solutions, addressing edge cases and improving user experience.
- Library Mastery: Demonstrates skillful use of libraries and frameworks, especially in machine learning, including visualizations and feature scaling.
- Refactoring and Optimization: Delivers more maintainable and readable code with enums, type hints, and static methods for improved structure and reliability.
Weaknesses:
- Algorithmic Challenges: Occasionally provides less optimized solutions for algorithmic and math-focused tasks.
- Documentation Details: Although it creates solid documentation, it sometimes gets minor details wrong, such as version requirements, which impacts usability.
Best Use Cases
When to Use Qwen 2.5:
- Algorithm Development and Math Tasks: Ideal for tasks that require computational precision, algorithm optimization, and problem-solving in structured scenarios.
- Basic Debugging and Documentation Needs: Suitable for projects that prioritize straightforward debugging and require clear documentation without extensive setup details.
- Optimized Solutions: Well-suited for projects where code efficiency and clarity are more critical than interactive features.
When to Use Claude 3.5 Sonnet:
- Dynamic Projects and Full-Stack Development: Ideal for complex, user-centered applications such as web or 3D game development.
- Advanced Debugging and Error Handling: The best choice for projects needing robust handling of user input and error prevention.
- Library and Framework Utilization: Perfect for tasks involving up-to-date libraries, data processing, or machine learning applications, where library mastery can significantly enhance output quality.