November 1, 2024

Best AI for coding. Qwen 2.5 Vs Claude 3.5 Sonnet comparison

Qwen 2.5 72B Instruct Turbo and Claude 3.5 Sonnet 20241022 are two leading AI models in the field of programming and code generation. Qwen 2.5 is recognized for its precision and code reliability, while Claude 3.5 Sonnet shines in its nuanced understanding and contextual adaptability.

In this article, we will compare the code generated by these two models, focusing on syntax quality, structural coherence, and overall performance. Our in-depth look at their coding capabilities will help you decide which model best suits your programming needs.

Benchmarks and specs

Specs

This is a comparison of the two newest language models from QwenLM and Anthropic.

Specification	Qwen2.5-72B	Claude 3.5 Sonnet
Input Context Window	128K	200K
Maximum Output Tokens	8K	8K
Number of parameters in the LLM	72.7B	-
Knowledge cutoff	September 2024	April 2024
Release Date	September 19, 2024	October 22, 2024
Output tokens per second	38.4	80

The primary distinctions between Qwen 2.5 and Claude 3.5 Sonnet lie in their input context windows and output token rates. Qwen 2.5 provides a 128K token input context window, while Claude 3.5 Sonnet extends this to a larger 200K tokens. Both models have a maximum output token capacity of 8K. In terms of processing speed, Claude 3.5 Sonnet leads with an output rate of 80 tokens per second, compared to Qwen 2.5’s rate of 38.4 tokens per second.

The models also differ in terms of knowledge updates: Qwen 2.5 has a cutoff date of September 2024, whereas Claude 3.5 Sonnet's knowledge is up to date as of April 2024. The release dates for these models are close, with Qwen 2.5 launching on September 19, 2024, and Claude 3.5 Sonnet following on October 22, 2024. Additionally, Qwen 2.5 has 72.7 billion parameters in its language model, while the parameter count for Claude 3.5 Sonnet is not specified.
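To make the throughput gap concrete, the rates from the table can be turned into a rough time estimate. The sketch below assumes that "8K" means 8,000 tokens and that the output rate stays constant for the whole response; neither assumption is guaranteed in practice, and `seconds_to_generate` is a hypothetical helper name.

```python
# Rough time to stream a full-length response at a constant output rate.
# Assumption: "8K" is treated as 8,000 tokens; real limits and rates vary.
OUTPUT_TOKENS_PER_SECOND = {
    "Qwen2.5-72B": 38.4,
    "Claude 3.5 Sonnet": 80.0,
}


def seconds_to_generate(tokens: int, tokens_per_second: float) -> float:
    """Return the estimated generation time in seconds."""
    return tokens / tokens_per_second


for model, rate in OUTPUT_TOKENS_PER_SECOND.items():
    print(f"{model}: about {seconds_to_generate(8000, rate):.0f} s for 8,000 tokens")
```

Under these assumptions a maximum-length answer takes roughly twice as long on Qwen 2.5 as on Claude 3.5 Sonnet.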

Benchmarks

These figures combine the official release notes for both models (Qwen2.5 and Claude 3.5 Sonnet) with results from multiple open benchmarks.

Benchmark	Qwen2.5-72B	Claude 3.5 Sonnet
Undergraduate level knowledge MMLU	71.1	78.0
Graduate level reasoning GPQA	49.0	65.0
Code HumanEval	86.6	93.7
Math problem-solving MATH	83.1	78.3
Reasoning DROP, F1	-	88.3

Benchmark results indicate that Claude 3.5 Sonnet outperforms Qwen 2.5 in undergraduate-level knowledge (MMLU), graduate-level reasoning (GPQA), and code generation (HumanEval), while Qwen 2.5 holds an advantage in math problem-solving (MATH). No DROP score is listed for Qwen 2.5, so general reasoning cannot be compared directly.

In this article, we’ll focus on each model's performance in coding tasks to help determine which is better suited for specific use cases.

Practical tests

1. Algorithms

Convert a non-negative integer num to its English words representation

Example:
Input: num = 123
Output: "One Hundred Twenty Three"

Qwen2.5-72B output (first listing) and Claude 3.5 output (second listing):

def number_to_words(num):
    if num == 0:
        return "Zero"
    
    def one(num):
        switcher = {
            1: 'One', 2: 'Two', 3: 'Three', 4: 'Four', 5: 'Five',
            6: 'Six', 7: 'Seven', 8: 'Eight', 9: 'Nine'
        }
        return switcher.get(num)
    
    def two_less_20(num):
        switcher = {
            10: 'Ten', 11: 'Eleven', 12: 'Twelve', 13: 'Thirteen', 14: 'Fourteen',
            15: 'Fifteen', 16: 'Sixteen', 17: 'Seventeen', 18: 'Eighteen', 19: 'Nineteen'
        }
        return switcher.get(num)
    
    def ten(num):
        switcher = {
            2: 'Twenty', 3: 'Thirty', 4: 'Forty', 5: 'Fifty',
            6: 'Sixty', 7: 'Seventy', 8: 'Eighty', 9: 'Ninety'
        }
        return switcher.get(num)
    
    def two(num):
        if not num:
            return ''
        elif num < 10:
            return one(num)
        elif num < 20:
            return two_less_20(num)
        else:
            tenner = num // 10
            rest = num - tenner * 10
            return ten(tenner) + (' ' + one(rest) if rest else '')
    
    def three(num):
        hundred = num // 100
        rest = num - hundred * 100
        if hundred and rest:
            return one(hundred) + ' Hundred ' + two(rest)
        elif hundred and not rest:
            return one(hundred) + ' Hundred'
        else:
            return two(rest)
    
    billion = num // 1000000000
    million = (num - billion * 1000000000) // 1000000
    thousand = (num - billion * 1000000000 - million * 1000000) // 1000
    remainder = num - billion * 1000000000 - million * 1000000 - thousand * 1000
    
    result = ''
    if billion:
        result += three(billion) + ' Billion'
    if million:
        result += ' ' if result else ''
        result += three(million) + ' Million'
    if thousand:
        result += ' ' if result else ''
        result += three(thousand) + ' Thousand'
    if remainder:
        result += ' ' if result else ''
        result += three(remainder)
    
    return result

def numberToWords(num: int) -> str:
    # Handle zero case separately
    if num == 0:
        return "Zero"
    
    # Helper function to convert numbers less than 1000
    def helper(n):
        under20 = ["", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine",
                   "Ten", "Eleven", "Twelve", "Thirteen", "Fourteen", "Fifteen", "Sixteen",
                   "Seventeen", "Eighteen", "Nineteen"]
        tens = ["", "", "Twenty", "Thirty", "Forty", "Fifty", "Sixty", "Seventy", "Eighty", "Ninety"]
        
        if n == 0:
            return []
        elif n < 20:
            return [under20[n]]
        elif n < 100:
            return [tens[n//10]] + helper(n%10)
        else:
            return [under20[n//100], "Hundred"] + helper(n%100)
    
    # Split number into groups of three digits
    billion = num // 1000000000
    million = (num % 1000000000) // 1000000
    thousand = (num % 1000000) // 1000
    rest = num % 1000
    
    result = []
    if billion:
        result.extend(helper(billion))
        result.append("Billion")
    if million:
        result.extend(helper(million))
        result.append("Million")
    if thousand:
        result.extend(helper(thousand))
        result.append("Thousand")
    if rest:
        result.extend(helper(rest))
        
    return " ".join(result)

Qwen2.5-72B: Correct Answer: ✅ (5/5)
Qwen provided the best possible solution.

Claude 3.5: Correct Answer: ✅ (4/5)
Claude provided a solid solution, but it falls short compared to Qwen's.
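Both listings can be checked against the same edge cases. The sketch below is neither model's output: `to_words` is a compact stand-in written for illustration, and the values in `CASES` are assumptions chosen to cover zero, teens, exact hundreds, and skipped groups. Either model's function can be passed to `check` in its place.

```python
UNDER_20 = ["", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight",
            "Nine", "Ten", "Eleven", "Twelve", "Thirteen", "Fourteen", "Fifteen",
            "Sixteen", "Seventeen", "Eighteen", "Nineteen"]
TENS = ["", "", "Twenty", "Thirty", "Forty", "Fifty", "Sixty", "Seventy",
        "Eighty", "Ninety"]
SCALES = [(10**9, "Billion"), (10**6, "Million"), (10**3, "Thousand"), (1, "")]


def to_words(num: int) -> str:
    """Hypothetical reference implementation used only to drive the checks."""
    if num == 0:
        return "Zero"
    words = []
    for value, name in SCALES:
        group, num = divmod(num, value)  # group is the 0..999 chunk for this scale
        if group == 0:
            continue
        hundreds, rest = divmod(group, 100)
        if hundreds:
            words += [UNDER_20[hundreds], "Hundred"]
        if rest >= 20:
            words.append(TENS[rest // 10])
            rest %= 10
        if rest:
            words.append(UNDER_20[rest])
        if name:
            words.append(name)
    return " ".join(words)


CASES = {
    0: "Zero",
    13: "Thirteen",
    123: "One Hundred Twenty Three",
    500: "Five Hundred",
    1000010: "One Million Ten",
    1234567891: "One Billion Two Hundred Thirty Four Million Five Hundred "
                "Sixty Seven Thousand Eight Hundred Ninety One",
}


def check(func) -> None:
    """Raise AssertionError on the first case where func gives a wrong answer."""
    for number, expected in CASES.items():
        assert func(number) == expected, (number, func(number))


check(to_words)
```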

2. Debugging

Find all the bugs that are in this code 
and that can be encountered and write a fully working code

products = {"Bread": 2.0, "Tea": 2.5, "Cookie": 5.2}
for item in products.keys()
     income = 0
      qty = input(f"How many {item}s have you sold? ")
     income = income + qty * products[item]
print(f"\nThe income today was ${income:0.2f}")

Qwen2.5-72B output (first listing) and Claude 3.5 output (second listing):

products = {"Bread": 2.0, "Tea": 2.5, "Cookie": 5.2}
income = 0

for item in products.keys():
    qty = input(f"How many {item}s have you sold? ")
    try:
        qty = int(qty)  # Convert input to integer
        income += qty * products[item]  # Accumulate income
    except ValueError:
        print(f"Invalid input for {item}. Please enter a valid number.")
        continue

print(f"The income today was ${income:0.2f}")

products = {"Bread": 2.0, "Tea": 2.5, "Cookie": 5.2}
income = 0

for item in products.keys():
    while True:
        try:
            qty = float(input(f"How many {item}s have you sold? "))
            if qty >= 0:  # Check for negative numbers
                income = income + qty * products[item]
                break
            else:
                print("Please enter a positive number.")
        except ValueError:
            print("Please enter a valid number.")

print(f"The income today was ${income:.2f}")

Qwen2.5-72B: Correct Answer: ✅ (4/5)
Advantages of Qwen:

Qwen's code is simpler and may be sufficient where basic error handling is acceptable. However, this simplicity comes at the cost of robustness: an invalid entry is reported and then skipped rather than requested again.

Claude 3.5: Correct Answer: ✅ (4/5)
Advantages of Claude:

Uses a while True loop to repeatedly prompt the user until a valid, non-negative number is entered. This ensures the program doesn't continue with invalid input.
Checks that the input is a non-negative number (qty >= 0), which prevents the entry of negative quantities that don't make sense in the context of selling products.
Provides more specific feedback if the input is invalid or negative, guiding the user to enter a correct value. This improves the user experience by clearly stating what went wrong.
Converts the input to a float rather than an int, allowing for fractional quantities. This is useful if selling items in units smaller than 1 (e.g., 1.5 kilograms of a product), making the program more flexible.
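The validation both models added can be separated from the `input()` call so that it can be tested without a terminal. In this sketch `parse_quantity` and `total_income` are hypothetical helper names, not part of either model's answer; the rule that quantities must be non-negative follows Claude's version.

```python
PRODUCTS = {"Bread": 2.0, "Tea": 2.5, "Cookie": 5.2}


def parse_quantity(text: str) -> float:
    """Convert raw user input to a non-negative quantity or raise ValueError."""
    qty = float(text)  # raises ValueError for non-numeric text
    if qty < 0:
        raise ValueError("quantity must not be negative")
    return qty


def total_income(raw_answers: dict) -> float:
    """Sum price * quantity over every product, validating each answer."""
    return sum(PRODUCTS[item] * parse_quantity(raw_answers[item])
               for item in PRODUCTS)


# Example with answers that would normally come from input()
answers = {"Bread": "3", "Tea": "2", "Cookie": "1"}
print(f"The income today was ${total_income(answers):0.2f}")
```

An interactive loop can then call `parse_quantity` inside a `try` block and decide for itself whether to skip the item, as Qwen does, or ask again, as Claude does.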

3. Understanding of libraries and frameworks

Generate synthetic data for points in 2D space 
and classify them using a machine learning model.

Steps:

  1. Using numpy, generate 100 points for each class with 
different means and standard deviations for x and y coordinates

  2. Using pandas DataFrame, create a dataset with x, y, 
and class columns, then shuffle it

  3. Visualize the points using matplotlib

  4. Using scikit-learn, train the model

  5. Calculate evaluation metrics such as precision, 
accuracy, recall, and F1-score

Qwen2.5-72B: Correct Answer: ✅ (4/5)

Used Logistic Regression for binary classification.
Created two classes with Gaussian distributions.
Computed metrics such as accuracy, precision, recall, and F1-score.
Constructed a simple scatterplot to visualize the distribution of points.

Claude 3.5: Correct Answer: ✅ (5/5)

Used k-nearest neighbors (k-NN) for classification.
Created two classes with Gaussian distributions.
Scaled features using StandardScaler, which is important for k-NN to work well since it relies on distance calculations.
Used confusion matrix and classification report to evaluate the model.
Included visualizations of the confusion matrix using seaborn, which provides a better understanding of the model performance.

Both models produced code that meets the task, but Claude makes fuller use of the libraries, adding feature scaling, a confusion matrix, and a classification report.
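The metrics requested in step 5 all derive from the four cells of a binary confusion matrix. The sketch below computes them in plain Python so the definitions are visible; `binary_metrics` is a hypothetical helper and the label lists are made-up illustration data, not results from either model's run. In a real project the functions in scikit-learn's `sklearn.metrics` module would normally be used instead.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = len(pairs) - tp - fp - fn                               # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 3 true positives, 1 false negative, 3 true negatives, 1 false positive
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```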

4. Refactoring and optimization

Optimize and refactor this code:

async def __get_query_sort_items(self, query, type_sort):
    try:
        if type_sort == 'desc_price':
            query = query.order_by(self.model.price.desc())
        elif type_sort == 'asc_price':
            query = query.order_by(self.model.price)
        elif type_sort == 'best':
            query = query.order_by(func.avg(Reviews.rate).desc())
        elif type_sort == 'popular':
            query = query.order_by(func.count(purchase_item_user.c.item_id).desc())
    except (SQLAlchemyError, Exception) as e:
        msg = 'Database' if isinstance(e, SQLAlchemyError) else 'Unknown'
        msg += ' Exc. Cannot get query sort items'
        extra = {
            'type_sort': type_sort,
        }
        logger.error(msg, extra=extra, exc_info=True)
    return query

Qwen2.5-72B output (first listing) and Claude 3.5 output (second listing):

from sqlalchemy import func
from sqlalchemy.exc import SQLAlchemyError

async def __get_query_sort_items(self, query, type_sort):
    try:
        sort_map = {
            'desc_price': self.model.price.desc(),
            'asc_price': self.model.price,
            'best': func.avg(Reviews.rate).desc(),
            'popular': func.count(purchase_item_user.c.item_id).desc()
        }
        
        if type_sort in sort_map:
            query = query.order_by(sort_map[type_sort])
        else:
            logger.warning(f"Unknown sort type: {type_sort}")

    except SQLAlchemyError as e:
        logger.error('Database Exc. Cannot get query sort items', extra={'type_sort': type_sort}, exc_info=True)
    except Exception as e:
        logger.error('Unknown Exc. Cannot get query sort items', extra={'type_sort': type_sort}, exc_info=True)

    return query

from enum import Enum
from typing import Optional
from sqlalchemy import func
from sqlalchemy.exc import SQLAlchemyError

class SortType(Enum):
    DESC_PRICE = 'desc_price'
    ASC_PRICE = 'asc_price'
    BEST = 'best'
    POPULAR = 'popular'

class QuerySorter:
    SORT_MAPPINGS = {
        SortType.DESC_PRICE: lambda model: model.price.desc(),
        SortType.ASC_PRICE: lambda model: model.price,
        SortType.BEST: lambda _: func.avg(Reviews.rate).desc(),
        SortType.POPULAR: lambda _: func.count(purchase_item_user.c.item_id).desc()
    }

    @staticmethod
    async def sort_query(query, type_sort: str, model) -> Optional[Query]:
        """
        Sort a query based on the specified sort type.
        
        Args:
            query: The base query to sort
            type_sort: The type of sorting to apply
            model: The model class containing the sorting fields
            
        Returns:
            Sorted query or None if an error occurs
        """
        try:
            sort_type = SortType(type_sort)
            sort_function = QuerySorter.SORT_MAPPINGS.get(sort_type)
            
            if sort_function:
                return query.order_by(sort_function(model))
            return query

        except (SQLAlchemyError, ValueError) as e:
            error_type = 'Database' if isinstance(e, SQLAlchemyError) else 'Invalid sort type'
            logger.error(
                f'{error_type} error: Cannot sort items',
                extra={'type_sort': type_sort},
                exc_info=True
            )
            return query  # Return original query on error

Qwen2.5-72B: Correct Answer: ✅ (4/5)
Changes and Improvements:

Dictionary Mapping: Used a dictionary `sort_map` to map `type_sort` values to their corresponding sorting expressions. This makes the code more readable and maintainable.
Error Handling: Simplified the error handling by separating the `SQLAlchemyError` and `Exception` cases.
Logging: Added a warning log for unknown sort types to provide feedback if an invalid `type_sort` is provided.
Code Readability: Reduced the number of `if-elif` statements, making the code cleaner and easier to understand.

Claude 3.5: Correct Answer: ✅ (5/5)
Key improvements made:

Enum Class: Added `SortType` enum for better type safety and maintainability.
Mapping Dictionary: Created a `SORT_MAPPINGS` dictionary to map sort types to their corresponding sort functions, eliminating the need for multiple if-elif statements.
Type Hints: Added type hints for better code documentation and IDE support.
Better Error Handling:
- Separated database errors from invalid sort type errors
- More specific error messages
- Returns the original query instead of potentially undefined behavior
Documentation: Added docstring explaining the method's purpose and parameters
Static Method: Made the method static since it doesn't use instance attributes

See how Claude handled another refactoring compared to GPT-o1 mini: GPT-o1 mini vs Claude 3.5 Sonnet

Claude provided better code: its version adds several improvements that enhance maintainability, readability, and error handling.
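Both refactorings replace the if-elif chain with a lookup table. The pattern can be tried without a database; in the sketch below plain dictionaries stand in for rows, and `SORT_KEYS` and `sort_items` are hypothetical names. It illustrates the dispatch idea only and is not SQLAlchemy code.

```python
from enum import Enum


class SortType(Enum):
    DESC_PRICE = "desc_price"
    ASC_PRICE = "asc_price"


# Each sort type maps to (key function, reverse flag) instead of an if-elif branch.
SORT_KEYS = {
    SortType.DESC_PRICE: (lambda item: item["price"], True),
    SortType.ASC_PRICE: (lambda item: item["price"], False),
}


def sort_items(items, type_sort: str):
    """Return items sorted by type_sort, or in their original order if it is unknown."""
    try:
        key, reverse = SORT_KEYS[SortType(type_sort)]
    except ValueError:  # SortType() rejects strings that are not enum values
        return list(items)
    return sorted(items, key=key, reverse=reverse)


items = [{"name": "tea", "price": 2.5}, {"name": "bread", "price": 2.0}]
print([item["name"] for item in sort_items(items, "asc_price")])
```

Adding a new sort order then means adding one enum member and one dictionary entry, with no change to `sort_items` itself.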

5. Fullstack web application

Develop a fullstack web application where users can play a word-guessing game. 
The game involves guessing a word letter by letter, similar to "Hangman." 
The application should have a React-based frontend and a FastAPI backend, 
ensuring a smooth and interactive user experience.

Frontend Requirements (React):
  - A simple and intuitive interface
  - A display showing the number of attempts remaining
  - A button to start a new game.
  - Track the game state (ongoing, won, lost) and display appropriate messages.
  - Reset the game state when starting a new game.
  
API Endpoints:
  - Start a new game
  - Check a letter guess
  - Get game state
  
Game Logic on the Server Side:
  - Maintain the game state for each session, 
  including the randomly chosen word and guesses.
  - Ensure game state consistency across API calls.

Qwen2.5-72B: Correct Answer: ❌ (2/5)

Unfortunately, the model was unable to produce working code, even after multiple requests to fix the non-functional parts. An error consistently appears when clicking on any of the letters, which Qwen was unable to resolve.

Claude 3.5: Correct Answer: ✅ (5/5)

Claude successfully wrote working code on the first attempt.
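The server-side requirements in the prompt come down to a small state machine per session. The sketch below is not either model's code: `Game`, `MAX_ATTEMPTS`, the word list, and the in-memory `GAMES` dictionary are all assumptions made for illustration. A FastAPI route for each endpoint would only need to call these functions and return the resulting state.

```python
import random
import uuid

WORDS = ["python", "react", "fastapi"]  # placeholder word list
MAX_ATTEMPTS = 6                         # assumed limit, not given in the prompt


class Game:
    def __init__(self, word: str):
        self.word = word
        self.guessed = set()
        self.attempts_left = MAX_ATTEMPTS

    @property
    def status(self) -> str:
        if all(ch in self.guessed for ch in self.word):
            return "won"
        return "lost" if self.attempts_left == 0 else "ongoing"

    def guess(self, letter: str) -> str:
        """Apply one guess; repeated letters and finished games change nothing."""
        letter = letter.lower()
        if self.status == "ongoing" and letter not in self.guessed:
            self.guessed.add(letter)
            if letter not in self.word:
                self.attempts_left -= 1
        return self.status

    def masked(self) -> str:
        """Return the word with unguessed letters hidden, e.g. 'r _ _ _ _'."""
        return " ".join(ch if ch in self.guessed else "_" for ch in self.word)


GAMES = {}  # session id -> Game; a real service would need persistent storage


def start_game() -> str:
    """Create a game with a random word and return its session id."""
    session_id = str(uuid.uuid4())
    GAMES[session_id] = Game(random.choice(WORDS))
    return session_id
```

Keeping the chosen word on the server and returning only `masked()` to the client is what lets the state stay consistent across API calls.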

6. Writing documentation

We gave both models the code written by Claude in the previous test and asked each to create documentation for it.

Write detailed documentation for my application:

main.py
...

App.js
...

Qwen2.5-72B output (first listing) and Claude 3.5 output (second listing):

# Application Documentation

## Overview
...
## Backend (FastAPI)

### File: `main.py`

#### Dependencies
- `FastAPI`: A modern, fast (high-performance)
 web framework for building APIs with 
 Python 3.7+ based on standard Python type hints.
...
### Code Explanation
...
## Frontend (React)

### File: `App.js`
...
### Code Explanation
...
## Running the Application

### Backend
...
### Frontend
...
## Conclusion
...

# Word Guessing Game Documentation
## Overview
...
## Backend (FastAPI)

### Setup and Configuration
...
### Models
...
### API Endpoints
...
### Game Logic
...
## Frontend (React)

### Components
...
### State Management
...
### Key Functions
...
### User Interface Elements
...
### Error Handling
...
## Setup Instructions

### Backend
...
### Frontend
...
## Technical Requirements
- Python 3.6+
...
## Notes
...

Qwen2.5-72B: Correct Answer: ✅ (5/5)
The documentation provided adheres fully to the instructions and includes a detailed description of each point.

Claude 3.5: Correct Answer: ✅ (4/5)
Claude's documentation is good, but it lists Python 3.6+ as the requirement, whereas the minimum version needed to run this app is 3.7.

7. 3D game

Create a first-person game where the player stands still 
and shoots at targets that appear one at a time.

- Use the mouse to control the player.
- The player earns points for each successful hit.
- If he misses, he loses one of his three lives.
- For every 15 successful hits, the player gets an extra life.
- Use Python and the Ursina library

Qwen2.5-72B: Correct Answer: ✅ (3/5)
A good result, but it was achieved only after several rounds of improvements.

Disadvantages:

the game immediately closes after losing all lives
text is difficult to see
targets appear in the same places

Claude 3.5: Correct Answer: ✅ (4/5)
A good result, but it was achieved only after several rounds of improvements.

Disadvantages:

difficult to navigate in space

See how Claude 3.5 Sonnet performed in other games compared to ChatGPT 4o: Claude Sonnet 3.5 VS ChatGPT 4o
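The scoring rules in the prompt are independent of the rendering library and can be checked on their own. The sketch below models them in plain Python; `ScoreKeeper` is a hypothetical name, and counting one point per hit is an assumption since the prompt does not give a point value. It does not use Ursina.

```python
class ScoreKeeper:
    """Tracks hits and lives: a miss costs a life, every 15 hits adds one."""

    def __init__(self, lives: int = 3, hits_per_extra_life: int = 15):
        self.lives = lives
        self.hits = 0  # doubles as the score (assumed: one point per hit)
        self.hits_per_extra_life = hits_per_extra_life

    @property
    def game_over(self) -> bool:
        return self.lives <= 0

    def register_shot(self, hit: bool) -> None:
        if self.game_over:
            return  # ignore input after the last life is lost
        if hit:
            self.hits += 1
            if self.hits % self.hits_per_extra_life == 0:
                self.lives += 1
        else:
            self.lives -= 1


keeper = ScoreKeeper()
for _ in range(15):
    keeper.register_shot(hit=True)   # the 15th hit grants an extra life
keeper.register_shot(hit=False)      # one miss removes it again
print(keeper.hits, keeper.lives)
```

Exposing `game_over` as a flag lets the game show an end screen instead of closing as soon as the last life is lost.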

Pricing

Price per 1K tokens (USD)	Qwen2.5-72B	Claude 3.5 Sonnet
Input price	$0.00126	$0.00315
Output price	$0.00126	$0.01575
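As a worked example of the table, the cost of a single request is the token counts multiplied by the per-1K prices. The request size used below (2,000 input tokens and 500 output tokens) is an arbitrary assumption chosen for illustration, and `request_cost` is a hypothetical helper name.

```python
# Prices in USD per 1K tokens, copied from the pricing table.
PRICES = {
    "Qwen2.5-72B": {"input": 0.00126, "output": 0.00126},
    "Claude 3.5 Sonnet": {"input": 0.00315, "output": 0.01575},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one request for the given model."""
    price = PRICES[model]
    return (input_tokens / 1000 * price["input"]
            + output_tokens / 1000 * price["output"])


for model in PRICES:
    print(f"{model}: ${request_cost(model, 2000, 500):.5f}")
```

Because Claude's output price is 12.5 times Qwen's, the gap widens as responses get longer.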

Conclusion

Strengths and Weaknesses of Each Model

Qwen 2.5

Strengths:

Algorithms: Excels in providing optimized solutions for algorithmic problems, often delivering the best possible answers.
Documentation: Delivers comprehensive, well-structured documentation.

Weaknesses:

Dynamic Applications: Struggles with complex, interactive projects like full-stack applications and 3D games, occasionally unable to resolve functionality issues.
Advanced Error Handling: Simplified debugging lacks the robustness of Claude's approach, making it less suited for complex, user-centered error handling.
Library Utilization: Does not demonstrate the same level of skillful library use as Claude, particularly in machine learning and data science applications.

Claude 3.5 Sonnet

Strengths:

Dynamic Coding Applications: Excels in handling complex projects, like full-stack applications and 3D games. The full-stack application worked on the first attempt, while the 3D game needed several rounds of improvements.
Debugging and Error Handling: Employs advanced error handling strategies that offer more robust solutions, addressing edge cases and improving user experience.
Library Mastery: Demonstrates skillful use of libraries and frameworks, especially in machine learning, including visualizations and feature scaling.
Refactoring and Optimization: Delivers more maintainable and readable code with enums, type hints, and static methods for improved structure and reliability.

Weaknesses:

Algorithmic Challenges: Occasionally provides less optimized solutions for algorithmic and math-focused tasks.
Documentation Details: Although it creates solid documentation, it sometimes gets minor details wrong, such as version requirements, which affects usability.

Best Use Cases

When to Use Qwen 2.5:

Algorithm Development and Math Tasks: Ideal for tasks that require computational precision, algorithm optimization, and problem-solving in structured scenarios.
Basic Debugging and Documentation Needs: Suitable for projects that prioritize straightforward debugging and require clear documentation without extensive setup details.
Optimized Solutions: Well-suited for projects where code efficiency and clarity are more critical than interactive features.

When to Use Claude 3.5 Sonnet:

Dynamic Projects and Full-Stack Development: Ideal for complex, user-centered applications such as web or 3D game development.
Advanced Debugging and Error Handling: The best choice for projects needing robust handling of user input and error prevention.
Library and Framework Utilization: Well suited to tasks involving data processing or machine learning, where skillful use of libraries such as scikit-learn can significantly enhance output quality.
